Version issues for training

automl / TabPFN

Official implementation of the TabPFN paper (https://arxiv.org/abs/2207.01848) and the tabpfn package.

http://priorlabs.ai

Apache License 2.0


Version issues for training #25

Closed amueller closed 1 year ago

amueller commented 1 year ago

I've tried running the PriorFittingCustomPrior.ipynb notebook and ran into some difficulties. It seems lightgbm is getting imported, but it's not part of the pyproject.toml. Also, seaborn '0.12.2' raises an error when plotting:

ValueError: The following variable cannot be assigned with wide-form data: `hue`

It would be awesome to get a conda environment with a working config. I also didn't see the python3.7 requirement at first, since it's only mentioned in the requirements.txt.

amueller commented 1 year ago

Downgrading to seaborn 0.11 yields:

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I had changed device to 'cuda'; changing it to 'cpu' makes it work.
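For readers hitting the same TypeError while keeping the device on 'cuda': the usual pattern is to copy the tensor to host memory before handing it to NumPy or a plotting library, i.e. `tensor.detach().cpu().numpy()`. The sketch below is not code from the notebook. `to_host` and `FakeCudaTensor` are hypothetical names, and the fake tensor class only stands in for a `torch.Tensor` so that the snippet runs without a GPU or a PyTorch install.

```python
class FakeCudaTensor:
    """Hypothetical stand-in for a torch.Tensor on a GPU (illustration only)."""

    def __init__(self, values, device="cuda:0"):
        self.values = list(values)
        self.device = device

    def detach(self):
        # A real tensor returns a view without gradient tracking.
        return self

    def cpu(self):
        # A real tensor returns a copy in host memory.
        return FakeCudaTensor(self.values, device="cpu")

    def numpy(self):
        if self.device != "cpu":
            raise TypeError(
                f"can't convert {self.device} device type tensor to numpy."
            )
        return self.values  # a real tensor would return a numpy.ndarray


def to_host(value):
    """Copy tensor-like objects to the CPU; pass everything else through."""
    # Duck-typed check: torch.Tensor provides detach(), cpu() and numpy().
    if all(hasattr(value, name) for name in ("detach", "cpu", "numpy")):
        return value.detach().cpu().numpy()
    return value


scores = FakeCudaTensor([0.1, 0.9])
host_scores = to_host(scores)  # safe to pass on to plotting code
```

With real PyTorch tensors the same helper would apply unchanged; wrapping the values just before plotting avoids having to switch the whole run to 'cpu'.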

amueller commented 1 year ago

Another question: is there a way to do multi-GPU training using the scripts from the notebook you provide? I don't see any code to spawn workers; it looks like init_dist requires using torchrun?

SamuelGabriel commented 1 year ago

Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)

Yes, there is a way to do multi-GPU training. We used submitit to run all our experiments, since we have a SLURM cluster. Our parallelization is heavily inspired by this repo: https://github.com/facebookresearch/dino

If you have a SLURM cluster: if you make the train call with executor.submit, you can simply update the parameters of the executor to schedule a multi-GPU job:

    executor.update_parameters(
        gpus_per_node=8,
        tasks_per_node=8,  # one task per GPU
    )

If not: launching with torchrun should also work out of the box, as I wrote some code to handle it, but I am not 100% sure, as we have not used this in a while.

The important code is here: https://github.com/automl/TabPFN/blob/f02c093c101f80cb4f462f834c22456bbd3c1e84/tabpfn/utils.py#L238

The code does not support multi-node training, though.
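To illustrate what "handling torchrun" typically involves: torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker it spawns, while SLURM exports SLURM_PROCID, SLURM_LOCALID and SLURM_NTASKS. The sketch below is not the linked TabPFN code; `DistInfo` and `detect_dist_env` are hypothetical names, and it only shows the kind of environment check that DINO-style init code performs before setting up the process group.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistInfo(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index on this node, usually the GPU to use
    world_size: int  # total number of processes
    launcher: str    # which launcher was detected


def detect_dist_env(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read the worker layout from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        # Variables set by torchrun for each spawned worker.
        return DistInfo(
            int(env["RANK"]),
            int(env.get("LOCAL_RANK", 0)),
            int(env["WORLD_SIZE"]),
            "torchrun",
        )
    if "SLURM_PROCID" in env:
        # Variables set by SLURM for each task of a job step.
        return DistInfo(
            int(env["SLURM_PROCID"]),
            int(env.get("SLURM_LOCALID", 0)),
            int(env.get("SLURM_NTASKS", 1)),
            "slurm",
        )
    # No launcher detected: fall back to a single process.
    return DistInfo(0, 0, 1, "single")


info = detect_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
```

On a single machine with 8 GPUs, a launch such as `torchrun --nproc_per_node=8 train.py` (where `train.py` is a placeholder script name) would start eight workers, each seeing its own RANK and LOCAL_RANK.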

amueller commented 1 year ago

> Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)

Oh, I thought maybe the requirements file was consumed by the setup.py, as the installation instructions only mention the pip install. It would be great to have end-to-end instructions for reproducing the training.

Thanks for the pointer to submitit, I'll check out how it works. I don't have a SLURM cluster, I have a cloud ;) I'm currently using torchrun.

SamuelGabriel commented 1 year ago

Did you get this far just by installing from pip? I did not expect this to work, tbh, and thought one needs to install from the requirements to train. I will add the requirement to the setup, thanks! :)

amueller commented 1 year ago

Oh yeah I didn't touch the requirements.txt, it wasn't mentioned anywhere.

I think adding requirements.txt to setup is a bad habit, but many people do it. Having maybe one section for installing for using the model and one for reproducing the training would be great.

SamuelGabriel commented 1 year ago

Yeah, I won’t add the full requirements. No worries :) I will just add seaborn <=0.12


SamuelGabriel commented 1 year ago

No, <=0.11
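Until such a pin is in place, a notebook could guard against the plotting error with a runtime check. This is only a sketch: the helper names are hypothetical, and it assumes the intent of the pin is "any 0.11.x release or older", so it compares major and minor version only.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.12.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def seaborn_is_supported(version: str, newest_ok=(0, 11)) -> bool:
    """True if the version is at most the newest release line known to work."""
    return major_minor(version) <= newest_ok


def installed_seaborn_version() -> Optional[str]:
    """Return the installed seaborn version, or None if it is not installed."""
    try:
        return metadata.version("seaborn")
    except metadata.PackageNotFoundError:
        return None


found = installed_seaborn_version()
if found is not None and not seaborn_is_supported(found):
    print(f"seaborn {found} is newer than the 0.11 line; plotting may fail")
```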
