conda-forge / autogluon-feedstock

A conda-smithy repository for autogluon.
BSD 3-Clause "New" or "Revised" License

Debug autogluon pytorch cpu/cuda version #27

Open giswqs opened 11 months ago

giswqs commented 11 months ago

Comment:

@h-vetinari @PertuyF @dhirschfeld @ngam @arturdaraujo Thank you all for your help with the autogluon conda-forge packages earlier this year. We recently ran into a strange issue with the autogluon conda-forge installation. @suzhoum and I have spent a few days debugging the issue but still could not figure it out. The issue is that if we explicitly add pytorch to the mamba installation command, it installs the pytorch cuda version properly. Without pytorch in the installation command, it only installs the cpu version. We would greatly appreciate your advice on this issue.

Create a conda env

conda create -n ag python=3.11
conda activate ag
conda install -c conda-forge mamba

This installs the pytorch cpu version

mamba install -c conda-forge autogluon


This installs the pytorch cuda version

mamba install -c conda-forge autogluon pytorch

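One quick way to see which variant the solver picked is to inspect the pytorch build string (e.g. via mamba list pytorch). A rough sketch of the distinction, with illustrative build strings following the conda-forge cpu_/cudaXYZ_ naming convention (the hashes here are made up):

```python
def pytorch_variant(build_string: str) -> str:
    """Classify a conda-forge pytorch build string as 'cpu' or 'cudaXYZ'.

    conda-forge builds are prefixed with the variant, e.g.
    'cpu_py311h...' or 'cuda120_py311h...'.
    """
    if build_string.startswith("cpu"):
        return "cpu"
    if build_string.startswith("cuda"):
        # e.g. 'cuda118_py311h...' -> 'cuda118'
        return build_string.split("_", 1)[0]
    return "unknown"

print(pytorch_variant("cpu_py311h410fd25_0"))      # cpu
print(pytorch_variant("cuda120_py311h13fee9e_0"))  # cuda120
```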

ngam commented 11 months ago

I am not sure how we could resolve this. In general, the recommendation is to be explicit if one wants cuda. Tagging @hmaarrfk, who may have additional insights on the latest tips and tricks

dhirschfeld commented 11 months ago

I guess that because autogluon doesn't itself depend on cuda, the solver picks the CPU variant by default. I'm not sure if there's a way to prefer the GPU variant without explicitly specifying it in some way. Maybe there is - I'm not at all familiar with this aspect of mamba

giswqs commented 11 months ago

This command can install the pytorch cuda version properly:

mamba install -c conda-forge pytorch

Since autogluon depends on pytorch, I would expect this command to install the pytorch cuda version properly, but it doesn't.

mamba install -c conda-forge autogluon

This command can install the pytorch cuda version properly, but I just cannot understand why. Since autogluon depends on pytorch, installing autogluon pytorch and installing autogluon alone should make no difference, but that is not the case here.

mamba install -c conda-forge autogluon pytorch

dhirschfeld commented 11 months ago

I'm not sure myself, maybe @wolfv might know whether this is expected behaviour with mamba

ngam commented 11 months ago

This is likely a corner case of how the __cuda arch spec was designed; @wolfv is definitely the one who knows all the details (about what is potentially tripping up the solver here) 👀

@giswqs for completeness, and if you don't mind, could you test the behavior with conda and micromamba? Or I can test it if I manage to get an allocation before you do
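For context on the __cuda arch spec mentioned above: conda/mamba expose the detected driver capability as a virtual package named __cuda, and whether a cudaXY build is even a candidate depends on matching against it. A simplified, hypothetical model of that check (the real repodata constraints and matching semantics are more involved):

```python
def cuda_build_installable(build_cuda, detected_cuda):
    """Toy model of virtual-package matching: a cuda X.Y build
    requires __cuda >= X.Y.

    Versions are 'major.minor' strings; detected_cuda=None models
    a machine with no GPU driver detected.
    """
    if detected_cuda is None:
        return False
    need = tuple(int(x) for x in build_cuda.split("."))
    have = tuple(int(x) for x in detected_cuda.split("."))
    return have >= need

print(cuda_build_installable("12.0", "12.2"))  # True
print(cuda_build_installable("12.0", "11.8"))  # False
```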

giswqs commented 11 months ago

@ngam Thanks for the suggestion. I will give it a try tomorrow. It is midnight here now. Off to bed shortly.

ngam commented 11 months ago

Okay, I checked on a cluster with GPUs. I believe the problem here is that cuda120 takes precedence over cuda118, thus xgboost in your env will get cuda120, but the cuda120 builds of pytorch conflict with autogluon's pins, so you get the cpu version instead. All of this is due to something in autogluon's dependencies. Here's a readout:

micromamba create -n test_ag_2_mic2 autogluon pytorch=*=*cuda120*
conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ autogluon is installable with the potential options
    │  ├─ autogluon 0.6.2 would require
    │  │  └─ autogluon.timeseries 0.6.2 , which requires
    │  │     └─ pytorch <1.13,>=1.9 , which can be installed;
    │  ├─ autogluon [0.7.0|0.8.0|0.8.1] would require
    │  │  └─ autogluon.timeseries [0.7.0 |0.8.0 |0.8.1 ], which requires
    │  │     └─ pytorch >=1.9,<1.14 , which can be installed;
    │  ├─ autogluon 0.8.2 would require
    │  │  ├─ autogluon.multimodal 0.8.2  with the potential options
    │  │  │  ├─ autogluon.multimodal 0.8.2 would require
    │  │  │  │  └─ pytorch >=1.9,<1.14 , which can be installed;
    │  │  │  └─ autogluon.multimodal 0.8.2 would require
    │  │  │     ├─ pytorch >=2.0,<2.1  with the potential options
    │  │  │     │  ├─ pytorch 2.0.0 conflicts with any installable versions previously reported;
    │  │  │     │  ├─ pytorch 2.0.0, which can be installed;
    │  │  │     │  ├─ pytorch 2.0.0, which can be installed;
    │  │  │     │  └─ pytorch 2.0.0, which can be installed;
    │  │  │     └─ torchvision >=0.15.0,<0.16.0  with the potential options
    │  │  │        ├─ torchvision [0.15.1|0.15.2] would require
    │  │  │        │  └─ pytorch * cpu*, which can be installed;
    │  │  │        ├─ torchvision [0.15.1|0.15.2] would require
    │  │  │        │  └─ pytorch [2.0 cuda112*|2.0.* cuda112*], which can be installed;
    │  │  │        └─ torchvision 0.15.2 would require
    │  │  │           └─ pytorch 2.0 cpu*, which can be installed;
    │  │  ├─ autogluon.tabular 0.8.2  with the potential options
    │  │  │  ├─ autogluon.tabular 0.8.2 would require
    │  │  │  │  └─ pytorch >=1.9,<1.14 , which can be installed;
    │  │  │  └─ autogluon.tabular 0.8.2 would require
    │  │  │     └─ pytorch >=1.13,<2.1  with the potential options
    │  │  │        ├─ pytorch 2.0.0 conflicts with any installable versions previously reported;
    │  │  │        ├─ pytorch [1.13.0|1.13.1], which can be installed;
    │  │  │        ├─ pytorch 2.0.0, which can be installed;
    │  │  │        ├─ pytorch 2.0.0, which can be installed;
    │  │  │        └─ pytorch 2.0.0, which can be installed;
    │  │  └─ autogluon.timeseries 0.8.2  with the potential options
    │  │     ├─ autogluon.timeseries [0.7.0|0.8.0|0.8.1|0.8.2], which can be installed (as previously explained);
    │  │     └─ autogluon.timeseries 0.8.2 would require
    │  │        └─ pytorch >=1.13,<2.1  with the potential options
    │  │           ├─ pytorch 2.0.0 conflicts with any installable versions previously reported;
    │  │           ├─ pytorch [1.13.0|1.13.1], which can be installed;
    │  │           ├─ pytorch 2.0.0, which can be installed;
    │  │           ├─ pytorch 2.0.0, which can be installed;
    │  │           └─ pytorch 2.0.0, which can be installed;
    │  └─ autogluon 1.0.0 would require
    │     ├─ autogluon.multimodal 1.0.0 , which requires
    │     │  └─ torchvision >=0.15.0,<0.16.0 , which can be installed (as previously explained);
    │     └─ autogluon.timeseries 1.0.0 , which requires
    │        └─ pytorch >=2.0,<2.1  with the potential options
    │           ├─ pytorch 2.0.0 conflicts with any installable versions previously reported;
    │           ├─ pytorch 2.0.0, which can be installed;
    │           ├─ pytorch 2.0.0, which can be installed;
    │           └─ pytorch 2.0.0, which can be installed;
    └─ pytorch * *cuda120* is not installable because it conflicts with any installable versions previously reported.
critical libmamba Could not solve for environment specs
ngam commented 11 months ago

Hope this makes sense. I am not super familiar with the solver, but that's my interpretation. When you specify pytorch explicitly in the call, it takes precedence over xgboost and the others, so you get the cuda118 versions of everything (because that's the highest option available for your env once pytorch takes precedence over things like xgboost). Note that the cuda120 versions of pytorch and xgboost don't conflict on their own (i.e., micromamba create -n test xgboost pytorch yields cuda120 versions of both in harmony)
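The interpretation above can be illustrated with a toy model of the solve: each package has cpu/cuda118/cuda120 variants, all chosen builds must sit on the same cuda track, and the package you name explicitly gets its best variant first. This is a deliberate simplification of libmamba's actual heuristics; the candidate lists are hypothetical stand-ins for the real repodata (e.g., pytorch's cuda120 builds being ruled out by autogluon's pins, as in the readout):

```python
# Hypothetical candidate variants per package, best-first.
# pytorch lacks cuda120 here, modeling autogluon's pins ruling it out.
CANDIDATES = {
    "pytorch": ["cuda118", "cpu"],
    "xgboost": ["cuda120", "cuda118", "cpu"],
}

def pick_variants(requested):
    """Greedy sketch: packages are resolved in the order requested;
    the first GPU pick fixes the cuda track, and later packages must
    match that track or fall back to cpu."""
    chosen = {}
    track = None  # cuda track fixed by the first GPU pick
    for pkg in requested:
        for variant in CANDIDATES[pkg]:
            if variant == "cpu" or track in (None, variant):
                chosen[pkg] = variant
                if variant != "cpu" and track is None:
                    track = variant
                break
    return chosen

# xgboost resolved first: it grabs cuda120, pytorch falls back to cpu.
print(pick_variants(["xgboost", "pytorch"]))
# pytorch resolved first: it fixes cuda118, and xgboost follows.
print(pick_variants(["pytorch", "xgboost"]))
```

In this toy model, naming pytorch explicitly is what moves it to the front of the queue, matching the observed behavior that mamba install autogluon pytorch lands on cuda builds while mamba install autogluon does not.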

h-vetinari commented 11 months ago

@ngam's analysis looks solid to me. CUDA migrations should be becoming smoother in the future (much-improved setup as of CUDA 12), but for now, the best way to solve this would be to figure out which dependencies aren't built for CUDA 12 yet, and help those get built.

giswqs commented 11 months ago

@ngam Thank you very much for looking into it. Your explanation makes a lot of sense.

It appears that the autogluon.multimodal dependency restriction causes this issue.

torchvision >=0.15.0,<0.16.0

Only torchvision v0.16.1 supports cuda120. If autogluon.multimodal raised the torchvision upper bound to allow 0.16.1, mamba install -c conda-forge autogluon should be able to pull in the cuda120 builds.
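The bound arithmetic is easy to check: 0.16.1 fails the current pin but would pass a relaxed one. A minimal pure-python version comparison (ignoring pre-release and epoch subtleties; the relaxed <0.17.0 bound is one possible choice, not a confirmed autogluon change):

```python
def ver(s):
    """Parse a dotted version string into a comparable tuple."""
    return tuple(int(x) for x in s.split("."))

tv = ver("0.16.1")  # first torchvision with cuda120 builds, per the comment above

# Current autogluon.multimodal pin: torchvision >=0.15.0,<0.16.0
print(ver("0.15.0") <= tv < ver("0.16.0"))  # False: 0.16.1 is excluded
# Hypothetical relaxed pin: torchvision >=0.15.0,<0.17.0
print(ver("0.15.0") <= tv < ver("0.17.0"))  # True: 0.16.1 is allowed
```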
