AlbinSou / ocl_survey

Code for "A Comprehensive Empirical Evaluation on Online Continual Learning" ICCVW 2023 VCL Workshop
https://arxiv.org/abs/2308.10328
MIT License

Got stuck somewhere when trying to start an experiment #10

Closed zhaoedf closed 5 months ago

zhaoedf commented 6 months ago

I set up the environment following the Installation section of the README.

I launched the demo experiment with the command `python main.py strategy=er experiment=split_cifar100`, and the program froze at the stage shown in the image.

How can I solve this problem? Thanks a lot.

[screenshot: program output frozen shortly after launch]

AlbinSou commented 6 months ago

Hey, let's try to find out what happens. Can you check that GPU access with PyTorch works? In Python, `import torch` and `print(torch.cuda.is_available())`.
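A quick way to run this check as a small self-contained script (hedged sketch: the `torch` attributes used are standard PyTorch, but the `gpu_status` helper is just an illustrative name, and the script assumes PyTorch is installed in the active environment):

```python
# Sanity-check whether PyTorch can see a GPU in the current environment.
import importlib.util


def gpu_status() -> str:
    """Return a one-line summary of PyTorch's view of the GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"device {torch.cuda.get_device_name(0)}"
    )


print(gpu_status())
```

If this reports that CUDA is not available even though `nvidia-smi` sees the GPU, the installed wheel and the driver's CUDA version are likely mismatched.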

zhaoedf commented 6 months ago


I have GPU access.

I managed to find the problem. It seems that in some cases (e.g. when using `CUDA_VISIBLE_DEVICES=X`), the run freezes and training never starts. If I add the line below, training no longer freezes, but it cannot use the GPU:

strategy_dict['device'] = 'cpu'  # added
cl_strategy = globals()[strategy](**strategy_dict, plugins=plugins)

Since avalanche, like pytorch-lightning, uses so many loops, it's hard to debug. Any advice? Thank you for your kind reply!

zhaoedf commented 6 months ago


It looks like the installation instructions in the README you provided cause a CUDA compatibility issue on my machine. Simply running the following installation commands solved the problem:

conda create -n avalanche python=3.10
conda env config vars set PYTHONPATH=/home/.../ocl_survey
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install avalanche-lib
... # install other missing packages as prompted
AlbinSou commented 6 months ago


Yes, not using the GPU is not an acceptable solution. I hope that solving the CUDA compatibility issue did the job.

As for debugging, it really depends. I sometimes create a "debug" plugin that calls pdb in one of the callbacks; that way I can inspect the state of the strategy at a given callback without having to call pdb directly in Avalanche's source code. However, it's true that this makes things a bit more complicated than usual. And for the kind of issue you just had (stuck somewhere), it does not really help, or you have to find, function by function, where the program got stuck.

zhaoedf commented 6 months ago


My problem was solved by reinstalling the avalanche environment with the commands listed above. If that's OK with you, you can close this issue.

By the way, would you mind sharing your "debug plugin"? I think it might be helpful for future debugging. Thanks in advance.

AlbinSou commented 5 months ago


Ok, glad you solved it. As for the debug plugin, it can be as simple as a plugin that calls `import pdb; pdb.set_trace()` inside the callback you want to debug; you can also guard that line with a condition under which you want the debugger to trigger.
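A minimal sketch of such a plugin (hedged: the class name `PdbDebugPlugin` and the `condition` parameter are illustrative; Avalanche plugins subclass `SupervisedPlugin` and override callbacks such as `after_training_iteration`, and the `ImportError` fallback only keeps the sketch self-contained when Avalanche is not installed):

```python
import pdb

try:
    # In Avalanche, plugins subclass SupervisedPlugin and override callbacks.
    from avalanche.training.plugins import SupervisedPlugin as _Base
except ImportError:
    _Base = object  # fallback so the sketch runs without Avalanche installed


class PdbDebugPlugin(_Base):
    """Drop into pdb inside a callback, optionally only when `condition` holds."""

    def __init__(self, condition=None):
        super().__init__()
        # Illustrative condition, e.g.:
        #   condition=lambda strategy: strategy.clock.train_iterations == 100
        self.condition = condition

    def after_training_iteration(self, strategy, **kwargs):
        # Breaks here, so you can inspect `strategy` (model, mbatch, loss, ...)
        # without editing Avalanche's source.
        if self.condition is None or self.condition(strategy):
            pdb.set_trace()
```

Passed in the strategy's `plugins` list, this stops the run at the chosen callback; swap the overridden method for any other callback you want to break on.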