Closed: yusufani closed this issue 3 years ago
1) This might be a fastai/PyTorch-on-Windows issue, and I would recommend asking about it in the fastai or PyTorch forums. For example, were you previously able to run distributed data-parallel training in your environment? Try to see if your environment and setup actually work when you run a simple example such as this. If you don't face any issues training that script, then we can look more closely at the CLIP script.
2) It looks like the script is called in distributed mode, so I assume you have multiple GPUs in your training environment? If you are not using multiple GPUs, you can't use the zero optimizer, because the zero optimizer uses multiple processes/GPUs to shard the optimizer states.
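The sharding point above can be made concrete with a toy sketch. This is not the fastai or PyTorch API (PyTorch's real implementation is `torch.distributed.optim.ZeroRedundancyOptimizer`); it is just a plain-Python illustration, with a hypothetical `shard_optimizer_state` helper, of why a ZeRO-style optimizer needs more than one process to save any memory:

```python
# Toy illustration (not the actual fastai/PyTorch API) of ZeRO-style
# optimizer state sharding: each rank stores only its shard of the
# optimizer state, so a single rank ends up holding everything.

def shard_optimizer_state(param_ids, world_size):
    """Assign each parameter's optimizer state to exactly one rank."""
    shards = {rank: [] for rank in range(world_size)}
    for i, pid in enumerate(param_ids):
        shards[i % world_size].append(pid)
    return shards

params = list(range(8))  # pretend the model has 8 parameter tensors

single = shard_optimizer_state(params, world_size=1)
multi = shard_optimizer_state(params, world_size=4)

# With one process, rank 0 holds state for all 8 parameters: no memory win.
print(len(single[0]))                       # 8
# With four processes, each rank holds state for only 2 parameters.
print([len(multi[r]) for r in range(4)])    # [2, 2, 2, 2]
```

This is why running the script with a zero optimizer on a single GPU fails: there is only one rank, so there is nothing to shard across.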
As for building PyTorch from source: neither this library nor fastai requires installing PyTorch from source. The notebook examples work on regular Colab too, after simply doing pip install -U fastai and pip install self-supervised.
After some googling of that error, I found this solution, so I added the following code after line 166 in the zero_optimizer.py file.
The code I added: dist.init_process_group(backend="mpi", group_name="main")
You don't need to manually initialize the process group; python -m fastai.launch already does that for you. You can check here.
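To make the division of labor clearer, here is a rough sketch (not fastai's actual source; the function names below are made up for illustration) of what a launcher conceptually does: it sets the standard rendezvous environment variables and spawns one process per GPU, so the training script only has to read them rather than call dist.init_process_group with hand-picked arguments:

```python
import os

# Sketch of a distributed launcher's job (NOT fastai's real code):
# per child process, set the standard PyTorch rendezvous env vars.
def launcher_setup(rank, world_size, addr="127.0.0.1", port="29500"):
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = port

# Inside the training script, the process group setup can then rely on
# the environment alone -- no manual backend/group_name arguments needed.
def training_script_view():
    return int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"])

launcher_setup(rank=0, world_size=2)
print(training_script_view())  # (0, 2)
```

This is why adding your own dist.init_process_group call inside zero_optimizer.py conflicts with the launcher: the group is being initialized twice.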
Can you share example Colab notebook for CLIP? Absolutely.
@yusufani Here you go, I created a notebook which demonstrates how to train a CLIP model on the COCO captions dataset as an example on a single GPU. You should be able to open any GitHub notebook link in Colab.
Thank you for your quick feedback. I had been getting errors because I was trying to train with a single GPU. The Colab notebook also works.
Then there is our answer to the problem: the zero optimizer doesn't work on a single GPU, so try another opt_func if you would like to use the script!
e.g.
... --arch vitb32 --size 224 --bs 360 --epochs 24 --lr 1e-4 --use_grad_check True --grad_check_nchunks 2 --opt ranger
I have an RTX 2070 GPU locally, but I ran into some allocation problems, so I decided to try Colab, where I get the following error:
RuntimeError: expected scalar type Half but found Float
You mentioned this problem in the CLIP training file, and I was able to solve it with that solution. But I'm getting the same error in Colab even though I edited the checkpoint file.
I don't work with Colab much. Try restarting the kernel after editing the file. I'm not sure, but maybe the file changes are not imported in the current session. You can view the definitions to check whether your changes are there. This would be a Colab-related issue.
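The stale-module suspicion above is easy to demonstrate. The sketch below uses a throwaway module name (`colab_patch_demo`, invented for this example) to stand in for the edited library file, and shows that Python keeps the already-imported version in memory until you either restart the kernel or reload the module explicitly:

```python
import importlib
import pathlib
import sys

sys.path.insert(0, ".")  # make the current directory importable

# Write a tiny module to disk to stand in for the edited library file.
pathlib.Path("colab_patch_demo.py").write_text("VALUE = 1\n")
import colab_patch_demo
print(colab_patch_demo.VALUE)   # 1

# Simulate editing the file on disk, as done in Colab's file browser.
pathlib.Path("colab_patch_demo.py").write_text("VALUE = 42\n")
print(colab_patch_demo.VALUE)   # still 1: the old module is cached in memory

importlib.invalidate_caches()
importlib.reload(colab_patch_demo)  # re-executes the file in place
print(colab_patch_demo.VALUE)   # 42
```

Restarting the kernel after editing the file has the same effect, since the fresh session imports the file from disk.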
Hi,
I have been trying to train a CLIP model from scratch. After editing the data loader functions in this code, I get the following error when I start training with the command below.
Running parameters:
python -m fastai.launch "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py" --arch vitb32 --size 224 --bs 360 --epochs 24 --lr 1e-4 --use_grad_check True --grad_check_nchunks 2
Error :
After some googling of that error, I found this solution, so I added the following code after line 166 in the zero_optimizer.py file.
The code I added :
dist.init_process_group(backend="mpi", group_name="main")
It looked like this after I added it:
After applying that solution, I encountered another error, and I understood that I must install PyTorch from source.
Error:
I tried many times to install PyTorch from source on my Windows machine, but I haven't managed it yet. I have also tried the same steps on Google Colab, but that didn't work either.
Is there any way to train CLIP with a standard PyTorch install, or am I missing something?
Can you share example Colab notebook for CLIP?