KeremTurgutlu / self_supervised

Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks.
Apache License 2.0

Is there any way to train a CLIP model without installing PyTorch from source? #42

Closed: yusufani closed this issue 3 years ago

yusufani commented 3 years ago

Hi,

I have been trying to train a CLIP model from scratch. After editing the data loader functions in this code, I start training with the command below and get the following error.

Running parameters: python -m fastai.launch "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py" --arch vitb32 --size 224 --bs 360 --epochs 24 --lr 1e-4 --use_grad_check True --grad_check_nchunks 2

Error :

Dataframe is read
1533 10000
Distributed training mode
vitb32 True <class 'bool'> 2
Traceback (most recent call last):
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 126, in <module>
    def main(
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastcore\script.py", line 110, in call_parse
    return _f()
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastcore\script.py", line 105, in _f
    tfunc(**merge(args, args_from_prog(func, xtra)))
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 212, in main
    learner.fit_flat_cos(epochs, lr, pct_start=0.25)
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastai\callback\schedule.py", line 131, in fit_flat_cos
    if self.opt is None: self.create_opt()
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastai\learner.py", line 149, in create_opt
    self.opt = self.opt_func(self.splitter(self.model), lr=self.lr)
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 167, in zero
    return OptimWrapper(ZeroRedundancyOptimizer(params, optimizer_class=torch.optim.Adam, lr=lr))
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\zero_optimizer.py", line 173, in __init__
    self.world_size = dist.get_world_size(self.group)
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

After some googling of that error, I found this solution. So I added the following code after line 166 in the zero_optimizer.py file.

The code I added: dist.init_process_group(backend="mpi", group_name="main")

It looked like this after I added it:

        ...
        self._all_params = params
        self._reference_is_trainable_mask = list(map(_is_trainable, self._all_params))
        #print(torch.distributed.is_available())
        #print(torch.distributed.get_backend(group=None))
        dist.init_process_group(backend="mpi", group_name="main")

        # Build the wrapped optimizer, responsible for a shard of the params
        self.group = group if group is not None else dist.group.WORLD
        ...

After applying that solution, I encountered another error and understood that I would have to install PyTorch from source.

Error:

Dataframe is read
1533 10000
Distributed training mode
vitb32 True <class 'bool'> 2
Traceback (most recent call last):
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 126, in <module>
    def main(
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastcore\script.py", line 110, in call_parse
    return _f()
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastcore\script.py", line 105, in _f
    tfunc(**merge(args, args_from_prog(func, xtra)))
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 212, in main
    learner.fit_flat_cos(epochs, lr, pct_start=0.25)
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastai\callback\schedule.py", line 131, in fit_flat_cos
    if self.opt is None: self.create_opt()
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\fastai\learner.py", line 149, in create_opt
    self.opt = self.opt_func(self.splitter(self.model), lr=self.lr)
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\training_clip.py", line 167, in zero
    return OptimWrapper(ZeroRedundancyOptimizer(params, optimizer_class=torch.optim.Adam, lr=lr))
  File "D:\Kariyer\Projects\YTU\YTU_Multi_Modal_Contrastive_Learning\Multi_Modal_Contrastive_Learning\Kerem_Turgutlu\examples\zero_optimizer.py", line 169, in __init__
    dist.init_process_group(backend="mpi", group_name="main")
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 422, in init_process_group
    _default_pg = _new_process_group_helper(
  File "C:\Users\Yusuf\anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 495, in _new_process_group_helper
    raise RuntimeError(
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.

I have tried many times to install PyTorch from source on my Windows machine, but I haven't managed it yet. I have also tried the same steps on Google Colab, but that didn't work either.
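
As a side note, a quick way to check which distributed backends the installed PyTorch build actually includes (just a diagnostic sketch, not part of the training script) is:

    import torch.distributed as dist

    # Prebuilt PyTorch binaries ship gloo (and NCCL on Linux/CUDA builds),
    # but MPI is only compiled in when PyTorch is built from source on a
    # host that has MPI installed, which is exactly what the error says.
    print("distributed available:", dist.is_available())
    print("MPI built in:", dist.is_mpi_available())
    print("NCCL built in:", dist.is_nccl_available())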

Is there any way to train CLIP with a normal PyTorch installation, or am I missing something?

Can you share an example Colab notebook for CLIP?

KeremTurgutlu commented 3 years ago

1) This might be a fastai/PyTorch issue related to Windows, so I would highly recommend asking about it on the fastai or PyTorch forums. For example, were you previously able to run distributed training in your environment? Try to see if your environment or setup indeed works when you run a simple example such as this. If you don't face any issues training that script, then we can look more closely at the CLIP script.

2) It looks like the script is launched in distributed mode, so I assume you have multiple GPUs in your training environment? If you are not using multiple GPUs, you can't use the zero optimizer, because it shards the optimizer state across multiple processes/GPUs (see the sketch below).
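
For context, the zero opt_func in the example script (as your traceback shows) wraps a ZeroRedundancyOptimizer in fastai's OptimWrapper; each process keeps only its shard of the Adam state, which is why it needs an initialized process group with more than one process. Roughly, as a simplified sketch of that wiring (the import path here is an assumption, not the exact example code):

    import torch
    from fastai.optimizer import OptimWrapper
    from zero_optimizer import ZeroRedundancyOptimizer  # local copy used by the examples

    def zero(params, lr, **kwargs):
        # Assumes dist.init_process_group() has already been called by the
        # launcher and that there is more than one process to shard across.
        return OptimWrapper(ZeroRedundancyOptimizer(params,
                                                    optimizer_class=torch.optim.Adam,
                                                    lr=lr))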

As for building PyTorch from source: neither this library nor fastai requires installing PyTorch from source. The notebook examples also work on regular Colab after simply doing pip install -U fastai and pip install self-supervised.

> After some googling of that error, I found this solution. So I added the following code after line 166 in the zero_optimizer.py file.
>
> The code I added: dist.init_process_group(backend="mpi", group_name="main")

You don't need to manually initialize the process group; python -m fastai.launch already does that for you. You can check here.
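
For reference, a launcher like python -m fastai.launch spawns one process per GPU, sets the rank and world-size environment variables, and the distributed setup then creates the default process group from them. Conceptually it boils down to something like this per process (a simplified sketch, not fastai's actual code):

    import os
    import torch.distributed as dist

    # The launcher exports RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT and the
    # process group is then built from the environment. "nccl" is the usual
    # backend for Linux GPUs; "gloo" is the fallback where NCCL is unavailable.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=int(os.environ["RANK"]),
                            world_size=int(os.environ["WORLD_SIZE"]))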

> Can you share an example Colab notebook for CLIP?

Absolutely.

KeremTurgutlu commented 3 years ago

@yusufani Here you go, I created a notebook that demonstrates how to train a CLIP model with the COCO captions dataset as an example on a single GPU. You should be able to open any GitHub notebook link in Colab.

yusufani commented 3 years ago

Thank you for your quick feedback. It turns out I was getting these errors because I was trying to train with a single GPU. The Colab notebook also works.

KeremTurgutlu commented 3 years ago

Then here is our answer to the problem: the zero optimizer doesn't work on a single GPU. Try another opt_func if you would like to use the script!

e.g. ... --arch vitb32 --size 224 --bs 360 --epochs 24 --lr 1e-4 --use_grad_check True --grad_check_nchunks 2 --opt ranger
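
ranger is one of fastai's built-in optimizer functions; unlike the zero optimizer, it keeps all of its state in the current process, so it needs no process group. As a minimal sketch (the tiny model below is only a placeholder):

    import torch.nn as nn
    from fastai.optimizer import ranger, Adam

    # Any non-sharded fastai optimizer works in a plain single-GPU run.
    model = nn.Linear(10, 2)
    opt = ranger(list(model.parameters()), lr=1e-4)  # or Adam(...), etc.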

yusufani commented 3 years ago

I have an RTX 2070 GPU on my local machine, but I ran into some allocation problems, so I decided to try Colab, where I get the following error:

RuntimeError: expected scalar type Half but found Float

You have noted this problem in the training CLIP file, and I was able to solve it with that solution. But I'm still getting the same error in Colab even though I edited the checkpoint file.
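
From what I can tell, this kind of dtype mismatch usually appears when part of the forward pass runs outside the autocast region while mixed precision and gradient checkpointing are combined. A generic workaround sketch (not necessarily the exact fix referenced in the training script) would be to re-enter autocast inside the checkpointed segment:

    import torch
    from torch.utils.checkpoint import checkpoint

    # Re-enable autocast inside the checkpointed segment so the recomputed
    # forward uses the same (half) dtypes as the rest of the model.
    def checkpointed(module, *inputs):
        def run(*xs):
            with torch.cuda.amp.autocast():
                return module(*xs)
        return checkpoint(run, *inputs)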

KeremTurgutlu commented 3 years ago

I don't work with Colab much. Try restarting the kernel after editing the file; I am not sure, but maybe the file changes are not being picked up in the current session. You can try viewing the definitions to see if the changes are there. This would be a Colab-related issue.
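
Alternatively, forcing a re-import of the edited module usually has the same effect as restarting (a small sketch; the module name below is only a placeholder for the file you edited):

    import importlib
    import checkpoint_module  # hypothetical name for the edited file

    # Re-execute the module so the current session picks up the file changes
    # without restarting the runtime.
    importlib.reload(checkpoint_module)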

yusufani commented 3 years ago

Instead of changing lines 62 and 94, completely copying that file solved the problem. I don't know why :D