BatsResearch / csp

Learning to compose soft prompts for compositional zero-shot learning.
BSD 3-Clause "New" or "Revised" License

Clip giving error and also cuda out of memory #18

Closed (ans92 closed 8 months ago)

ans92 commented 8 months ago

Hi, thank you for the great work and for sharing the code with us. I am trying to run the code, but I am getting an error. Running train.py produces the following output:

/home/ans/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
training details
Namespace(experiment_name='csp', dataset='ut-zappos', lr=5e-05, weight_decay=1e-05, clip_model='ViT-L/14', epochs=20, train_batch_size=64, eval_batch_size=1024, evaluate_only=False, context_length=8, attr_dropout=0.3, save_path='/home/ans/CZSL/csp-model-saved/mit-states/sample_model', save_every_n=1, save_model=False, seed=0, gradient_accumulation_steps=2)
####
/home/ans/DATA_ROOT/ut-zap50k/
# train pairs: 83 | # val pairs: 30 | # test pairs: 36
# train images: 22998 | # val images: 3214 | # test images: 2914
model dtype torch.float16
soft embedding dtype torch.float32
epoch   1:   0%|                                                                                                    | 0/360 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ans/CZSL/csp-main/train.py", line 218, in <module>
    model, optimizer = train_model(
  File "/home/ans/CZSL/csp-main/train.py", line 64, in train_model
    batch_feat = model.encode_image(batch_img)
  File "/home/ans/CZSL/csp-main/clip_modules/interface.py", line 65, in encode_image
    return self.clip_model.encode_image(imgs)
  File "/home/ans/.local/lib/python3.10/site-packages/clip/model.py", line 342, in encode_image
    return self.visual(image.type(self.dtype))
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/clip/model.py", line 233, in forward
    x = self.transformer(x)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/clip/model.py", line 204, in forward
    return self.resblocks(x)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/clip/model.py", line 192, in forward
    x = x + self.mlp(self.ln_2(x))
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/ans/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ans/.local/lib/python3.10/site-packages/clip/model.py", line 169, in forward
    return x * torch.sigmoid(1.702 * x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB (GPU 0; 11.77 GiB total capacity; 10.25 GiB already allocated; 64.06 MiB free; 10.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
epoch   1:   0%|                                                                                                    | 0/360 [00:00<?, ?it/s]

The first line of the output appears when importing clip, so I think clip is not being imported properly. Second, I am getting CUDA out of memory even on the smaller ut-zappos dataset. I am using an RTX 3080 Ti (12 GB), which is close to the GPU you used, so I did not expect it to run out of memory. Can you please help me with this?
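To make the numbers in the out-of-memory message easier to read, here is a small sketch that only does arithmetic on the figures quoted in the traceback above. The function name `summarize_oom` and the interpretation in the comments are illustrative assumptions, not part of the csp code or of PyTorch.

```python
MIB_PER_GIB = 1024


def summarize_oom(total_gib, allocated_gib, reserved_gib, free_mib, requested_mib):
    """Derive a few ratios from the figures in a CUDA OOM message."""
    # Memory held by PyTorch's caching allocator but not backing any tensor.
    cached_unused_mib = (reserved_gib - allocated_gib) * MIB_PER_GIB
    return {
        "allocated_fraction": allocated_gib / total_gib,
        "cached_unused_mib": cached_unused_mib,
        "cached_unused_fraction": (reserved_gib - allocated_gib) / reserved_gib,
        "shortfall_mib": requested_mib - free_mib,
    }


if __name__ == "__main__":
    # Values copied from the error message in this issue.
    summary = summarize_oom(
        total_gib=11.77,
        allocated_gib=10.25,
        reserved_gib=10.46,
        free_mib=64.06,
        requested_mib=130.00,
    )
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

Live tensors already occupy about 87% of the card, and reserved memory is only about 2% above allocated memory. That is not the "reserved >> allocated" case the message mentions, so tuning max_split_size_mb probably offers little headroom here; lowering the memory needed per step (for example, a smaller batch) is likely the more direct lever.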

Please note that I installed clip using the requirements.txt file. I do not have conda, and I cannot install it because I do not have admin access.

ans92 commented 8 months ago

The clip issue turned out to be a version mismatch: my torch and torchvision were not compatible with each other. This page can help: https://pypi.org/project/torchvision/

I installed torch and torchvision with this command:

pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
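A quick way to check whether an installed torch/torchvision pair matches is sketched below. It uses only the standard library. The `KNOWN_PAIRS` table is a small, non-exhaustive assumption covering the 1.12 and 1.13 release series, and the helper names are hypothetical.

```python
from importlib import metadata

# torch release -> matching torchvision release (1.12/1.13 series only;
# extend this table for other versions).
KNOWN_PAIRS = {
    "1.12.0": "0.13.0",
    "1.12.1": "0.13.1",
    "1.13.0": "0.14.0",
    "1.13.1": "0.14.1",
}


def base_version(version):
    """Drop a local build tag such as '+cu117'."""
    return version.split("+", 1)[0]


def is_known_pair(torch_version, torchvision_version):
    """Return True only if the pair appears in KNOWN_PAIRS."""
    expected = KNOWN_PAIRS.get(base_version(torch_version))
    return expected is not None and expected == base_version(torchvision_version)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v = installed_version("torch")
    vision_v = installed_version("torchvision")
    if torch_v is None or vision_v is None:
        print("torch and/or torchvision is not installed")
    else:
        print(torch_v, vision_v, "known pair:", is_known_pair(torch_v, vision_v))
```

A result of False only means the pair is not in this small table; it does not prove the two packages are incompatible.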

That resolved the clip issue. I was still getting the CUDA out of memory error, so I reduced the batch size to 4.
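If the goal is to keep the same effective batch size after shrinking the per-step batch, the arithmetic can be sketched as follows. This assumes the batch size that was reduced is `train_batch_size`, and that `gradient_accumulation_steps` in train.py accumulates gradients over that many batches before each optimizer step, as the name suggests. The helper names are hypothetical.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_images / batch_size)


def accumulation_steps_to_match(old_batch, old_accum, new_batch):
    """Accumulation steps that keep old_batch * old_accum samples per update."""
    effective = old_batch * old_accum
    if effective % new_batch != 0:
        raise ValueError("new batch size does not divide the effective batch size")
    return effective // new_batch


if __name__ == "__main__":
    # Values from the Namespace and dataset summary printed in the log.
    train_images = 22998
    print(steps_per_epoch(train_images, 64))       # batches per epoch at batch size 64
    print(steps_per_epoch(train_images, 4))        # batches per epoch at batch size 4
    print(accumulation_steps_to_match(64, 2, 4))   # accumulation steps at batch size 4
```

With the logged settings (batch size 64, 2 accumulation steps) each update sees 128 samples. At batch size 4, 32 accumulation steps would give the same 128 samples per update while keeping per-step memory low, at the cost of many more forward and backward passes per epoch.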