beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
Apache License 2.0

PCA issue #52

Closed mikelee-dev closed 3 weeks ago

mikelee-dev commented 1 month ago

Does anyone else get this error during training?


```
  File "./train_long_clip.py", line 425, in <module>
    raise e
  File "./train_long_clip.py", line 395, in <module>
    loss_long, loss_short, _, __, ___, ____ = model(images, long_text_inputs, short_text_inputs)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Long_CLIP/model/model_longclip.py", line 484, in forward
    image_features_short = self.PCA(image_features_long, 32)
  File "/Long_CLIP/model/model_longclip.py", line 403, in PCA
    U, S, Vt = torch.linalg.svd(X_centered, full_matrices=False)
torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 29).
```

mikelee-dev commented 1 month ago

It appears this is triggered by self.encode_image(image), which sometimes returns a tensor of NaNs even though image itself does not contain any NaNs.
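
One way to narrow down where the NaNs first appear inside an encoder is to attach forward hooks and run a single forward pass. The sketch below is only an illustration: `find_nonfinite_modules` is a hypothetical helper, not part of Long-CLIP, and it assumes the encoder is an ordinary `torch.nn.Module` whose submodules return plain tensors.

```python
import torch
import torch.nn as nn


def find_nonfinite_modules(model: nn.Module, *inputs):
    """Run one forward pass; return names of submodules whose output has NaN/Inf.

    Hypothetical debugging helper (not from the Long-CLIP code base).
    Names are listed in the order the submodules finish their forward call.
    """
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Only plain tensor outputs are inspected in this sketch.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # the root module has an empty name; skip it
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always detach the hooks so later training steps are unaffected.
        for handle in handles:
            handle.remove()
    return bad
```

Calling it with the image encoder and the batch that fails would list the affected layers. The first name in the list is the earliest layer to finish with a non-finite output, which is usually a good place to start looking.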

beichenzbc commented 1 month ago

That's strange; I didn't encounter that problem. You could check whether ShareGPT4V was downloaded completely and rerun the training code. If it still occurs, you could wrap the training step in a try ... except to skip the failing batch, since it rarely happens.
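
A minimal sketch of that try ... except idea, written as a generic loop rather than against the actual `train_long_clip.py`: `train_with_skips`, `step_fn` and `skip_on` are hypothetical names, and catching `RuntimeError` by default is an assumption. Pass whichever exception class your PyTorch version raises for the SVD failure.

```python
def train_with_skips(batches, step_fn, skip_on=(RuntimeError,)):
    """Run step_fn on each batch, skipping batches that raise one of `skip_on`.

    Hypothetical helper: step_fn stands in for one forward/backward/optimizer
    step and should return the loss for that batch.
    """
    losses = []
    skipped = []
    for index, batch in enumerate(batches):
        try:
            losses.append(step_fn(batch))
        except skip_on as err:
            # Record the failure and continue instead of aborting the run.
            skipped.append((index, str(err)))
    return losses, skipped
```

Keeping the list of skipped batches makes it easy to see whether the failure really is rare. If many batches are skipped, the NaNs probably come from the model weights rather than from individual samples, and skipping will only hide the problem.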

MitsuiChen14 commented 1 day ago

Hello, sorry to interrupt. While fine-tuning on my own dataset, I encountered the error torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 63). After some debugging, I found that the error appears when training with the AdamW optimizer, which results in the image tensor becoming NaN. Switching to the SGD optimizer resolved the issue. I am not sure what caused this problem. If you have any insights, I would greatly appreciate your help!