FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

ValueError: Attempting to unscale FP16 gradients. #745

Open QuangTQV opened 5 months ago

QuangTQV commented 5 months ago

Here is the Google Colab notebook I used for fine-tuning: https://colab.research.google.com/drive/1kiALBR1UarPobiftZmiHfwFyk7hTCDnV?usp=sharing

When I fine-tune the LLM-Embedder for tool retrieval using the command shown in the attached screenshot, the following error occurred:

04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss with AVX2 support.
04/30/2024 23:52:47 - INFO - faiss.loader - Could not load library with AVX2 support due to: ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss.
04/30/2024 23:52:47 - INFO - faiss.loader - Successfully loaded faiss.
2024-04-30 23:52:47.990022: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 23:52:47.990076: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 23:52:47.991470: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 23:52:49.207020: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
04/30/2024 23:52:49 - INFO - src.retrieval.modeling_dense - Loading tokenizer and model from BAAI/bge-base-en...
max_steps is given, it will override any value given in num_train_epochs
  0% 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 157, in <module>
    main()
  File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 150, in main
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2249, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2157, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2107, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 258, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
  0% 0/2000 [00:01<?, ?it/s]
[2024-04-30 23:53:01,805] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5985) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-30_23:53:01
  host      : f8adfa8a5d97
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5985)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
QuangTQV commented 5 months ago

Can anyone help me? Thanks.

namespace-Pt commented 5 months ago

Hi, please try specifying --dtype fp32 in the training script.
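
For context (my own minimal sketch, not code from this repo): the error comes from PyTorch's GradScaler, which can only unscale FP32 gradients. If the encoder weights themselves are loaded in half precision (which the repo's --dtype flag presumably controls), the gradients are FP16 and the unscale step inside gradient clipping raises exactly this ValueError. Keeping master weights in FP32 while letting autocast run the forward pass in FP16 avoids it. The model, optimizer, and tensor names below are illustrative only.

import torch

# Minimal sketch (assumption: a CUDA device is available).
model = torch.nn.Linear(4, 2).cuda()
# model = model.half()  # <- casting the weights to FP16 reproduces
#                       #    "ValueError: Attempting to unscale FP16 gradients."

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 4, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):  # FP16 compute, FP32 master weights
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()
scaler.unscale_(opt)                                    # fine: grads are FP32
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)
scaler.update()

With --dtype fp32 the master weights stay in FP32, so the scaler's unscale step during gradient clipping succeeds even when mixed-precision training is enabled.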

QuangTQV commented 5 months ago

Hi, please try specifying --dtype fp32 in the training script.

After fine-tuning, I tested several cases. The positive samples scored around 0.9, while the negative samples scored around 0.84. I don't think this gap is acceptable; how can I widen it? I fine-tuned for the retrieval task.

namespace-Pt commented 5 months ago

Hi, this is a direct result of contrastive learning: it only guarantees that positives score higher than negatives, not that the gap between them is large. You can try a margin-based loss to enforce a bigger gap between positives and negatives. However, the model may be harder to train with losses other than the contrastive one.
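
As a rough illustration of the margin idea (my own sketch, not the loss implemented in this repo): a hinge term on cosine similarities penalizes any negative whose score comes within a chosen margin of the positive's score, which directly pushes the gap wider. The names query_emb, pos_emb, neg_emb and the margin value are assumptions for already-pooled embeddings and a tunable hyperparameter.

import torch
import torch.nn.functional as F

def margin_contrastive_loss(query_emb, pos_emb, neg_emb, margin=0.2):
    """Hinge/margin loss on cosine similarities (illustrative sketch).

    query_emb: (B, d) query embeddings
    pos_emb:   (B, d) one positive per query
    neg_emb:   (B, N, d) N negatives per query
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)

    pos_sim = (q * p).sum(dim=-1, keepdim=True)   # (B, 1)
    neg_sim = torch.einsum("bd,bnd->bn", q, n)     # (B, N)

    # Penalize every negative that is not at least `margin` below the positive.
    return F.relu(margin - pos_sim + neg_sim).mean()

In practice such a term is often combined with the standard contrastive (InfoNCE) loss, e.g. as a weighted sum, so training stays stable while the margin term widens the score gap.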