Open QuangTQV opened 5 months ago
Can anyone help me? Thanks.
Hi, please try specifying --dtype fp32 in the training script.
After fine-tuning, I tested several cases: the positive samples scored around 0.9, while the negative samples scored around 0.84. I feel this gap is not acceptable; how can I widen it? I fine-tuned for the retrieval task.
Hi, this is a direct result of contrastive learning. It only guarantees that positives score higher than negatives; it does not ensure that the gap between them is large. You can try a margin-based loss to enforce a larger gap between positives and negatives. However, the model may be harder to train with losses other than contrastive learning.
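For illustration, here is a minimal sketch of one such margin-based loss (hypothetical code, not from the FlagEmbedding codebase; the function name and the margin value are made up). It adds a hinge on top of cosine similarities so that every negative must fall at least margin below the positive:

import torch
import torch.nn.functional as F

def margin_ranking_loss(query, positive, negatives, margin=0.2):
    # query:     (batch, dim) query embeddings
    # positive:  (batch, dim) positive passage embeddings
    # negatives: (batch, n_neg, dim) negative passage embeddings
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    pos_sim = (q * p).sum(-1, keepdim=True)        # (batch, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, n)     # (batch, n_neg)

    # Penalize every negative whose similarity comes within `margin` of the positive.
    return F.relu(margin - (pos_sim - neg_sim)).mean()

The margin is an extra hyperparameter to tune; setting it too large tends to make training unstable, which is the caveat above about such losses being harder to train.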
Here is the Google Colab link I used for fine-tuning: https://colab.research.google.com/drive/1kiALBR1UarPobiftZmiHfwFyk7hTCDnV?usp=sharing
When I fine-tune the LLM-Embedder for tool retrieval using the command from the Google Colab, an error occurs:
04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss with AVX2 support.
04/30/2024 23:52:47 - INFO - faiss.loader - Could not load library with AVX2 support due to: ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss.
04/30/2024 23:52:47 - INFO - faiss.loader - Successfully loaded faiss.
2024-04-30 23:52:47.990022: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 23:52:47.990076: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 23:52:47.991470: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 23:52:49.207020: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
04/30/2024 23:52:49 - INFO - src.retrieval.modeling_dense - Loading tokenizer and model from BAAI/bge-base-en...
max_steps is given, it will override any value given in num_train_epochs
0% 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 157, in <module>
    main()
  File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 150, in main
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2249, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2157, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2107, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 258, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
0% 0/2000 [00:01<?, ?it/s]
[2024-04-30 23:53:01,805] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5985) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py FAILED
Failures:
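For context on the traceback (this is general PyTorch AMP behavior, not specific to this repo): GradScaler.unscale_ raises "Attempting to unscale FP16 gradients." whenever the gradients themselves are stored in fp16, which is what happens when the model weights are loaded in fp16 instead of being kept in fp32 and letting autocast handle the mixed-precision math. A minimal sketch of the failure mode, assuming a CUDA device is available:

import torch

model = torch.nn.Linear(4, 4).cuda().half()             # fp16 weights -> fp16 gradients
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
with torch.cuda.amp.autocast():
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)   # ValueError: Attempting to unscale FP16 gradients.

Dropping the .half() call (i.e. keeping the master weights in fp32) lets the same code run, which is why loading the model with --dtype fp32, as suggested above, sidesteps the error.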