Closed faicalbounedjar closed 8 months ago
Hey! I'd recommend trying the pre-trained models first: https://github.com/Muennighoff/sgpt#use-sgpt-with-huggingface
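The linked README shows how to load the models and pool token embeddings into one sentence vector. As a minimal illustration of the pooling step alone (plain mean pooling, matching the --pooling mean flag used later in this thread; the model/tokenizer loading from the README is omitted and the numbers here are toy data, not real model output):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the per-token vectors, ignoring padding positions.

    token_embeddings: (seq_len, hidden) array of per-token vectors
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / count

# Toy example: 4 tokens with hidden size 3; the last row is padding
emb = np.array([[1.0, 2.0, 3.0],
                [3.0, 2.0, 1.0],
                [2.0, 2.0, 2.0],
                [9.0, 9.0, 9.0]])   # padding row, must not affect the result
mask = np.array([1, 1, 1, 0])
print(mean_pool(emb, mask))  # → [2. 2. 2.]
```

In real use the token embeddings come from the model's last hidden state and the mask from the tokenizer output; note that some SGPT bi-encoder variants use a position-weighted mean rather than the plain mean shown here.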
If you would like to fine-tune yourself, you can follow the guidelines here: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco
Note that only fine-tuning on NLI or MSMARCO is implemented, so if you want to fine-tune on something else, you will need to add that dataset (it should not be very difficult). If you don't have negatives for your dataset, though, it may be difficult to get better performance than the NLI/MSMARCO fine-tuned models.
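For context on why negatives matter: the NLI training setup scores each query against every other document in the batch as an implicit negative (the in-batch-negatives objective behind sentence-transformers' MultipleNegativesRankingLoss). A rough numpy sketch of that objective (the scale value is illustrative, and this omits gradients/backprop):

```python
import numpy as np

def in_batch_negatives_loss(query_emb, doc_emb, scale=20.0):
    """Contrastive loss with in-batch negatives.

    query_emb, doc_emb: (batch, hidden) arrays; row i of doc_emb is the
    positive for row i of query_emb, every other row acts as a negative.
    """
    # Cosine similarity between every query and every document
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = scale * (q @ d.T)                       # (batch, batch)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    # Cross-entropy where the diagonal (the true pair) is the correct class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy check: orthogonal embeddings where each query matches exactly one doc
queries = np.eye(4)
docs = np.eye(4)
print(in_batch_negatives_loss(queries, docs))  # near 0: each query ranks its own doc first
```

With hard negatives (as in the MSMARCO setup) you would append mined negative documents as extra columns of the score matrix instead of relying only on the other in-batch positives.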
Hi again, thank you. I just have a few more questions to clarify my vision and path:
1. How do I evaluate the SGPT model on a list of docs and queries (let's say I want to evaluate the model on DBpedia or a custom dataset)?
2. When I have a big list of documents, how do I tokenize them? (I am struggling with this problem because when I want to fine-tune the model I have to get the embeddings, and it crashes a lot.)
1) For evaluating on BEIR datasets (which include DBpedia), you can use the files in the biencoder/beir folder. Specifically, this simple script may be useful to you: https://github.com/Muennighoff/sgpt/tree/main/biencoder/beir#quick-benchmarking
2) The code tokenizes them on the fly using the Hugging Face tokenizer. If that's too slow for you, you can look into OpenAI's tiktoken, but I'm not sure if it's compatible with the Hugging Face models.
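Regarding the crashes when embedding a big corpus: these are usually out-of-memory errors from encoding everything at once. A common fix is to encode in small batches. Here is a generic sketch of that pattern, where encode_fn is a hypothetical stand-in for the real tokenizer + model call (in real code you would also wrap the model call in torch.no_grad() and move results off the GPU):

```python
def embed_in_batches(texts, encode_fn, batch_size=32):
    """Encode a large list of texts in small batches to keep memory bounded.

    encode_fn maps a list of strings to a list of embedding vectors;
    lower batch_size if you still run out of memory.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(encode_fn(batch))
    return embeddings

# Stand-in encoder for demonstration: real code would call the model here
fake_encode = lambda batch: [[float(len(t))] for t in batch]
docs = ["a", "bb", "ccc"] * 100
vecs = embed_in_batches(docs, fake_encode, batch_size=32)
print(len(vecs))  # → 300
```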
Okay, thank you for your guidance. For the fine-tuning, I tried following your instructions in https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco, but I ended up with a lot of error messages when I run the training command (accelerate launch ...). (I already installed the requirements and tried the initial code; I didn't change anything for now.)
Hm, can you share the errors you get?
You need to make sure to set up accelerate correctly for your machine specs via accelerate config.
The error:
/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
File "examples/training/nli/training_nli_v2.py", line 15, in <module>
from sentence_transformers import models, losses, datasets
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/__init__.py", line 3, in <module>
from .datasets import SentencesDataset, ParallelSentencesDataset
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/datasets/__init__.py", line 3, in <module>
from .ParallelSentencesDataset import ParallelSentencesDataset
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/datasets/ParallelSentencesDataset.py", line 4, in <module>
from .. import SentenceTransformer
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/SentenceTransformer.py", line 28, in <module>
from .evaluation import SentenceEvaluator
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/evaluation/__init__.py", line 5, in <module>
from .InformationRetrievalEvaluator import InformationRetrievalEvaluator
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/evaluation/InformationRetrievalEvaluator.py", line 6, in <module>
from ..util import cos_sim, dot_score
File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/util.py", line 459, in <module>
from huggingface_hub.snapshot_download import REPO_ID_SEPARATOR
ModuleNotFoundError: No module named 'huggingface_hub.snapshot_download'
Traceback (most recent call last):
File "/home/faical/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1209, in launch_command
simple_launcher(args)
File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 591, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'examples/training/nli/training_nli_v2.py', '--model_name', 'EleutherAI/gpt-neo-125M', '--pooling', 'mean']' returned non-zero exit status 1.
When I executed accelerate config, I followed the steps you mentioned in https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco and left the yes/no answers empty.
And how do I know exactly what my machine specs are?
Try pip install --upgrade huggingface-hub==0.10.1
(https://github.com/UKPLab/sentence-transformers/issues/1762)
Things like how many GPUs you have, etc.; accelerate asks you for this when you run accelerate config.
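If it helps, a few standard Linux commands (illustrative, not from the repo) answer the questions accelerate config asks:

```shell
# Rough ways to check the specs accelerate config asks about
nproc                                           # number of CPU cores
free -h | head -2                               # total / available RAM
nvidia-smi -L || echo "no NVIDIA GPU detected"  # lists GPUs, if any
```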
What if I used training_nli.py instead? Should that work? If not, is there a way to fine-tune without accelerate? (I am having the same problem as yesterday: I entered my specs and still got a lot of errors.)
And can I run the fine-tuning in a Colab/Kaggle notebook for better performance?
Hello,
I am interested in using your pre-trained SGPT model for my project, but I am a bit lost when it comes to fine-tuning it for my specific use case (let's say I want to use DBpedia). I was wondering if someone could provide me with a guide or some resources to help me get started.
I would really appreciate any help you can give me. Thank you in advance!