Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

Need help fine-tuning sgpt model #18

Closed faicalbounedjar closed 8 months ago

faicalbounedjar commented 1 year ago

Hello,

I am interested in using your pre-trained SGPT model for my project, but I am a bit lost when it comes to fine-tuning it for my specific use case (let's say I want to use DBpedia). I was wondering if someone could provide me with a guide or some resources to help me get started.

I would really appreciate any help you can give me. Thank you in advance!

Muennighoff commented 1 year ago

Hey! I'd recommend you try the pre-trained models first: https://github.com/Muennighoff/sgpt#use-sgpt-with-huggingface
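The README's Hugging Face usage revolves around position-weighted mean pooling over the model's last hidden states (later tokens weigh more). As a minimal sketch, here is just that pooling step in plain PyTorch, decoupled from any particular model or tokenizer, so the shapes and masking are easy to follow:

```python
import torch

def weighted_mean_pooling(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Position-weighted mean pooling: token i gets weight i+1, so later
    tokens contribute more to the sentence embedding.

    hidden_states: (batch, seq_len, dim) last-layer hidden states
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Weights 1..seq_len, broadcast over batch and hidden dimensions
    weights = torch.arange(1, hidden_states.shape[1] + 1, dtype=hidden_states.dtype)
    weights = weights.unsqueeze(0).unsqueeze(-1).expand(hidden_states.size())
    # Zero out padding positions before summing
    mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(hidden_states.dtype)
    summed = torch.sum(hidden_states * mask * weights, dim=1)
    counts = torch.sum(mask * weights, dim=1)
    return summed / counts  # (batch, dim)
```

In practice the `hidden_states` would come from a call like `model(**tokens).last_hidden_state` on one of the SGPT checkpoints; see the README link above for the full pipeline.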

If you would like to fine-tune yourself, you can follow the guidelines here: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco

Note that only fine-tuning on NLI or MSMARCO is implemented, so if you want to fine-tune on something else, you will need to add that dataset (which should not be very difficult). If you don't have negatives for your dataset, though, it may be difficult to get better performance than the NLI / MSMARCO fine-tuned models.
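The biencoder training in the repo goes through sentence-transformers' `MultipleNegativesRankingLoss`, i.e. a contrastive objective where the other documents in the batch serve as negatives. A plain-PyTorch sketch of that in-batch-negatives idea (not the repo's actual training loop) looks like this:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Contrastive loss with in-batch negatives, the idea behind
    sentence-transformers' MultipleNegativesRankingLoss.

    query_emb, doc_emb: (batch, dim); doc_emb[i] is the positive for
    query_emb[i], and all other docs in the batch act as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T * scale               # (batch, batch) scaled cosine similarities
    labels = torch.arange(scores.size(0))  # diagonal entries are the positive pairs
    return F.cross_entropy(scores, labels)
```

This is why having (query, positive) pairs is enough to train, but explicit hard negatives, as in MSMARCO, generally help more than the random in-batch ones.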

faicalbounedjar commented 1 year ago

Hi again, thank you! I just have a few more questions to clarify my vision & path:
1. How do I evaluate the SGPT model on a list of docs and queries (let's say I want to evaluate the model on DBpedia or a custom dataset)?
2. When I have a big list of documents, how do I tokenize them? (I am struggling with this problem because when I want to fine-tune the model I have to get the embeddings, and it crashes a lot.)

Muennighoff commented 1 year ago

1) For evaluating on BEIR datasets (which include dbpedia), you can use the files in the biencoder/beir folder. Specifically, this simple script may be useful to you: https://github.com/Muennighoff/sgpt/tree/main/biencoder/beir#quick-benchmarking
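BEIR's headline retrieval metric is nDCG@10. If you end up evaluating on a custom dataset outside BEIR, here is a small self-contained sketch of that metric for a single query (the doc ids and relevance grades are made-up illustrations):

```python
import math

def ndcg_at_k(ranked_doc_ids: list, relevant: dict, k: int = 10) -> float:
    """nDCG@k with graded relevance, the metric BEIR reports for retrieval runs.

    ranked_doc_ids: doc ids ordered by descending model score
    relevant: {doc_id: relevance_grade} judgments for this query
    """
    # Discounted cumulative gain of the model's top-k ranking
    dcg = sum(relevant.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_doc_ids[:k]))
    # Ideal DCG: the judged docs sorted by relevance
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this over all queries gives the dataset-level score; the BEIR library does all of this for you on its own datasets, so this is only for the custom-dataset case.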

2) The code tokenizes them on the fly using the Hugging Face tokenizer - if that's too slow for you, you can look into OpenAI's tiktoken, but I'm not sure if it's compatible with the Hugging Face models.
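On the crashes: embedding a big document list in one go usually runs out of memory. A common workaround (not specific to this repo) is to encode in fixed-size chunks under `torch.no_grad()` and move each chunk to CPU; a sketch with a placeholder `encode_batch` callable standing in for your tokenizer + model:

```python
import torch

def encode_in_batches(texts: list, encode_batch, batch_size: int = 32) -> torch.Tensor:
    """Embed a large list of texts in fixed-size chunks so GPU memory stays bounded.

    encode_batch: a callable taking a list of strings and returning an
    (n, dim) tensor, e.g. a wrapper around tokenizer + model for one batch.
    """
    chunks = []
    with torch.no_grad():  # inference only: no gradient buffers
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            chunks.append(encode_batch(batch).cpu())  # free GPU memory per chunk
    return torch.cat(chunks, dim=0)
```

Lowering `batch_size` (and truncating very long documents at tokenization time) is usually enough to stop the out-of-memory crashes.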

faicalbounedjar commented 1 year ago

Okay, thank you for your guide. For the fine-tuning I tried following your instructions in https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco, but I ended up with a lot of error messages when I run the training command (accelerate launch ...). (I already installed the requirements and tried the initial code; I didn't change anything for now.)

Muennighoff commented 1 year ago

> Okay, thank you for your guide. For the fine-tuning I tried following your instructions in https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco, but I ended up with a lot of error messages when I run the training command (accelerate launch ...). (I already installed the requirements and tried the initial code; I didn't change anything for now.)

Hm, can you share the errors you get? You need to make sure to set up accelerate correctly for your machine specs via accelerate config.
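For reference, `accelerate config` writes your answers to `~/.cache/huggingface/accelerate/default_config.yaml`. A single-GPU setup might look roughly like the following; the exact field names vary with the accelerate version, so treat this as a sketch rather than a drop-in file:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'      # single process; MULTI_GPU when using several GPUs
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1            # set this to your GPU count for multi-GPU runs
use_cpu: false
```

If the answers don't match the actual hardware (e.g. `num_processes` larger than the number of GPUs), `accelerate launch` will fail in confusing ways, so this file is the first thing to check.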

faicalbounedjar commented 1 year ago

the error :

/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
 E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
 I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
 E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
  File "examples/training/nli/training_nli_v2.py", line 15, in <module>
    from sentence_transformers import models, losses, datasets
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/__init__.py", line 3, in <module>
    from .datasets import SentencesDataset, ParallelSentencesDataset
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/datasets/__init__.py", line 3, in <module>
    from .ParallelSentencesDataset import ParallelSentencesDataset
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/datasets/ParallelSentencesDataset.py", line 4, in <module>
    from .. import SentenceTransformer
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/SentenceTransformer.py", line 28, in <module>
    from .evaluation import SentenceEvaluator
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/evaluation/__init__.py", line 5, in <module>
    from .InformationRetrievalEvaluator import InformationRetrievalEvaluator
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/evaluation/InformationRetrievalEvaluator.py", line 6, in <module>
    from ..util import cos_sim, dot_score
  File "/home/sgpt/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/util.py", line 459, in <module>
    from huggingface_hub.snapshot_download import REPO_ID_SEPARATOR
ModuleNotFoundError: No module named 'huggingface_hub.snapshot_download'
Traceback (most recent call last):
  File "/home/faical/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1209, in launch_command
    simple_launcher(args)
  File "/home/faical/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 591, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'examples/training/nli/training_nli_v2.py', '--model_name', 'EleutherAI/gpt-neo-125M', '--pooling', 'mean']' returned non-zero exit status 1.

When I executed accelerate config I followed the steps you mentioned in https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco and left the yes/no prompts empty. And how do I know exactly what my machine specs are?

Muennighoff commented 1 year ago

Try pip install --upgrade huggingface-hub==0.10.1 (https://github.com/UKPLab/sentence-transformers/issues/1762)

Like how many GPUs you have, etc.; accelerate asks you for it when you run accelerate config.
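If you're unsure what hardware is visible to PyTorch, a quick way to check the GPU count that `accelerate config` asks about:

```python
import torch

# Report the specs accelerate asks about: CUDA availability and GPU count.
num_gpus = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}, GPUs: {num_gpus}")
for i in range(num_gpus):
    print(torch.cuda.get_device_name(i))
```

On the command line, `nvidia-smi` gives the same information plus driver and memory details.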

faicalbounedjar commented 1 year ago

What if I use training_nli.py instead, should that work? If not, is there a way to fine-tune without accelerate? (Because I am having the same problem as yesterday: I entered my specs and still got a lot of errors.)

faicalbounedjar commented 1 year ago

And can I run the fine-tuning in a Colab/Kaggle notebook, for better performance?