Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

How to fine-tune on my datasets #2

shaileshj2803 opened this issue 2 years ago

shaileshj2803 commented 2 years ago

Can you please share how I can fine-tune on my custom domain datasets?

Muennighoff commented 2 years ago

Sure, the two files this repository uses for fine-tuning are:

Symmetric Search on NLI: https://github.com/Muennighoff/sgpt/blob/main/biencoder/nli_msmarco/sentence-transformers/examples/training/nli/training_nli_v2.py

Asymmetric Search on MSMARCO: https://github.com/Muennighoff/sgpt/blob/main/biencoder/nli_msmarco/sentence-transformers/examples/training/ms_marco/train_bi-encoder_mnrl.py

I would copy one of them and replace the dataset loaded in the file with your custom domain dataset.

If your dataset is very big, I'd recommend fine-tuning a pre-trained GPT model like in the code. If it's very small, I would recommend using one of the trained SGPT models and fine-tuning it further.
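For concreteness, here is a minimal sketch of what that swap could look like using the plain sentence-transformers API (the scripts above use this repo's own fork with the custom pooling, so treat this as an outline rather than the exact training setup; the file name, column layout, checkpoint, and hyperparameters are placeholders):

```python
import csv

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Any SGPT checkpoint could go here; pick per the size advice above.
model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")

# Hypothetical format: one example per row, tab-separated:
# anchor \t positive \t (optional) hard negative
train_examples = []
with open("custom_train.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        train_examples.append(InputExample(texts=row))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# MultipleNegativesRankingLoss is the loss both linked scripts use.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="output/sgpt-custom",
)
```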

marouaghaouat commented 2 years ago

If the dataset is very big, which pre-trained GPT should we use? And once we fine-tune it, we'll have to create the SGPT model from the fine-tuned GPT, right?

Muennighoff commented 2 years ago

1. The larger the better. The largest one used in the codebase is https://huggingface.co/EleutherAI/gpt-j-6B; it will have 5.8B parameters after fine-tuning.
2. For fine-tuning, we just remove the language modelling head, add position-weighted mean pooling & optionally use BitFit. The fine-tuned model can then directly be used to produce embeddings for your use case.
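As a rough illustration of the pooling in step 2, here is a sketch following the pattern in this repository's README (tensor names and shapes are assumptions):

```python
import torch

def position_weighted_mean_pooling(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    # Weight token i by its position (1-indexed), so later tokens, which have
    # seen more context in a causal model, contribute more to the embedding.
    weights = (
        torch.arange(1, last_hidden_state.shape[1] + 1,
                     device=last_hidden_state.device)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )
    # Zero out padding tokens so they contribute nothing to the average.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = torch.sum(last_hidden_state * mask * weights, dim=1)
    norm = torch.sum(mask * weights, dim=1)
    return summed / norm  # (batch, hidden)
```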

marouaghaouat commented 2 years ago

Thank you for your fast reply. Does it cause a problem if the dataset only has positive examples and doesn't have negative examples?

Muennighoff commented 2 years ago

Sorry for the late reply. Yes, performance is expected to decrease without negative examples. How much depends on your data, but you can try running the NLI script with and without negatives to get a feeling for how much worse it would be.
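To make the pairs-only case concrete: MultipleNegativesRankingLoss, which both scripts above use, treats the other positives in a batch as negatives, so training still works with plain (query, positive) pairs; curated hard negatives just sharpen the contrast. A hypothetical sketch:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")

# Positive pairs only, no explicit negatives: for each query, the positives
# of the *other* pairs in the batch serve as in-batch negatives.
pairs = [
    InputExample(texts=["how do I reset my password",
                        "Click 'Forgot password' on the login page."]),
    InputExample(texts=["what are your opening hours",
                        "We are open 9am-6pm on weekdays."]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```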

rajarajanvakil commented 2 years ago

Hi, I read your paper; it's cool. I am trying to do this on my own dataset, and my dataset is huge. Can you please tell me the exact steps to train SGPT from scratch, both symmetric and asymmetric, for both encoder settings? The cross-encoder would be our main interest.

rajarajanvakil commented 2 years ago

I have one doubt: are you using BERT to produce the cross-encoder and bi-encoder embeddings? My understanding is that you use BERT as an initial pipeline before feeding things to GPT to produce the cosine similarities and log probabilities. Please help.

asenasen123 commented 1 year ago

Could you share your contact information? I have some questions I'd like to ask.