Sure, the two files this repository uses for finetuning are:
Symmetric Search on NLI: https://github.com/Muennighoff/sgpt/blob/main/biencoder/nli_msmarco/sentence-transformers/examples/training/nli/training_nli_v2.py
Asymmetric Search on MSMARCO: https://github.com/Muennighoff/sgpt/blob/main/biencoder/nli_msmarco/sentence-transformers/examples/training/ms_marco/train_bi-encoder_mnrl.py
I would copy one of them and replace the dataset loaded in the file with your custom domain dataset.
If your dataset is very big, I'd recommend fine-tuning a pre-trained GPT model like in the code. If it's very small, I would recommend using one of the trained SGPT models and fine-tuning it further.
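As a minimal sketch of the "replace the dataset loaded in the file" step (not taken from the repo itself): read a hypothetical TSV of positive pairs from your domain and turn it into sentence-transformers `InputExample`s, then keep the rest of the linked training script (model, loss, `model.fit`) unchanged. The file name `my_domain_pairs.tsv` and its two-column layout are assumptions for illustration.

```python
# Sketch: load a custom-domain dataset of (query, positive_passage) pairs
# as a drop-in replacement for the NLI/MSMARCO loading code in the scripts above.
import csv
from sentence_transformers import InputExample
from torch.utils.data import DataLoader

train_examples = []
# Hypothetical file: one "query \t positive_passage" pair per line.
with open("my_domain_pairs.tsv", encoding="utf-8") as f:
    for query, positive in csv.reader(f, delimiter="\t"):
        train_examples.append(InputExample(texts=[query, positive]))

# The model, loss, and model.fit(...) call stay as in
# training_nli_v2.py / train_bi-encoder_mnrl.py.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
```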
If the dataset is very big, which pre-trained GPT should we use? And once we fine-tune it, we'll have to create the SGPT model from the fine-tuned GPT, right?
1) The larger the better. The largest one used in the codebase is https://huggingface.co/EleutherAI/gpt-j-6B. It will have 5.8B parameters after fine-tuning.
2) For fine-tuning, we just remove the language modelling head, add position-weighted mean pooling & optionally use BitFit. The fine-tuned model can then be used directly to produce embeddings for your use case.
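A hedged sketch of those two modifications in plain PyTorch/transformers, not the repo's actual code: loading the base model without the LM head, freezing everything except bias terms (BitFit), and pooling the last hidden states with weights that grow linearly with position. Variable names and the pooling helper are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # a smaller GPT works for experimentation
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT tokenizers have no pad token by default
model = AutoModel.from_pretrained(model_name)  # AutoModel loads the base model without the LM head

# BitFit: train only the bias parameters, freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

def weighted_mean_pool(last_hidden_state, attention_mask):
    # Position-weighted mean: weights increase linearly, so later tokens count more;
    # padded positions are masked out before normalizing.
    weights = torch.arange(1, last_hidden_state.size(1) + 1, device=last_hidden_state.device)
    weights = weights.unsqueeze(0).unsqueeze(-1).float() * attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * weights).sum(dim=1) / weights.sum(dim=1)

batch = tokenizer(["an example sentence", "another one"], return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch)
embeddings = weighted_mean_pool(out.last_hidden_state, batch["attention_mask"])
```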
Thank you for your fast reply. Does it cause a problem if the dataset only has positive examples and doesn't have negative examples?
Sorry for the late reply. Yes, performance is expected to decrease without negative examples. How much will depend on your data, but you can try running the NLI script with and without negatives to get a feeling for how much worse it would be.
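For the positive-only case, here is a minimal sketch using sentence-transformers: the MultipleNegativesRankingLoss used in the linked scripts treats the other positives in a batch as negatives, so training can run without explicit hard negatives (though adding them, as in the MSMARCO script, usually helps). The model name and example pairs below are placeholders for illustration.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model for illustration

# Positive pairs only; in-batch examples serve as negatives for each other.
train_examples = [
    InputExample(texts=["how to reset my router", "Unplug the router for 30 seconds ..."]),
    InputExample(texts=["python read csv", "Use the csv module or pandas.read_csv ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```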
Hi, I read your paper and it's cool. I am trying to do this on my own dataset, and my dataset is huge. Can you please tell me the exact way to train from scratch to achieve SGPT, both symmetric and asymmetric, for both encoder setups? The Cross-Encoder would be our main interest.
I have one doubt: are you using BERT to produce the Cross-Encoder and Bi-Encoder embeddings? In my understanding, you are using BERT as an initial pipeline before feeding it to GPT to produce the cosine similarities and log probabilities. Please help.
Can you please share how I can fine-tune for my custom domain datasets?
Could you share your contact information? I have some questions I'd like to ask.