Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

Could I fine-tune this model for Chinese datasets? #41

Open asenasen123 opened 10 months ago

asenasen123 commented 10 months ago

Could you please tell me how I can fine-tune it for my custom Chinese datasets?

Muennighoff commented 10 months ago

Sure, if you want to finetune, you can follow some of what is outlined in this issue: https://github.com/Muennighoff/sgpt/issues/2

For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco which has seen lots of Chinese during pretraining & might be good enough
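
For reference, a minimal encoding sketch (this assumes the checkpoint loads directly via sentence-transformers; the README also documents a plain transformers route with weighted mean pooling):

```python
# Minimal sketch: encoding Chinese queries/documents for asymmetric search.
# Assumes the checkpoint loads via sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("bigscience/sgpt-bloom-7b1-msmarco")

queries = ["如何学习深度学习?"]            # example Chinese query
docs = ["深度学习是机器学习的一个分支。"]   # example Chinese document

query_emb = model.encode(queries)
doc_emb = model.encode(docs)

# Cosine similarity for ranking documents against the query
print(cos_sim(query_emb, doc_emb))
```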

asenasen123 commented 10 months ago

> Sure, if you want to finetune, you can follow some of what is outlined in this issue: #2
>
> For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco which has seen lots of Chinese during pretraining & might be good enough

Do many SGPT models on Hugging Face support Chinese?

asenasen123 commented 10 months ago

If I want to fine-tune the sgpt model, do I just change the dataset?

Muennighoff commented 10 months ago

I think only the BLOOM ones perform well for Chinese. Yes, you can just change the dataset.
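
A minimal sketch of what that could look like, using sentence-transformers' contrastive loss on (query, positive passage) pairs; the file name and its layout here are assumptions for illustration, and the actual SGPT training scripts differ in details (e.g. GradCache, weighted mean pooling, BitFit):

```python
# Hedged sketch, not the repo's exact training script: contrastive
# fine-tuning with sentence-transformers on a custom Chinese dataset.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bigscience/sgpt-bloom-7b1-msmarco")

# Hypothetical file with one "query<TAB>positive passage" pair per line
train_examples = []
with open("chinese_pairs.tsv", encoding="utf-8") as f:
    for line in f:
        query, positive = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[query, positive]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives: other passages in the batch serve as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="sgpt-finetuned-zh",
)
```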

asenasen123 commented 10 months ago

> I think only the BLOOM ones perform well for Chinese. Yes, you can just change the dataset.

Which Chinese dataset should I evaluate the fine-tuned model on?

Muennighoff commented 10 months ago

I would evaluate on the Chinese datasets in MTEB. If you train a Retrieval model, you can try the Chinese Retrieval datasets from C-MTEB: https://huggingface.co/spaces/mteb/leaderboard

Also see https://github.com/embeddings-benchmark/mteb/pull/134
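
For instance, a short sketch with the mteb package (the task names are examples of C-MTEB retrieval datasets; check the leaderboard for the full list):

```python
# Sketch: running Chinese retrieval tasks through MTEB, which handles
# scoring (nDCG@10 etc.) automatically. Task names are examples from
# C-MTEB; see the leaderboard for the full set.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bigscience/sgpt-bloom-7b1-msmarco")
evaluation = MTEB(tasks=["T2Retrieval", "DuRetrieval", "CovidRetrieval"])
evaluation.run(model, output_folder="results/sgpt-bloom-zh")
```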

asenasen123 commented 10 months ago

> I would evaluate on the Chinese datasets in MTEB. If you train a Retrieval model, you can try the Chinese Retrieval datasets from C-MTEB: https://huggingface.co/spaces/mteb/leaderboard
>
> Also see embeddings-benchmark/mteb#134

Are the evaluation metrics also Pearson and Spearman correlation?

Muennighoff commented 10 months ago

> Are the evaluation metrics also Pearson and Spearman correlation?

For retrieval datasets it's nDCG@10. But don't worry about the evaluation - if you use MTEB, it takes care of calculating the scores automatically.
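
In case it helps, here is roughly how nDCG@10 is computed for one query (a sketch assuming binary relevance labels; MTEB does this for you):

```python
# Hedged sketch of nDCG@10 for a single query with binary relevance labels.
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_10(ranked_relevances):
    # Normalize by the DCG of the ideal (perfectly sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True))
    return dcg_at_k(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. relevance of the top-10 retrieved documents for one query:
print(ndcg_at_10([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ~0.92
```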

asenasen123 commented 10 months ago

> > Are the evaluation metrics also Pearson and Spearman correlation?
>
> For retrieval datasets it's nDCG@10. But don't worry about the evaluation - if you use MTEB, it takes care of calculating the scores automatically.

Thank you very much!

wilfoderek commented 7 months ago

> Sure, if you want to finetune, you can follow some of what is outlined in this issue: #2
>
> For asymmetric search (e.g. retrieval), you can also try https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco which has seen lots of Chinese during pretraining & might be good enough

What about fine-tuning for Spanish?

Muennighoff commented 7 months ago

> What about fine-tuning for Spanish?

Sure, you can do that too. https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco has also seen a lot of Spanish, so it may work well for you.