Closed — tiadams closed this 2 weeks ago
Now it can be used from Hugging Face with the model ID "meta-llama/Meta-Llama-3-8B-Instruct".
We should test whether this performs better than MPNet (and also the runtime trade-off), and replace the MPNet default if it produces better results.
Access to the meta-llama/Meta-Llama-3-8B-Instruct model is restricted. Also, it is no longer labeled as "sentence similarity" but as "text generation". McGill has an alternative model, McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised, but we would need to write a new adapter, as this model does not use the SentenceTransformer module.
In general, we need a well-performing model that can map from other languages (e.g. German) to the English terminologies we are using.
It doesn't need to be Llama specifically, but we need some sort of alternative to MPNet, since it can only handle English-to-English matching.
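Since a model like LLM2Vec doesn't ship a SentenceTransformer module, any replacement would have to satisfy whatever embedding interface the pipeline codes against. A minimal sketch of what such an adapter interface could look like (all names here are hypothetical illustrations, not our actual code; the dummy backend just counts characters so the example runs without downloading a model):

```python
from typing import List, Protocol


class EmbeddingAdapter(Protocol):
    """Hypothetical interface the pipeline could target, so that
    SentenceTransformer- and LLM2Vec-backed models are interchangeable."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        ...


class DummyAdapter:
    """Toy stand-in backend: embeds each text as a 26-dim
    character-frequency vector (for illustration only)."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        import string
        return [
            [float(t.lower().count(c)) for c in string.ascii_lowercase]
            for t in texts
        ]


adapter: EmbeddingAdapter = DummyAdapter()
vecs = adapter.embed(["blood pressure", "Blutdruck"])
print(len(vecs), len(vecs[0]))  # 2 26
```

A real LLM2Vec adapter would implement the same `embed` method by wrapping the model's own encoding call, keeping the rest of the pipeline unchanged.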
I am looking into it. For now, we can test FremyCompany/BioLORD-2023-M, as it can handle semantic similarity in a multilingual context. It also does not require a new adapter.
FremyCompany/BioLORD-2023-M could not handle our CDM for an unknown reason. For the other models I found that work with our adapter, I re-ran the workflow I had for the translated BIOFIND dictionary (145 variables). I included the models we tested previously for comparison. Here are the results:
| Model | English to German | German to English | Average |
|---|---|---|---|
| text-embedding-3-large | 0.46 | 0.53 | 0.50 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 0.19 | 0.28 | 0.24 |
| sentence-transformers/distiluse-base-multilingual-cased-v1 | 0.23 | 0.23 | 0.23 |
| sentence-transformers/distiluse-base-multilingual-cased-v2 | 0.15 | 0.25 | 0.20 |
| sentence-transformers/all-mpnet-base-v2 | 0.12 | 0.19 | 0.16 |
| FremyCompany/BioLORD-2023 | 0.14 | 0.13 | 0.14 |
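For context on what these cross-lingual scores measure: matching a translated variable against the English terminology typically reduces to cosine similarity between embeddings, taking the highest-scoring terminology entry per query. A toy sketch with made-up 3-dimensional vectors (illustration only, not our actual pipeline or embeddings):

```python
import numpy as np


def best_match(query_vec: np.ndarray, term_vecs: np.ndarray) -> int:
    """Return the index of the terminology entry with the highest
    cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    return int(np.argmax(t @ q))


# Made-up embeddings for two English terminology entries.
english_terms = np.array([
    [0.9, 0.1, 0.0],   # e.g. "blood pressure"
    [0.0, 0.8, 0.2],   # e.g. "heart rate"
])

# Made-up embedding for a German query, e.g. "Blutdruck".
german_query = np.array([0.85, 0.15, 0.05])

print(best_match(german_query, english_terms))  # 0
```

A model scores well on this kind of benchmark when its German and English embeddings of the same concept land close together; an English-only model like all-mpnet-base-v2 has no such guarantee, which matches the low numbers above.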
Likewise, I tested the new models with our harmonization workflow:
| Model | Average |
|---|---|
| text-embedding-3-large | 0.77 |
| FremyCompany/BioLORD-2023 | 0.76 |
| sentence-transformers/all-mpnet-base-v2 | 0.73 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 0.72 |
| sentence-transformers/distiluse-base-multilingual-cased-v1 | 0.61 |
| sentence-transformers/distiluse-base-multilingual-cased-v2 | 0.60 |
Thanks for the research, great work. Based on this, I think we should probably switch to the text-embedding-3-large model, since it appears to perform best on both metrics.
Just saw this is OpenAI-based; the best trade-off then is probably sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, what do you think?