SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard
https://arxiv.org/abs/2309.12871
MIT License

Code embeddings #84

Open rragundez opened 1 week ago

rragundez commented 1 week ago

Is there any information on whether this is also recommended for extracting embeddings from code snippets, in particular JavaScript and Solidity?

SeanLee97 commented 1 week ago

Hi @rragundez, maybe you can try WhereIsAI/UAE-Code-Large-V1. It was trained on the github-issue-similarity dataset, which contains some JavaScript code.

```python
from angle_emb import AnglE

angle = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()
angle.encode("YOUR CODE")
```

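Once snippets are encoded, they can be compared with cosine similarity for retrieval or deduplication. A minimal stdlib-only sketch (the commented lines assume `encode` returns one vector per input; the variable names are illustrative):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # dot(a, b) / (||a|| * ||b||).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With the model above (hypothetical usage):
# vec_a, vec_b = angle.encode([code_a, code_b], to_numpy=True)
# score = cosine_similarity(vec_a, vec_b)
```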
rragundez commented 1 week ago

Let me try it and I'll report the results back here.

rragundez commented 3 days ago

It did work, but the results on Solidity code are not very good. Thanks.

I am going to try an LLM trained on Solidity code, but it only has GGUF files. How would I use those in this library? For example:

https://huggingface.co/mradermacher/Solidity-Llama3-8b-GGUF

SeanLee97 commented 3 days ago

Maybe you can use its base model: https://huggingface.co/andrijdavid/Solidity-Llama3-8b

rragundez commented 3 days ago

Would this work out of the box, just passing the model name as the argument?

SeanLee97 commented 3 days ago

Yes. For LLM inference, you can check the documentation: https://angle.readthedocs.io/en/latest/notes/quickstart.html#infer-llm-based-models

Since this model hasn't been trained for sentence embedding learning, it is recommended to use a prompt to improve performance. You can specify one with `angle.encode(..., prompt="Here is a prompt: {text}.")`.
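The prompt is an ordinary Python format template with a `{text}` placeholder; assuming the library fills it with `str.format` before encoding, the string the model actually sees looks like this (the snippet is illustrative):

```python
prompt = "Here is a prompt: {text}."
snippet = "function transfer(address to, uint256 amount) public {}"

# What the embedding model would receive after the placeholder is filled in;
# braces inside the snippet are safe because only the template is formatted.
filled = prompt.format(text=snippet)
print(filled)
```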

SeanLee97 commented 3 days ago


There is no need to specify a pretrained_lora_path; just set model_name_or_path directly to andrijdavid/Solidity-Llama3-8b.
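A minimal sketch of that setup, assuming the quickstart's LLM-inference pattern (the `is_llm` flag, `'last'` pooling, and the prompt wording here are assumptions to verify against the docs; loading an 8B model needs a CUDA GPU and downloads the weights from the Hub):

```python
def embed_solidity(snippets, prompt="Represent this Solidity code: {text}"):
    # Deferred import so the sketch can be read without angle_emb installed.
    from angle_emb import AnglE

    # Assumption: decoder-only LLMs are loaded with is_llm=True and
    # last-token pooling, following the quickstart's LLM-inference section.
    angle = AnglE.from_pretrained(
        'andrijdavid/Solidity-Llama3-8b',  # base model, no pretrained_lora_path
        is_llm=True,
        pooling_strategy='last',
    ).cuda()

    # The prompt wording is illustrative; since the model was not trained
    # for embeddings, a template around each snippet tends to help.
    return angle.encode(snippets, prompt=prompt)
```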