Closed ioggstream closed 2 months ago
Thank you for your interest, @ioggstream,
The tool is part of a larger project I’m currently working on. At this stage, I’m just embedding the primary labels. Initially, I assumed that using alternative labels wouldn't provide significant benefits for a language model, since the tokens would be somewhat equal. However, it’s always better to test rather than rely on assumptions.
I chose hkunlp/instructor-base because it's designed with prompts in mind, generating embeddings that serve a purpose. According to its documentation, it doesn’t require fine-tuning, which is an advantage. It’s also lightweight enough to run on a CPU (though there are larger models in the same series).
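For context, Instructor-family models encode an (instruction, text) pair per input rather than raw text, which is what "designed with prompts in mind" refers to. A minimal sketch of how such inputs could be assembled — the instruction wording here is illustrative, not taken from the tool:

```python
# Sketch: building instruction-prefixed inputs for an Instructor-style model.
# The instruction string below is an illustrative assumption, not the tool's
# actual prompt. INSTRUCTOR-family models encode [instruction, text] pairs.

def build_instructor_inputs(labels,
                            instruction="Represent the ESCO skill label for retrieval:"):
    """Pair each label with the same task instruction, producing the
    [instruction, text] pairs that model.encode(...) expects."""
    return [[instruction, label] for label in labels]

pairs = build_instructor_inputs(["manage databases", "configure firewalls"])
print(pairs[0])  # -> ['Represent the ESCO skill label for retrieval:', 'manage databases']
```

Changing only the instruction string then steers the embedding toward a different task, without any fine-tuning.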
We haven't yet evaluated the tool’s performance. If it doesn’t meet our needs in terms of quality or speed (since we’re working with limited resources), we’re considering an alternative approach. This would involve creating embeddings using all-MiniLM-L6-v2, but with some fine-tuning on a job description and skill dataset to enhance performance.
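One common way to prepare such fine-tuning data is as (job description, skill) positive pairs for a contrastive objective (e.g. MultipleNegativesRankingLoss in sentence-transformers). A hypothetical sketch, assuming a simple record format that is not the project's actual schema:

```python
# Hypothetical sketch: turning a job-description/skill dataset into
# (anchor, positive) pairs for contrastive fine-tuning. The record shape
# below is an assumption, not the project's actual data format.

def make_training_pairs(records):
    """records: iterable of {"description": str, "skills": [str, ...]}.
    Returns one (description, skill) pair per annotated skill."""
    pairs = []
    for rec in records:
        for skill in rec["skills"]:
            pairs.append((rec["description"], skill))
    return pairs

data = [{"description": "Maintain CI pipelines and deploy services.",
         "skills": ["continuous integration", "deployment"]}]
print(len(make_training_pairs(data)))  # -> 2
```

Each pair would then be wrapped in the trainer's example type; with an in-batch-negatives loss, no explicit negative mining is needed.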
Have a nice day to you too!
PS: Can't wait to visit Italy again and sprinkle the parmesan!!!
> The tool is part of a larger project I’m currently working on.
Cool! Feel free to check if the esco-playground works for you. You can even try our API container https://hub.docker.com/repository/docker/ioggstream/esco-api/general
> At this stage, I’m just embedding the primary labels. Initially, I assumed that using alternative labels wouldn't provide significant benefits for a language model, since the tokens would be somewhat equal. However, it’s always better to test rather than rely on assumptions.
Our project focuses on ICT skills, and primary labels are not enough if you want to match things like products (Ansible, Oracle Database, ...). Moreover, vector search produces roughly 10% false positives (a reasonable rate), so we are trying to improve the NER algorithm.
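To illustrate the false-positive trade-off: threshold-based vector matching accepts any skill whose embedding similarity to the query exceeds a cutoff, so raising the cutoff trims false positives at the cost of recall. A toy sketch with hand-made 3-d vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_skills(query_vec, skill_vecs, threshold=0.8):
    """Return skill names whose embedding is close enough to the query.
    A higher threshold cuts false positives but also drops borderline hits."""
    return [name for name, vec in skill_vecs.items()
            if cosine(query_vec, vec) >= threshold]

# Toy 3-d "embeddings" (illustrative only).
skills = {"ansible": [0.9, 0.1, 0.0], "sql": [0.0, 1.0, 0.1]}
query = [1.0, 0.2, 0.0]
print(match_skills(query, skills, threshold=0.8))  # -> ['ansible']
```

A NER pass on top of this (as mentioned above) can then reject matches whose surface form never actually appears in the text.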
> I chose hkunlp/instructor-base because it's designed with prompts in mind, generating embeddings that serve a purpose. According to its documentation, it doesn’t require fine-tuning, which is an advantage. It’s also lightweight enough to run on a CPU (though there are larger models in the same series).
I suggest distributing the embeddings directly in the package (e.g., we do it in https://github.com/par-tec/esco-playground/blob/main/esco/esco.json.gz) so that the module is faster and more predictable.
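Shipping precomputed embeddings can be as simple as a gzipped JSON file bundled with the package, loaded at import time with no model download. The {label: vector} schema below is illustrative, not the actual format of esco.json.gz:

```python
# Sketch: bundling precomputed embeddings as a gzipped JSON file.
# The {label: vector} schema is an illustrative assumption.
import gzip
import json
import os
import tempfile

def save_embeddings(path, embeddings):
    """Write a {label: vector} mapping to a gzipped JSON file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(embeddings, fh)

def load_embeddings(path):
    """Load the precomputed embeddings back; no model needed at runtime."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)

path = os.path.join(tempfile.mkdtemp(), "skills.json.gz")
save_embeddings(path, {"manage databases": [0.1, 0.2, 0.3]})
print(load_embeddings(path)["manage databases"])  # -> [0.1, 0.2, 0.3]
```

Besides speed, this pins the exact vectors users get, so results don't drift when the upstream model is updated.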
> We haven't yet evaluated the tool’s performance. If it doesn’t meet our needs in terms of quality or speed (since we’re working with limited resources), we’re considering an alternative approach. This would involve creating embeddings using all-MiniLM-L6-v2, but with some fine-tuning on a job description and skill dataset to enhance performance.
Do you have any assessment tool for quality/speed? In our repo, you can find some tests in the form of ("text", "expected_skills") pairs.
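Such ("text", "expected_skills") cases plug naturally into a precision/recall harness. A minimal sketch with a stubbed extractor — extract_skills here is a placeholder for the real pipeline, not either project's API:

```python
def evaluate(cases, extract_skills):
    """cases: iterable of (text, expected_skills) pairs.
    Returns micro-averaged (precision, recall) over all cases."""
    tp = fp = fn = 0
    for text, expected in cases:
        predicted = set(extract_skills(text))
        expected = set(expected)
        tp += len(predicted & expected)   # correctly found skills
        fp += len(predicted - expected)   # false positives
        fn += len(expected - predicted)   # missed skills
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Stub extractor standing in for the real pipeline (assumption).
def extract_skills(text):
    return [s for s in ("ansible", "sql") if s in text.lower()]

cases = [("Automated deployments with Ansible.", ["ansible"]),
         ("Wrote SQL reports.", ["sql", "reporting"])]
print(evaluate(cases, extract_skills))  # precision 1.0, recall 2/3
```

Wrapping the same loop in a timer gives a rough throughput number for the speed side of the question.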
> PS: Can't wait to visit Italy again and sprinkle the parmesan!!!
Cool! While you can find Parmigiano everywhere, if you are into food, you should visit Bologna.
Question
Hi @KonstantinosPetrakis, your project seems interesting!
In https://github.com/par-tec/esco-playground I achieve similar goals using different approaches:
Currently I embed using all-MiniLM-L12-v2 + all the labels + rdfs:comment, and store the embeddings in the package. What are the advantages of using hkunlp/instructor-base?
Have a nice day, R.