KonstantinosPetrakis / esco-skill-extractor

Extract ESCO skills and ISCO occupations from texts such as job descriptions or CVs
https://pypi.org/project/esco-skill-extractor/
MIT License

Question: differences with esco-playground #1

Closed: ioggstream closed this issue 2 months ago

ioggstream commented 2 months ago

Question

Hi @KonstantinosPetrakis, your project seems interesting!

In https://github.com/par-tec/esco-playground I pursue similar goals with a different approach:

  1. use spaCy NER to get keywords / products
  2. then use embeddings to integrate the above results.

Currently I embed using all-MiniLM-L12-v2 + all the labels + rdfs:comment and store the embeddings in the package. What are the advantages of using hkunlp/instructor-base?
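Concretely, the two steps boil down to something like this (a rough sketch only; the labels and the 0.6 threshold below are illustrative, not what we ship):

```python
import spacy
from sentence_transformers import SentenceTransformer, util

# Step 1: spaCy NER extracts candidate keywords/products from the text.
nlp = spacy.load("en_core_web_sm")
text = "We need an engineer experienced with Ansible and Oracle Database."
candidates = [ent.text for ent in nlp(text).ents]

# Step 2: embed the candidates and match them against pre-embedded labels.
model = SentenceTransformer("all-MiniLM-L12-v2")
labels = ["Ansible", "Oracle Database", "manage database systems"]  # illustrative
label_emb = model.encode(labels, convert_to_tensor=True)

for cand in candidates:
    scores = util.cos_sim(model.encode(cand, convert_to_tensor=True), label_emb)[0]
    best = int(scores.argmax())
    if scores[best] > 0.6:  # illustrative threshold
        print(f"{cand!r} -> {labels[best]!r} (score {scores[best]:.2f})")
```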

Have a nice day, R.

KonstantinosPetrakis commented 2 months ago

Thank you for your interest, @ioggstream,

The tool is part of a larger project I’m currently working on. At this stage, I’m just embedding the primary labels. Initially, I assumed that using alternative labels wouldn't provide significant benefits for a language model, since the tokens would be largely the same. However, it’s always better to test rather than rely on assumptions.
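For example, a quick way to test that assumption with sentence-transformers (the labels below are illustrative ESCO-style primary/alternative pairs, not actual project data):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "build REST APIs in Python"
primary = "develop software"          # illustrative primary label
alternative = "software development"  # illustrative alternative label

# Compare how each label variant scores against the same query.
q, p, a = model.encode([query, primary, alternative], convert_to_tensor=True)
print("primary:    ", util.cos_sim(q, p).item())
print("alternative:", util.cos_sim(q, a).item())
```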

I chose hkunlp/instructor-base because it's designed with prompts in mind, generating embeddings tailored to a specific task. According to its documentation, it doesn’t require fine-tuning, which is an advantage. It’s also lightweight enough to run on a CPU (though there are larger models in the same series).
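To illustrate the prompt-based design, a sketch using the InstructorEmbedding package (the instruction wording here is an example, not necessarily what the tool uses):

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-base")

# Each input is an [instruction, text] pair; the instruction steers the
# embedding toward a purpose (here, skill retrieval) without fine-tuning.
embeddings = model.encode([
    ["Represent the skill for retrieval:", "develop web applications"],
    ["Represent the job description for retrieving relevant skills:",
     "We are hiring a backend developer with Python experience."],
])
print(embeddings.shape)  # (2, embedding_dim)
```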

We haven't yet evaluated the tool’s performance. If it doesn’t meet our needs in terms of quality or speed (since we’re working with limited resources), we’re considering an alternative approach. This would involve creating embeddings using all-MiniLM-L6-v2, but with some fine-tuning on a job description and skill dataset to enhance performance.
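Roughly, that fallback would look like this with sentence-transformers (hypothetical training pairs and hyperparameters, just to sketch the idea):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical (job description, matching skill) pairs; a real dataset
# would come from annotated job postings.
train_examples = [
    InputExample(texts=["Backend developer, Python and SQL required.",
                        "develop software"]),
    InputExample(texts=["DevOps engineer to automate deployments.",
                        "manage ICT system deployment"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats each pair as a positive and the rest
# of the batch as negatives, a common choice for retrieval-style tasks.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```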

Have a nice day too!

PS: Can't wait to visit Italy again and sprinkle the parmesan!!!

ioggstream commented 2 months ago

> The tool is part of a larger project I’m currently working on.

Cool! Feel free to check whether esco-playground works for you. You can even try our API container: https://hub.docker.com/repository/docker/ioggstream/esco-api/general

> At this stage, I’m just embedding the primary labels. Initially, I assumed that using alternative labels wouldn't provide significant benefits for a language model, since the tokens would be largely the same. However, it’s always better to test rather than rely on assumptions.

Our project focuses on ICT skills, and primary labels are not enough if you want to match things like products (ansible, oracle database, ...). Moreover, vector search yields roughly 10% false positives (a reasonable estimate), so we are trying to improve the NER algorithm.
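For example, a rule-based pass can complement the statistical NER (a sketch with spaCy's EntityRuler; the patterns are illustrative):

```python
import spacy

# An EntityRuler with explicit product patterns can complement statistical
# NER, which often misses tool names like "ansible" written in lower case.
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "ansible"}]},
    {"label": "PRODUCT", "pattern": [{"LOWER": "oracle"}, {"LOWER": "database"}]},
])

doc = nlp("Experience with ansible and Oracle Database is required.")
print([(ent.text, ent.label_) for ent in doc.ents])
```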

> I chose hkunlp/instructor-base because it's designed with prompts in mind, generating embeddings tailored to a specific task. According to its documentation, it doesn’t require fine-tuning, which is an advantage. It’s also lightweight enough to run on a CPU (though there are larger models in the same series).

I suggest distributing the embeddings directly in the package (e.g., we do it in https://github.com/par-tec/esco-playground/blob/main/esco/esco.json.gz) so that the module is faster and more predictable.
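Something along these lines (a hypothetical file layout, not the actual esco.json.gz schema):

```python
import gzip
import json
import numpy as np

# Build once, at release time: map each skill label to its embedding.
embeddings = {"develop software": [0.1, 0.2, 0.3]}  # placeholder vectors
with gzip.open("esco_embeddings.json.gz", "wt", encoding="utf-8") as f:
    json.dump(embeddings, f)

# At import time, load the precomputed vectors instead of re-encoding:
# startup is faster and results are reproducible across environments.
with gzip.open("esco_embeddings.json.gz", "rt", encoding="utf-8") as f:
    loaded = {label: np.array(vec) for label, vec in json.load(f).items()}
print(loaded["develop software"])
```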

> We haven't yet evaluated the tool’s performance. If it doesn’t meet our needs in terms of quality or speed (since we’re working with limited resources), we’re considering an alternative approach. This would involve creating embeddings using all-MiniLM-L6-v2, but with some fine-tuning on a job description and skill dataset to enhance performance.

Do you have any assessment tool for quality/speed? In our repo you can find some tests in the form of ("text", "expected_skills") pairs.
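For example, a small harness over such cases could report both quality and speed (the extractor function and the cases below are placeholders):

```python
import time

# Hypothetical test cases in the ("text", "expected_skills") style.
CASES = [
    ("Python developer needed for backend work.", {"develop software"}),
    ("Manage our Oracle Database cluster.", {"manage database systems"}),
]

def evaluate(extract_skills):
    """Report per-case precision/recall and total wall-clock time."""
    start = time.perf_counter()
    for text, expected in CASES:
        predicted = set(extract_skills(text))
        tp = len(predicted & expected)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(expected) if expected else 0.0
        print(f"{text[:40]!r}: precision={precision:.2f} recall={recall:.2f}")
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

# evaluate(my_extractor)  # plug in any extract_skills(text) -> iterable
```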

> PS: Can't wait to visit Italy again and sprinkle the parmesan!!!

Cool! While you can find Parmigiano everywhere, if you are into food you should visit Bologna.