Open yonglin-wang opened 2 years ago
Hello @yonglin-wang, thank you for this detailed issue. We actually had the same idea back when we started SkillNer, but we put it on hold to see first how things worked with a simple hard-coded list.
But yes, as you said, it would be good to put in place a pipeline that generates the two JSON files, token_dist.json and skill_db_relax_20.json.
What's next?
What do you think?
Thank you for the prompt follow-up, @AnasAito! Yes, it sounds like an awesome plan to me. I'll keep an eye out for the MD file and start building the pipeline once it's ready.
Having generalized skill metadata sounds like a great idea too, but personally I would give it a slightly lower priority than generating the two files, just because skill type seems to be quite a common tag in the other skill lists I've seen as well.
Looking forward to the collaboration 😁
Hello @yonglin-wang, I finished the MD file, along with some code utilities that should speed up your pipeline creation. I think it will be straightforward now to generate the two files. Check this: how_new_db. Happy coding!
Hi 👋 Thanks for this great repo--I really liked how smart the tool is, especially being able to extract "Project Management" from the phrase "manage projects". I'd love to hear what you think about the following use case:
Is your feature request related to a problem? Please describe.
I am looking to use your tool with a custom skill list other than EMSI, e.g. O*NET skill lists.
Describe the solution you'd like
skill_db_relax_20.json and token_dist.json files for custom skill lists would also be much appreciated.
Describe alternatives you've considered
I have traced the code a little and found that we would probably need skill_db_relax_20.json, which seems to be generated with skills_processor/create_surf_db.py based on token_dist.json and skills_processed.json, and token_dist.json, which seems to be generated with skills_processor/create_token_dist.py based on skill_db_relax_20.json.
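The apparent circularity can be checked mechanically. Below is a minimal sketch, assuming the dependencies are exactly as I traced them above; the DEPENDS_ON mapping and the build_order helper are hypothetical names for illustration, not part of SkillNer. It uses Python's standard graphlib module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Each file maps to the files it appears to be built from.
# ASSUMPTION: these edges reflect my reading of the scripts, not documented behavior.
DEPENDS_ON = {
    "skill_db_relax_20.json": {"token_dist.json", "skills_processed.json"},
    "token_dist.json": {"skill_db_relax_20.json"},
    "skills_processed.json": set(),
}


def build_order(depends_on):
    """Return a valid generation order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


if __name__ == "__main__":
    # With the edges above there is no valid order, which is the problem described.
    print(build_order(DEPENDS_ON))
```

If token_dist.json could instead be built directly from the skill titles, the cycle would disappear and a valid order would exist.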
My questions are:
1. How is skills_processed.json generated? More specifically, what are the rules (or data sources) that determine the following fields: unique_token, match_on_stemmed?
2. skill_db_relax_20.json and token_dist.json seem to be circular: they require each other to be generated. What should be the correct order? token_dist.json could be generated first, with n_grams in this line being a list of strings of lowered, lemmatized skill titles (only if the skill title is more than one word; otherwise it's the lowered skill title without the parentheses).
Additional context
Once the two questions are resolved, I would be happy to write a modularized script that generates skills_processed.json, skill_db_relax_20.json, and token_dist.json from any given skill list/table, and create a pull request for it.
Looking forward to hearing from you 😃
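To make my guess about n_grams above concrete, here is a minimal sketch of how such a list might be built from raw skill titles. It is only an illustration of my reading, not SkillNer's actual code: build_n_grams and strip_parenthetical are hypothetical names, the lemmatizer is passed in as a plain callable (identity by default) so the sketch has no NLP dependency, and I am assuming the parenthetical part is stripped before counting words.

```python
import re
from typing import Callable, Iterable, List


def strip_parenthetical(title: str) -> str:
    """Drop any '(...)' group and collapse leftover whitespace."""
    return " ".join(re.sub(r"\([^)]*\)", " ", title).split())


def build_n_grams(
    titles: Iterable[str],
    lemmatize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Build one lowered (and, for multi-word titles, lemmatized) string per title.

    ASSUMPTION: parentheticals are removed first, then words are counted.
    `lemmatize` is a per-token callable; plug in a real lemmatizer as needed.
    """
    n_grams = []
    for title in titles:
        cleaned = strip_parenthetical(title).lower()
        tokens = cleaned.split()
        if len(tokens) > 1:
            # Multi-word title: lemmatize token by token.
            n_grams.append(" ".join(lemmatize(t) for t in tokens))
        else:
            # Single-word title: keep the lowered title as-is.
            n_grams.append(cleaned)
    return n_grams
```

A real pipeline would pass in a proper lemmatizer in place of the identity default; whether single-word titles should also be lemmatized is one of the things I would like to confirm.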