kevinrosenberg21 closed this issue 6 years ago
@kevinrosenberg21 How do you segment the resume? Are you splitting the paragraphs using a statistical model or a keyword-based approach?
@damianoporta I'm splitting the resume into K semantically-related segments by using a word2vec trained on our domain. Depending on K and how many sections there are in that resume, the segmentation matches the sections.
Thanks @kevinrosenberg21. At the moment I do not think spaCy is able to handle custom NER features (maybe via Cython). I think adding this kind of feature could help increase accuracy. If I have understood you correctly, you trained a word2vec model to get a cloud of keywords identifying the words most commonly used in a job experience section, right? And the same for personal data, hobbies, etc.?
@damianoporta No, it's completely unsupervised. The segmentation algorithm doesn't know that it's splitting a resume into its sections; it just does its thing using a custom-trained word2vec model, and the resulting segments happen to match the sections if you set the number of segments right.
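For readers wondering what an unsupervised segmentation of this kind might look like: the thread does not describe the actual algorithm, so the following is only a minimal sketch of one common approach, not the method used above. It assumes each sentence (or line) of the resume has already been turned into a vector, for example by averaging its word2vec vectors, and it splits the sequence into K contiguous segments that minimize the within-segment squared distance to the segment mean. The names `segment_cost` and `segment_into_k` and the toy 2-dimensional vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def segment_cost(vectors: Sequence[Vector], start: int, end: int) -> float:
    """Sum of squared distances to the mean for vectors[start:end]."""
    chunk = vectors[start:end]
    dim = len(chunk[0])
    mean = [sum(v[d] for v in chunk) / len(chunk) for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in chunk for d in range(dim))


def segment_into_k(vectors: Sequence[Vector], k: int) -> List[Tuple[int, int]]:
    """Split vectors into k contiguous segments via dynamic programming.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of vectors")
    inf = float("inf")
    # best[j][i]: minimal cost of splitting the first i vectors into j segments
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[j][i]: start index of the last segment in that optimal split
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                if best[j - 1][s] == inf:
                    continue
                cost = best[j - 1][s] + segment_cost(vectors, s, i)
                if cost < best[j][i]:
                    best[j][i] = cost
                    back[j][i] = s
    # Walk the back-pointers from the end to recover the boundaries.
    bounds = []
    i = n
    for j in range(k, 0, -1):
        s = back[j][i]
        bounds.append((s, i))
        i = s
    bounds.reverse()
    return bounds


# Toy stand-ins for averaged sentence embeddings (assumption, not real data).
sentence_vectors = [
    [1.0, 0.0], [0.9, 0.1],              # e.g. personal data
    [0.0, 1.0], [0.1, 0.9], [0.1, 1.0],  # e.g. work experience
    [-1.0, 0.0], [-0.9, -0.1],           # e.g. education
]
segments = segment_into_k(sentence_vectors, 3)
print(segments)  # [(0, 2), (2, 5), (5, 7)]
```

As noted above, nothing here knows about resume sections: the boundaries only line up with sections if K is chosen well and the embeddings separate the sections' vocabulary.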
@kevinrosenberg21 Is it possible to provide more details on how you implemented this? I can understand the word2vec part, but not the segmentation algorithm (I am very new to NLP).
The answer is "it's complicated", unfortunately. Damiano is right that it's currently not easy to add features to the NER model. I've been thinking about how to improve that.
I answered a similar question on the Prodigy support forum here: https://support.prodi.gy/t/incorporating-custom-position-feature-into-ner/160
I also added an example of using multi-task learning to try to incorporate this type of knowledge: https://github.com/explosion/spaCy/blob/master/examples/training/ner_multitask_objective.py
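Since adding features to the NER model itself is not easy, one workaround is to apply the section knowledge after NER has run, as a post-processing step. The following is a rough sketch of that idea under stated assumptions, not something spaCy provides and not what the linked examples do: entities and sections are plain `(start_token, end_token, name)` tuples, the `EDU` label and the section names are hypothetical, and the override table is a hard rule rather than a learned feature, so it can be wrong where the NER model was right.

```python
from typing import Dict, List, Optional, Tuple

Entity = Tuple[int, int, str]   # (start_token, end_token, label), end exclusive
Section = Tuple[int, int, str]  # (start_token, end_token, section_name)

# Hypothetical rules: inside a given section, rewrite one label to another.
SECTION_LABEL_OVERRIDES: Dict[str, Dict[str, str]] = {
    "work_experience": {"EDU": "ORG"},
    "education": {"ORG": "EDU"},
}


def section_of(token_index: int, sections: List[Section]) -> Optional[str]:
    """Return the name of the section containing token_index, if any."""
    for start, end, name in sections:
        if start <= token_index < end:
            return name
    return None


def relabel_by_section(
    entities: List[Entity],
    sections: List[Section],
    overrides: Dict[str, Dict[str, str]] = SECTION_LABEL_OVERRIDES,
) -> List[Entity]:
    """Adjust entity labels based on the section each entity starts in."""
    result = []
    for start, end, label in entities:
        section = section_of(start, sections)
        # Fall back to the original label when no rule applies.
        new_label = overrides.get(section, {}).get(label, label)
        result.append((start, end, new_label))
    return result


# Toy input (assumption): token offsets as produced by a segmenter and an NER step.
sections = [(0, 10, "work_experience"), (10, 20, "education")]
entities = [(2, 4, "EDU"), (12, 15, "ORG"), (16, 17, "PER")]
relabeled = relabel_by_section(entities, sections)
print(relabeled)  # [(2, 4, 'ORG'), (12, 15, 'EDU'), (16, 17, 'PER')]
```

The same offsets could come from spaCy's `ent.start`, `ent.end`, and `ent.label_`, with the segmenter supplying the section spans.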
@kevinrosenberg21 cc @loremaps I'm in the same boat as loremaps. Would you be willing to give a high-level overview of your implementation?
Hi, everyone.
I'm on an Ubuntu machine with Python 3.5.2 and spaCy 2.0. I'm training a blank Spanish model to recognize entities in resumes. For that I'm using custom word embeddings, and I'm running a large entity annotation project. Using the word embeddings, I was able to segment a resume and work out which section each segment belongs to, and I want to use that knowledge to augment spaCy's NER (for example, an entity in the work experience section is more likely to be an organization than an educational institution). Looking through the documentation, I saw that there's a way to add custom attributes and/or compute them using pipelines and extensions, but I couldn't tell whether the NER algorithm will use them as features by default or whether I need to add custom code for that.
Thank you, and regards.