explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Adding domain knowledge (custom features) to NER #1827

Closed · kevinrosenberg21 closed this issue 6 years ago

kevinrosenberg21 commented 6 years ago

Hi, everyone.

I'm on an Ubuntu machine with Python 3.5.2 and spaCy 2.0. I'm training a blank Spanish model to recognize entities in resumes. For that I'm using custom word embeddings, and I'm running a large entity annotation project. Using the word embeddings, I was able to segment a resume and determine which section of the resume each segment belongs to, and I want to use that knowledge to augment spaCy's NER (for example, if an entity appears in the work experience section, it's more likely to be an organization than an educational institution). Looking through the documentation, I saw that there's a way to add custom attributes and/or compute them with pipeline components and extensions, but I couldn't tell whether the NER algorithm will use them as features by default or whether I need to add custom code for that.

Thank you, and regards.
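For reference, custom attributes in spaCy 2.0 are registered with `set_extension` and accessed through the `._` namespace. The snippet below is a minimal sketch with a hypothetical `section` attribute written by a toy pipeline component; as discussed further down the thread, the statistical NER model does not consume such attributes as features out of the box.

```python
import spacy
from spacy.tokens import Token

# Register a custom attribute under the ._ namespace (hypothetical name "section").
Token.set_extension("section", default=None)

def section_component(doc):
    # Toy component: attach a section label to every token. In practice the
    # label would come from the word2vec-based segmentation described above.
    for token in doc:
        token._.section = "work_experience"  # placeholder value
    return doc

nlp = spacy.blank("es")
nlp.add_pipe(section_component, name="section_tagger", last=True)

doc = nlp("Trabajé en Acme S.A. durante tres años.")
print([(t.text, t._.section) for t in doc])
```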

damianoporta commented 6 years ago

@kevinrosenberg21 How do you segment the resume? Are you splitting the paragraphs using a statistical model or a keyword-based approach?

kevinrosenberg21 commented 6 years ago

@damianoporta I'm splitting the resume into K semantically related segments using a word2vec model trained on our domain. Depending on K and how many sections the resume has, the segmentation matches the sections.

damianoporta commented 6 years ago

Thanks @kevinrosenberg21. At the moment I don't think spaCy is able to handle custom NER features (maybe via Cython). I think adding this kind of feature could help increase accuracy. If I've understood you correctly, you've trained a w2v model to get a cloud of keywords that identifies the most common words used inside a job experience section, right? And the same for personal data, hobbies, etc.?

kevinrosenberg21 commented 6 years ago

@damianoporta No, it's completely unsupervised. The segmentation algorithm doesn't know that it's splitting a resume into its sections; it just groups the text using a custom-trained word2vec model, and if you set the number of segments right, the result happens to match the sections.
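One possible shape for that kind of unsupervised segmentation (a sketch only, not necessarily what kevinrosenberg21 built) is to embed each sentence as the average of its word2vec vectors and cluster the embeddings into K groups. The model path and the use of gensim/scikit-learn here are assumptions:

```python
import numpy as np
from gensim.models import KeyedVectors   # assumed: a domain word2vec model trained separately
from sklearn.cluster import KMeans

# Hypothetical path to the custom-trained embeddings.
w2v = KeyedVectors.load_word2vec_format("resume_w2v.bin", binary=True)

def embed(sentence):
    # Average the vectors of the in-vocabulary tokens.
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def segment(sentences, k):
    # Cluster sentence embeddings into k groups; with a well-chosen k the
    # clusters tend to line up with the resume's sections. Note that this
    # simple clustering ignores sentence order, so segments need not be contiguous.
    X = np.vstack([embed(s) for s in sentences])
    return KMeans(n_clusters=k, random_state=0).fit_predict(X)
```

Picking K per resume is the tricky part, which matches the "if you set the number of segments right" caveat above.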

loremaps commented 6 years ago

@kevinrosenberg21 Is it possible to provide more details on how you implemented this? I understand the word2vec part, but not the segmentation algorithm (I'm very new to NLP).

honnibal commented 6 years ago

The answer is "it's complicated", unfortunately --- Damiano is right that it's currently not easy to add features to the NER model. I've been thinking about how to improve that.

I answered a similar question on the Prodigy support forum here: https://support.prodi.gy/t/incorporating-custom-position-feature-into-ner/160

I also added an example of using multi-task learning to try to incorporate this type of knowledge: https://github.com/explosion/spaCy/blob/master/examples/training/ner_multitask_objective.py
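The core idea of the linked example is to train an auxiliary prediction head alongside the entity recognizer, rather than feeding extra features into it. A minimal sketch, assuming spaCy 2.x's `add_multitask_objective` hook and the callback signature used in that example; the labels here are hypothetical stand-ins for a section-based objective:

```python
import spacy

def get_section_label(i, words, tags, heads, labels, ents):
    # Auxiliary objective: return a label for token i. The linked example
    # buckets tokens by document position; a resume-specific variant could
    # instead return the section the token falls in.
    if i == 0:
        return "first-word"
    elif i < len(words) // 2:
        return "first-half"
    return "second-half"

nlp = spacy.blank("es")
ner = nlp.create_pipe("ner")
# The auxiliary head shares token representations with the NER model, so the
# extra signal shapes the shared layers during training.
ner.add_multitask_objective(get_section_label)
nlp.add_pipe(ner)
# Training would then proceed with nlp.begin_training() / nlp.update(...)
# as in the linked example.
```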

matthewchung74 commented 6 years ago

@kevinrosenberg21 cc @loremaps I'm kind of in the same boat as loremaps. Would you be willing to give a high-level overview of your implementation?

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.