explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Is there a way to train on custom data to recognize number sequences in specific context? #4029

Closed mrxiaohe closed 5 years ago

mrxiaohe commented 5 years ago


Suppose that I have a lot of emails in which people mention certain account numbers associated with an online service. It could be something along the lines of the following (the sequence of #'s represents account numbers):

These account numbers follow certain rules that can be captured by regex (not just any sequence of 9 digits). However, many other number sequences also match the same regex. Notably, some conference call services' access codes follow the same patterns, resulting in many false positives.

So this leads me to wonder whether it is possible to use spaCy to train on custom data like the sentences shown above to identify and extract these account numbers. Crucially, I don't want the model to learn specific sequences of number combinations. That is, if I train on 2000 sentences, I don't want the model to simply learn the specific sequences of numbers present in the 2000 sentences.

Thanks!

BreakBB commented 5 years ago

I am still new to NLP, but I guess you could train an NER model that learns the context in which your account numbers appear.

ines commented 5 years ago

that is, if I train on 2000 sentences, I don't want the model to simply learn the specific sequences of numbers present in the 2000 sentences

You can try and see if this is something the model is able to learn. You probably want to start off with a blank model, so your category doesn't conflict with the other number entity types the model already predicts.

Alternatively, you might want to look into an approach that combines the statistical model's predictions with rules to make the more fine-grained distinction. See https://spacy.io/usage/rule-based-matching#models-rules for details. For instance, you could have a more generic entity recognizer to predict numbers and maybe other related noun phrases. Based on that, you then use the dependency parse and rules to analyse the context and decide whether it's an account number.
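To make the "candidates first, context rules second" idea concrete, here is a deliberately simplified sketch that uses only Python's standard library. It is not the spaCy approach from the linked docs: a regex stands in for the generic number recognizer, and a keyword window stands in for the dependency parse. The 9-digit pattern, the cue words, and the `classify_numbers` helper are all assumptions made for illustration, since the thread never gives the real regex.

```python
import re

# Hypothetical pattern: the thread never gives the real regex, so a plain
# 9-digit run stands in for it here.
CANDIDATE = re.compile(r"\b\d{9}\b")

# Hypothetical cue words; a real pipeline would rely on the parse or a model.
ACCOUNT_CUES = {"account", "acct"}
CALL_CUES = {"conference", "dial", "access", "passcode", "call"}


def classify_numbers(text, window=3):
    """Label each regex candidate using the few words that precede it."""
    results = []
    for match in CANDIDATE.finditer(text):
        # Collect the last `window` alphabetic words before the number.
        before = re.findall(r"[a-z]+", text[:match.start()].lower())[-window:]
        has_account = any(word in ACCOUNT_CUES for word in before)
        has_call = any(word in CALL_CUES for word in before)
        # Only accept candidates with an account cue and no call cue nearby.
        if has_account and not has_call:
            label = "ACCOUNT_NUMBER"
        else:
            label = "OTHER"
        results.append((match.group(), label))
    return results
```

A keyword window like this is brittle, which is the reason the suggestion above leans on a statistical model plus the parse; the sketch only shows where the contextual decision sits relative to the pattern match.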

thomasryck commented 5 years ago

If I were you, I would think about data augmentation: you could take the 2000 sentences and create more sentences by replacing the number sequences with randomly generated account numbers.
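A minimal sketch of that augmentation, assuming each annotated sentence comes with the character offsets of its account number. The 9-digit format, the `ACCOUNT_NUMBER` label, and the helper names are hypothetical. The output uses `(text, {"entities": [(start, end, label)]})` tuples, which, as far as I know, is the shape spaCy's v2 training examples used; adapt it to whatever your training code expects.

```python
import random
import re

# Hypothetical pattern standing in for the real account-number regex.
ACCOUNT_PATTERN = re.compile(r"\b\d{9}\b")
LABEL = "ACCOUNT_NUMBER"  # hypothetical label name


def random_account_number(rng):
    """Generate a random 9-digit string (assumed format)."""
    return "".join(rng.choice("0123456789") for _ in range(9))


def augment(text, span, copies, rng):
    """Create `copies` variants of `text` with the number at `span` replaced.

    `span` holds the (start, end) character offsets of the annotated number.
    Returns a list of (text, {"entities": [(start, end, label)]}) tuples.
    """
    start, end = span
    examples = []
    for _ in range(copies):
        number = random_account_number(rng)
        new_text = text[:start] + number + text[end:]
        # Recompute the end offset in case the new number differs in length.
        entity = (start, start + len(number), LABEL)
        examples.append((new_text, {"entities": [entity]}))
    return examples


# Usage: a seeded generator keeps the augmented set reproducible.
rng = random.Random(0)
sentence = "My account number is 123456789, please update it."
match = ACCOUNT_PATTERN.search(sentence)
for example in augment(sentence, match.span(), 3, rng):
    print(example)
```

Because the surrounding words stay fixed while the digits vary, the model has less opportunity to memorize particular numbers. If the random generator is meant to produce realistic values, it should follow the same rules as the real account numbers.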
