Closed mrxiaohe closed 5 years ago
I am still new to NLP, but I guess you could train an NER model that learns the context in which your account numbers appear.
that is, if I train on 2000 sentences, I don't want the model to simply learn the specific sequences of numbers present in the 2000 sentences
You can try and see if this is something the model is able to learn. You probably want to start off with a blank model, so your category doesn't conflict with the other number entity types the model already predicts.
Alternatively, you might want to look into an approach that combines the statistical model's predictions with rules to make the more fine-grained distinction. See https://spacy.io/usage/rule-based-matching#models-rules for details. For instance, you could have a more generic entity recognizer to predict numbers and maybe other related noun phrases. Based on that, you then use the dependency parse and rules to analyse the context and decide whether it's an account number.
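To make the "candidate first, context rules second" idea concrete, here is a rough sketch in plain Python. It is a stand-in, not spaCy code: it uses a regex instead of an entity recognizer and a window of preceding words instead of the dependency parse. The pattern `CANDIDATE_RE`, the cue word lists, and the function name `classify_candidates` are all assumptions made up for illustration; the real account-number regex is not given in this thread.

```python
import re

# Hypothetical stand-in for the real account-number regex (not given in the thread).
CANDIDATE_RE = re.compile(r"\b\d{9}\b")

# Assumed cue words; a real list would come from looking at the emails.
POSITIVE_CUES = {"account", "acct"}
NEGATIVE_CUES = {"conference", "dial", "access", "passcode", "call"}


def classify_candidates(text, window=5):
    """Label every regex candidate using the `window` words that precede it."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        # Lowercased alphabetic words before the candidate, nearest ones last.
        preceding = re.findall(r"[a-z]+", text[:match.start()].lower())[-window:]
        context = set(preceding)
        if context & NEGATIVE_CUES:
            verdict = "other"            # e.g. a conference call access code
        elif context & POSITIVE_CUES:
            verdict = "account_number"
        else:
            verdict = "unknown"          # no cue either way; leave for review
        results.append((match.group(), verdict))
    return results


email = (
    "Please check my account 123456789. "
    "To join the conference call, use access code 987654321."
)
for number, verdict in classify_candidates(email):
    print(number, verdict)
```

Negative cues are checked first here so that a sentence mentioning both an account and an access code errs on the side of not extracting. With a real pipeline the same decision could be driven by the parse (for example, which noun the number attaches to) rather than by a fixed word window.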
If I were you, I would think about data augmentation: you could take the 2000 sentences and create more sentences by replacing the number sequences with randomly generated account numbers.
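A minimal sketch of that augmentation, assuming the annotations are character-offset spans and that the output should be `(text, {"entities": [(start, end, label)]})` tuples. The label name `ACCOUNT_NUMBER`, the helper names, and the 9-digit generator are placeholders; the real generator would need to follow the actual account-number rules, which are not given in this thread.

```python
import random

LABEL = "ACCOUNT_NUMBER"  # assumed label name


def random_account_number(rng):
    """Placeholder generator: any 9 digits. Swap in the real format rules."""
    return "".join(rng.choice("0123456789") for _ in range(9))


def augment(text, spans, n_copies, rng):
    """Make n_copies of text, replacing each annotated (start, end) span
    with a fresh random number and recomputing the entity offsets."""
    examples = []
    for _ in range(n_copies):
        pieces, entities = [], []
        cursor = 0  # position in the original text
        shift = 0   # cumulative length change caused by earlier replacements
        for start, end in sorted(spans):
            new_number = random_account_number(rng)
            pieces.append(text[cursor:start])
            pieces.append(new_number)
            new_start = start + shift
            entities.append((new_start, new_start + len(new_number), LABEL))
            shift += len(new_number) - (end - start)
            cursor = end
        pieces.append(text[cursor:])
        examples.append(("".join(pieces), {"entities": entities}))
    return examples


rng = random.Random(0)  # seeded so runs are repeatable
sentence = "My account number is 123456789, please reset it."
annotated_spans = [(21, 30)]  # character offsets of the annotated number
for new_text, annotations in augment(sentence, annotated_spans, 3, rng):
    print(new_text, annotations)
```

Because only the digits change while the surrounding words stay fixed, the model sees many different numbers in the same contexts, which should push it toward learning the context rather than memorising specific sequences.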
Suppose that I have a lot of emails in which people mention certain account numbers associated with an online service. It could be something along the lines of the following (the sequence of #'s represents account numbers):
These account numbers follow certain rules that can be captured by regex (not just any sequence of 9 digits). However, there are also many other number sequences that can be captured by the same regex -- notably, some conference call services' access codes follow the same patterns, resulting in many false positives.
So this leads me to wonder if it is possible to use spaCy to train on custom data like the sentences shown above to identify and extract these account numbers. Crucially, I don't want the model to learn specific sequences of number combinations -- that is, if I train on 2000 sentences, I don't want the model to simply learn the specific sequences of numbers present in the 2000 sentences.
Thanks!