explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Questions about training, using and best practices for NER #1875

Closed danielvy closed 6 years ago

danielvy commented 6 years ago

Hi,

I see that the GitHub issues are also used for help and questions, so I hope it's OK to submit this here (I've seen similar questions on StackOverflow, but they are mostly unanswered).

I am new to spaCy and NLP/ML in general (but an experienced programmer). I've seen spaCy get a lot of praise, so I decided to give it a try. I really like the practical approach focused on solving real problems (rather than giving me tons of theoretical options I don't yet understand), and the clear API and documentation.

I’ve read through the guides and examples and while I can’t say I understand everything, I think I know enough now to try and solve the following problem:

Problem description:

Given a short piece of text (a title, sentence or short paragraph) that may (or may not) contain the name of a [known] musician (singer, band) and the name of a musical composition (usually a song), I'd like to extract the artist/song names, or an indication that the text doesn't contain any.

Data:

For training data I downloaded and cleaned/normalized the Discogs data dump; it contains several million artist names, track names and the relations between them. I also scraped Wikipedia (via the API) for hit-parade lists from several countries over the years - a few thousand artist/song pairs (these have the advantage of being a more accurate representation of the popular artists/songs I'm going to find in the input text).

I also have a few thousand texts of real data I want to analyze with NER (with more data regularly incoming).

Based on what I read, the plan is:

(using Spacy 2.0.5 with Python 3.6.4 on MacOS 10.13)

These may sound trivial to someone with an NLP background, but I wasn't able to answer these questions from the docs and articles I've read:

  1. Is this the correct approach? Anything else I can do to improve results?
  2. How much training data do I need? Is more always better, or is there such a thing as too much training data (noise)? Should I use the entire set of artist/song pairs (millions) even though in practice NER will never encounter 95% of them?
  3. Which model do I use? For NER, should I work with a blank model or with a core_web model (sm/md/lg)? I understand that this depends on the task and training data, but I couldn't find any guidelines or rules of thumb for how to select one (the code examples always take this as an option).
  4. Does the NER component depend on or benefit from the other components? Or, if I just need named entities (and not POS or dependencies), can I disable the other components?
  5. All the training examples use iterations to train the model with the same (shuffled) data. Why is this needed if the training data is the same? What are the criteria for selecting the number of iterations needed?
  6. How do I test/improve it? Just run NER on my data, manually fix/annotate the results, add them back to the training data and run the training process all over again? Or are there any best practices for this?

Sorry for the dump of questions, really appreciate any insight or feedback anyone can give.

Thanks!

cklat commented 6 years ago

If it's okay, I'd like to extend some of your questions, since I currently have an NER project in which I'm trying to summarize German company webpages with the goal of filling in some sort of template consisting of the fields 'company name', 'founding date' and 'keywords', i.e. what services the company offers. My planned approach is to transfer-learn the existing German NER model by teaching it new labels.

I have a list of ~2000 pages that I can try to crawl for some 'about-us' text; however, if this process is to be automated, a lot of unnecessary information comes with it. Currently I'm using jusText on the crawled webpages to strip out the boilerplate.

So my questions would be, as some sort of addition to the above ones:

  1. I can't have much training data, since I have to label it myself after crawling. What bothers me most is that I don't want to annotate text that gives no information about the company or isn't about the company's services. Since many pages use the same webpage template, there will be a lot of duplicated, unnecessary text. Can I provide these sentences (that I don't want annotated) to the training as some sort of 'negative sample', so the model learns not to annotate these kinds of examples when it encounters them at test time?

Moreover, is there some kind of sentence or text that the spaCy NER model cannot handle? From what I've read on the spaCy tutorial pages, I have to provide representative examples for training. From my understanding of my problem, that would mean just providing any text I receive from crawling to training in raw format, without any preprocessing. Very often that could be nothing but noise. Is there anything I'm missing here that I should or shouldn't do? Which leads me to the next question:

  2. I have tried to run some training with a little training data. As I'm rather familiar with the standard ML workflow, and with the workflow of training a CNN in computer vision, I'm expecting a similar training process in NLP, with some sort of validation set etc. Is this also possible with the spaCy NER model for transfer learning? Currently, I just see the loss output, which doesn't improve at all and instead jumps up and down within a certain range. Is there a way to provide some validation data that the model can be evaluated on after each iteration? Also, is there a way to change the learning rate, the optimizer itself and all that jazz? And what's the loss function for the NER model?

Sorry for the overwhelming amount of input at this point, but I hope you guys can give the two of us some good feedback regarding our questions.

Cheers!

hanumkl commented 6 years ago

Currently, I'm training a spaCy NER model to predict locations in Bahasa Indonesia.

Related to danielvy's question number 6: according to this link https://spacy.io/api/cli#evaluate, in spaCy v2.0 we can use the command line helpers to evaluate a model's accuracy and speed on JSON-formatted annotated data, and it will visualize the predictions from the trained model. But from what I have tried when training NER from the spaCy documentation, the training data annotations use a Python list of tuples instead of the JSON format, like:

TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

I've checked that in spaCy v1.x there's an evaluate function in train_ner_standalone.py for testing validation data in the same annotation format as the training data above. Is there any way to use this function in spaCy v2.0? Or, is there a better way to evaluate the model without converting the validation data to the JSON-formatted annotated data?

    def evaluate(self, examples):
        scorer = Scorer()
        for input_, annot in examples:
            gold = self.make_gold(input_, annot)   # build the gold-standard parse from the annotations
            doc = self(input_)                     # run the pipeline on the raw text
            scorer.score(doc, gold)                # accumulate precision/recall/F-score
        return scorer.scores
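
For what it's worth, this is roughly what I have in mind for v2.0, adapted to the tuple format above - just a sketch on my end (the standalone evaluate() helper is my own adaptation, not an official API), assuming Scorer and GoldParse still work as in the example scripts:

    # Rough standalone adaptation for spaCy v2.0, using the tuple-style
    # annotations shown above (a sketch, not an official API).
    import spacy
    from spacy.gold import GoldParse
    from spacy.scorer import Scorer

    def evaluate(nlp, examples):
        scorer = Scorer()
        for text, annot in examples:
            gold_doc = nlp.make_doc(text)            # tokenize only, no predictions
            gold = GoldParse(gold_doc, entities=annot['entities'])
            pred_doc = nlp(text)                     # full pipeline prediction
            scorer.score(pred_doc, gold)             # accumulate precision/recall/F
        return scorer.scores                         # e.g. ents_p, ents_r, ents_f

    # e.g. print(evaluate(spacy.load('en_core_web_sm'), TRAIN_DATA)['ents_f'])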

Thank you so much for your help :)

jtoghrul commented 6 years ago

@danielvy thank you for the questions. These are the questions I am looking for answers to as well.

Any help is very much appreciated.

JeanClaude3 commented 6 years ago

Yes, I'm wondering this as well. I can even arrange for labeled training data, but I have no idea whether the data I'm inputting or the hyperparameters I'm setting make any difference to the accuracy of the model.

At the moment, I only see the loss, which doesn't really change at all. Is there an NER example that includes calculations like accuracy, so I can tell whether it's improving?

honnibal commented 6 years ago

Sorry for the long delay in replying to this. Mostly, there are no hard-and-fast answers to these questions, but I'll do my best:

  1. Is this the correct approach? Anything else I can do to improve results?

It seems reasonable, but there are always a number of different options to try.

  2. How much training data do I need? Is more always better, or is there such a thing as too much training data (noise)? Should I use the entire set of artist/song pairs (millions) even though in practice NER will never encounter 95% of them?

On well-behaved problems you can usually expect a log-linear relationship between data and accuracy. However, there's nothing to stop you from collecting a dataset for an ML problem where accuracy doesn't rise above chance. Similarly you could collect a dataset with a simple linear relationship between the features and the label. In this case you might reach your peak accuracy very quickly. So, there's no really reliable way to guess how much data you'll need.

  3. Which model do I use? For NER, should I work with a blank model or with a core_web model (sm/md/lg)? I understand that this depends on the task and training data, but I couldn't find any guidelines or rules of thumb for how to select one (the code examples always take this as an option).

If you have very few examples (e.g. under a few thousand) it might be good to start from the en_core_web_lg model. Otherwise you should probably start with a blank model with word vectors. Ultimately you can try both and see what works best.
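
Roughly, the two starting points look like this (just a sketch - the package names here are the standard English models and only stand in for whatever language and vectors you actually use):

    import spacy

    # Option 1: only a few thousand examples -- fine-tune the pretrained NER
    nlp = spacy.load('en_core_web_lg')

    # Option 2: plenty of data -- start from a blank English model instead
    nlp_blank = spacy.blank('en')
    # (if you have word vectors, they can be added to the blank vocab with
    # nlp_blank.vocab.set_vector(word, vector) before training)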

  4. Does the NER component depend on or benefit from the other components? Or, if I just need named entities (and not POS or dependencies), can I disable the other components?

You can disable the other components.
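
For example (a minimal sketch with one of the pretrained English models):

    import spacy

    # Load only what you need: drop the tagger and parser, keep the NER
    nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])
    doc = nlp("I like London and Berlin.")
    print([(ent.text, ent.label_) for ent in doc.ents])

    # During training, the same effect via a context manager:
    # with nlp.disable_pipes('tagger', 'parser'):
    #     ...  # only the remaining pipes (here: ner) are updated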

  5. All the training examples use iterations to train the model with the same (shuffled) data. Why is this needed if the training data is the same? What are the criteria for selecting the number of iterations needed?

You can stop iterating when the accuracy stops improving. Multiple passes are pretty fundamental to stochastic gradient descent: we're always making an approximation of the gradient over the whole dataset and taking a small step in that direction. After an update on an example, we're not guaranteed to get it right if we immediately try again. So, after only one pass, the model will be far from convergence.
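
In code, the loop from the example scripts looks roughly like this (a sketch only - the iteration count, batch size and dropout are arbitrary, and TRAIN_DATA is the tuple-format data shown earlier in the thread):

    import random
    import spacy
    from spacy.util import minibatch

    nlp = spacy.blank('en')
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
    for _, annots in TRAIN_DATA:                      # register the entity labels
        for start, end, label in annots['entities']:
            ner.add_label(label)

    optimizer = nlp.begin_training()
    for i in range(20):                               # multiple passes over the data
        random.shuffle(TRAIN_DATA)                    # new order each pass
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print(i, losses)
        # evaluate on held-out data here and stop once the score plateaus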

  6. How do I test/improve it? Just run NER on my data, manually fix/annotate the results, add them back to the training data and run the training process all over again? Or are there any best practices for this?

Yes. This is the workflow we have set up in Prodigy -- it works quite well.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.