explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

NER Custom Training Model returns empty list even after 300 sentences #5825

Closed: AkshayO-95 closed this issue 4 years ago

AkshayO-95 commented 4 years ago

I am training a blank spaCy model to create a NER component that recognizes three entities. I have 300 annotated sentences. The dropout rate I am using is 0.35, and I ran the training for 500 iterations. I still get an empty list when I try to print the entities. The structure of the sentences is very simple, for example: "10 boxes of salad, eggs and bacon", with the entity labels Quantity, Unit and Food. What could be the problem? Also, is there another approach I should be taking to solve this? PS: I am following the code provided in the spaCy documentation on the website.

svlandeg commented 4 years ago

Could you try training the model on just one or two sentences, and check whether it's able to overfit on those? You would expect the loss to go to zero pretty quickly, and after training, the model should make the correct predictions on those one or two sentences. Let us know if that works.

It would also be helpful if you could paste a minimal training script with one or two data samples hardcoded in it, so we can run it on our end and check whether there's anything wrong with your annotations or code. The output of your loss throughout the iterations would be useful as well.

The other thing you can try is to run `python -m spacy debug-data` on your data to check for any potential errors.
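For reference, a minimal invocation might look like the one below. The file names are placeholders, and in spaCy v2 the command expects training and development data that have been converted to spaCy's JSON format (for example with `python -m spacy convert`):

```
python -m spacy debug-data en train.json dev.json
```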

AkshayO-95 commented 4 years ago

Something I realized is that a lot of my sentences are triggering the misaligned-entities warning. A few of my entities are bigrams, for example "CHICKEN TACOS", and there is whitespace between the two words, which might be confusing the tokenizer and causing the misalignment error. How can I solve this problem?
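One way to pin down which annotations don't line up with the tokenizer is to convert the character offsets to BILUO tags and look for `-` entries. The sketch below assumes spaCy v2's `spacy.gold.biluo_tags_from_offsets`; the text and offsets are illustrative. A multi-word entity is not a problem by itself; the warning is about character offsets that don't start and end exactly on token boundaries.

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")

# illustrative text and offsets; substitute your own annotated examples
text = "5 TRAYS OF CHICKEN TACOS"
entities = [(0, 1, "NUMBER"), (2, 7, "UNIT"), (11, 24, "FOOD")]

doc = nlp.make_doc(text)
tags = biluo_tags_from_offsets(doc, entities)
print(list(zip([t.text for t in doc], tags)))
# a '-' tag means that span does not align with token boundaries and
# would trigger the misaligned-entities warning during training
```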

Also here is the code snippet:

```python
import spacy

# SAMPLE TRAINING DATA
train_data_sample = [
    ('4 SMALL BAGS OF ASSORTED PREPARED FOOD',
     {'entities': [(0, 1, 'NUMBER'), (2, 12, 'UNIT'), (16, 38, 'FOOD')]}),
    ('5 TRAYS OF CHICKEN TACOS',
     {'entities': [(11, 24, 'FOOD'), (2, 7, 'UNIT'), (0, 1, 'NUMBER')]}),
]

# Training a blank spaCy model
nlp = spacy.blank("en")
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
nlp.begin_training()

# Add the new labels to the NER component
ner.add_label('FOOD')
ner.add_label('UNIT')
ner.add_label('NUMBER')

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# Disable pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    # Training for 10 iterations
    for iteration in range(10):
        # shuffle examples before every iteration
        random.shuffle(train_data_sample)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data_sample, size=compounding(1.0, 2.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,        # batch of texts
                annotations,  # batch of annotations
                drop=0.35,    # dropout
                sgd=optimizer,
                losses=losses,
            )
        print("Losses", losses)

# Testing the model
doc = nlp("5 TRAYS of CHICKEN TACOS")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
```

adrianeboyd commented 4 years ago

(To format the code in your post, please use code blocks with three backticks on a separate line before and after the code.)

Now that you've fixed the offsets, this seems to work as expected if you increase the number of iterations. I think 10 iterations just isn't enough when you have 3 labels and only 1 example per iteration to start with in your current batching config.
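To see why there is only one example per batch at the start, you can inspect the batch-size schedule that `compounding(1.0, 2.0, 1.001)` produces; a quick illustrative check:

```python
from spacy.util import compounding

# compounding(1.0, 2.0, 1.001) starts at 1.0 and grows by 0.1% per step,
# so the integer batch size stays at 1 for hundreds of batches
sizes = compounding(1.0, 2.0, 1.001)
print([int(next(sizes)) for _ in range(5)])  # [1, 1, 1, 1, 1]
```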

github-actions[bot] commented 4 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.