explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

need help training spacy model #1805

Closed vtangutoori closed 6 years ago

vtangutoori commented 6 years ago

Hi, I am trying to train spaCy's named entity recognizer using a training set of my own. I have used the code provided on the official website. I am a newbie to Python and machine learning; can someone please tell me where I am going wrong?

I have taken the code below from spaCy's GitHub page, replaced the training data, and called the function as follows.

function call

main(model='en', output_dir=None, n_iter=5)

spaCy code (https://github.com/explosion/spacy/blob/master/examples/training/train_ner.py)

    """Example of training spaCy's named entity recognizer, starting off with an
    existing model or a blank model. For more details, see the documentation:
    """
    import plac
    import random
    from pathlib import Path
    import spacy

    # training data
    TRAIN_DATA = name_set


    @plac.annotations(
        model=("en", "option", "m", str),
        output_dir=("C:/Python27/Python-Data-Science-and-Machine-Learning-Bootcamp/Machine Learning Sections/my work", "option", "o", Path),
        n_iter=(5, "option", "n", int))
    def main(model=None, output_dir=None, n_iter=100):
        """Load the model, set up the pipeline and train the entity recognizer."""
        if model is not None:
            nlp = spacy.load(model)  # load existing spaCy model
            print("Loaded model '%s'" % model)
        else:
            nlp = spacy.blank('en')  # create blank Language class
            print("Created blank 'en' model")

        # create the built-in pipeline components and add them to the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if 'ner' not in nlp.pipe_names:
            ner = nlp.create_pipe('ner')
            nlp.add_pipe(ner, last=True)
        # otherwise, get it so we can add labels
        else:
            ner = nlp.get_pipe('ner')

        # add labels
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get('entities'):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
        with nlp.disable_pipes(*other_pipes):  # only train NER
            optimizer = nlp.begin_training()
            for itn in range(n_iter):
                random.shuffle(TRAIN_DATA)
                losses = {}
                for text, annotations in TRAIN_DATA:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.5,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                print(losses)

        # test the trained model
        for text, _ in TRAIN_DATA:
            doc = nlp(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

        # save model to output directory
        if output_dir is not None:
            output_dir = Path(output_dir)
            if not output_dir.exists():
                output_dir.mkdir()
            nlp.to_disk(output_dir)
            print("Saved model to", output_dir)

            # test the saved model
            print("Loading from", output_dir)
            nlp2 = spacy.load(output_dir)
            for text, _ in TRAIN_DATA:
                doc = nlp2(text)
                print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
                print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])


    if __name__ == '__main__':
        plac.call(main)

Your Environment

I am using Python 3.x and the latest version of spaCy.

ahalterman commented 6 years ago

It looks like you're using an outdated version of spaCy (1.0). Try upgrading to the most recent (2.0.5) and see if that clears it up. You'll also need to make sure that your data is in the format that spaCy expects.
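For reference, spaCy v2 expects NER training data as a list of `(text, annotations)` tuples, where each entity is given as character offsets into the text. A minimal sketch (the texts and labels here are invented for illustration, not from the thread):

```python
# spaCy v2 NER training-data format: (text, {'entities': [(start, end, label)]})
# where start/end are character offsets into the text.
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]}),
]

# Sanity-check that each offset pair really covers the intended span
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations['entities']:
        print(label, repr(text[start:end]))  # ORG 'Uber' / ORG 'Google'
```

Misaligned offsets are one of the most common reasons training silently fails, so checking the slices before training is cheap insurance.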

vtangutoori commented 6 years ago

Hi @ahalterman, thank you for the response. I updated the spaCy version and ran the training, but after 8 hours it still had not completed, so I shut the kernel down. Do you know of any resources with examples on how to tune the training process so that I can get an understanding? I am a newbie to the ML and AI fields.

sahasrara62 commented 6 years ago

Hey @vtangutoori, the training process itself is fine; what goes wrong is the testing part, where you are evaluating against the whole training set again.
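Building on that point: a quick way to avoid evaluating on the training data is to hold out a slice of the examples before training starts. A minimal sketch with made-up placeholder examples (the variable names are illustrative):

```python
import random

# Placeholder examples standing in for real (text, annotations) pairs
examples = [("sentence %d" % i, {'entities': []}) for i in range(100)]

random.seed(0)          # reproducible shuffle
random.shuffle(examples)

split = int(len(examples) * 0.8)
train_data = examples[:split]   # fed to nlp.update(...)
dev_data = examples[split:]     # held out, used only for evaluation

print(len(train_data), len(dev_data))  # 80 20
```

Evaluating on `dev_data` rather than `train_data` gives a much more honest picture of how the model handles unseen text.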

ahalterman commented 6 years ago

It's true that it's not that fast to train if you have a lot of data. To better visualize what's going on, you can change this line from

for text, annotations in TRAIN_DATA:

to

for text, annotations in tqdm(TRAIN_DATA):

to get a progress bar for each iteration. The earliest iteration is the slowest because it uses batch size 1, but at least you'll be able to see whether it's moving along and how long it may take. Don't forget to add

from tqdm import tqdm

at the top of train_ner.py

ines commented 6 years ago

To add to @ahalterman's comment: The training examples in the examples directory are mostly intended to be self-contained scripts that you can run and test quickly. They're not really optimised to work with large datasets – for example, they don't use batching.

So once you're getting "serious" about training your model, you might want to use the built-in spacy train command instead – see here for the documentation. You can find the full implementation of spacy train here: https://github.com/explosion/spaCy/blob/master/spacy/cli/train.py
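For a sense of what batching buys you: spaCy v2's `util.minibatch` and `util.compounding` helpers feed the examples in gradually growing batches instead of one at a time, which is much faster on large datasets. A rough plain-Python stand-in for those two helpers (a sketch of the idea, not the real spaCy API):

```python
def compounding(start, stop, compound):
    """Yield sizes that start small and grow geometrically toward `stop`
    (a stand-in for spaCy's util.compounding schedule)."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

def minibatch(items, size_gen):
    """Split `items` into batches whose sizes follow `size_gen`."""
    items = iter(items)
    while True:
        batch_size = int(next(size_gen))
        batch = []
        for _ in range(batch_size):
            try:
                batch.append(next(items))
            except StopIteration:
                if batch:
                    yield batch
                return
        yield batch

sizes = compounding(1.0, 32.0, 1.5)
batches = list(minibatch(list(range(10)), sizes))
print([len(b) for b in batches])  # [1, 1, 2, 3, 3]
```

The early batches are tiny (so the first updates are stable) and later batches are large (so the bulk of training is fast), which is exactly the trade-off the self-contained example script skips.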

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.