explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training NER system with language models and spacy pretrained vectors #5350

Closed KIRNESH closed 4 years ago

KIRNESH commented 4 years ago

I am training an NER system on a specific data set. Since the data is sparsely and inconsistently annotated, I tried using the "spacy pretrain" CLI command to generate vectors.
But I want to use the result when training the language model via nlp.update(). I followed the code described in this GitHub issue: https://github.com/explosion/spaCy/issues/3448. The code looks like this:

import random

import spacy
from spacy.util import minibatch

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner", config={"architecture": "simple_cnn", "exclusive_classes": True})
ner.add_label("LABEL1")
ner.add_label("LABEL2")
# Alternatively, instead of adding all your labels explicitly, you could pass all your examples
# into nlp.begin_training, like this: nlp.begin_training(get_gold_tuples=lambda: my_data)
# It's fine to add the labels and not pass in the data, though. The nlp.begin_training() method
# will work the same way, including when you have other components in your pipeline to train.
optimizer = nlp.begin_training()
# Now that we have our model, we can load in the pretrained weights.
with open(path_to_pretrained_weights, "rb") as file_:
    ner.model.tok2vec.from_bytes(file_.read())
# Now we can proceed with training

for epoch in range(nr_epoch):
    random.shuffle(train_data)
    for batch in minibatch(train_data, size=batch_size):
        X, y = zip(*batch)
        nlp.update(X, y, sgd=optimizer)
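
For reference, the snippet above relies on `random.shuffle` and on spaCy's `minibatch` helper from `spacy.util`. As a rough, dependency-free illustration of what the batching step does (my own sketch, not spaCy's actual implementation), a fixed-size batching generator could look like:

```python
import random


def minibatch_sketch(items, size):
    """Yield successive batches of at most `size` items, roughly what
    spacy.util.minibatch does when given a constant batch size."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


# Toy (text, annotations) pairs standing in for real training data.
train_data = [("text %d" % i, {"entities": []}) for i in range(10)]
random.shuffle(train_data)
batches = list(minibatch_sketch(train_data, size=4))
# 10 examples with size=4 -> batch sizes 4, 4, 2
```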

My questions are:

  1. Is the process right?
  2. I ran the spacy pretrain command with the en_core_web_md vectors, so should I start with spacy.load('en_core_web_md') rather than spacy.blank('en')?
  3. Also, for "spacy pretrain" my loss went down to around 600 over 1000 iterations. What loss value should I be aiming for?
KIRNESH commented 4 years ago

Hi, do you have any updates on this?

svlandeg commented 4 years ago

Hi @KIRNESH, apologies for the late follow-up. In general it's probably better to ask more generic questions on StackOverflow, where there is a larger community. It also helps us to keep this tracker focused on bug reports and feature requests.

In general, your process looks fine. The spaCy pretrain command will basically learn from the vectors you provided (from en_core_web_md) and use the result as the internal Tok2Vec layer. You need to start from a blank model for this (like you do in the code snippet), because you can't just change the underlying Tok2Vec layer and expect other parts of the pretrained components to still work correctly.
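
For illustration, the v2-era pretrain invocation takes a raw-text corpus, the vectors model, and an output directory (the file paths here are placeholders, not from the original thread):

```shell
# Pretrain the tok2vec layer on raw text, using the en_core_web_md vectors.
# raw_text.jsonl and ./pretrained_weights are placeholder paths.
python -m spacy pretrain raw_text.jsonl en_core_web_md ./pretrained_weights
# The output directory will contain per-epoch weight files (model0.bin, ...),
# one of which can be loaded into ner.model.tok2vec as in the snippet above.
```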

In general it's virtually impossible to say what the loss should look like - it really depends on the size of your datasets and the hyperparameters you're using. You want to monitor the loss across training iterations: if it stops decreasing significantly, the training process has hit its limits. You can definitely experiment with different hyperparameters for your model and training loops, but ideally you'd evaluate their effect on a downstream task, e.g. some NER challenge where you measure accuracy on a held-out test set. That will give you a realistic idea of how the pretraining helps (or doesn't!). See also this blog post for more background information, and this user blog post for an example.
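
To make "stops decreasing significantly" concrete, one simple hand-rolled heuristic (my own sketch, not a spaCy utility) is to stop when the best loss hasn't improved by more than a small relative threshold over the last few iterations:

```python
def has_plateaued(losses, patience=3, min_rel_improvement=0.01):
    """Return True if none of the last `patience` iterations improved the
    best-so-far loss by at least `min_rel_improvement` (relative)."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before * (1 - min_rel_improvement)


print(has_plateaued([900, 800, 700, 650, 649, 648, 648]))  # True: <1% recent gains
print(has_plateaued([900, 800, 700, 600, 500, 400, 300]))  # False: still improving
```

With a check like this you can cut pretraining short once the curve flattens, then judge the result on downstream NER accuracy rather than on the absolute loss value.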

Hope that helps!

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.