explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training custom NER model but cannot get into the training loop #12413

Closed Abe410 closed 1 year ago

Abe410 commented 1 year ago

Hi

So I am trying to train a custom NER model, and these are the steps I am following:

I have the training data with the text examples and the entity labels along with their start and end character indices.
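
For reference, my training_data is a list of dicts shaped roughly like this (illustrative values, not my real data):

training_data = [
    {
        "text": "Apple is looking at buying U.K. startup for $1 billion",
        "entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")],
    },
    # ... more examples ...
]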

Now I run the following code:

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

nlp = spacy.blank("en")  # start from a blank English pipeline
doc_bin = DocBin()

for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents 
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy") 

!python -m spacy init fill-config base_config.cfg config.cfg

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy 

The output that I should be getting is:

ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-07-01 18:31:37,021] [INFO] Set up nlp object from config
[2022-07-01 18:31:37,041] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-07-01 18:31:37,047] [INFO] Created vocabulary
[2022-07-01 18:31:40,116] [INFO] Added vectors: en_core_web_lg
[2022-07-01 18:31:43,239] [INFO] Finished initializing nlp object
[2022-07-01 18:31:45,876] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.49    0.64    0.39    0.00
  7     200        501.32   3113.23   78.43   78.12   78.74    0.78
✔ Saved pipeline to output directory
model-last

But what I get is

ℹ Saving to output directory: .
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-03-14 02:40:38,422] [INFO] Set up nlp object from config
[2023-03-14 02:40:38,441] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-03-14 02:40:38,445] [INFO] Created vocabulary
[2023-03-14 02:42:09,609] [INFO] Added vectors: en_core_web_lg

And then the cell stops executing in the Jupyter notebook. What could be going on here? I do not get any error messages or anything.

The only changes I made to the config file were setting the batch size to 80 and the training epochs to 300.

Any help would be appreciated.

Your Environment

- spaCy version: 3.5.1

kadarakos commented 1 year ago

Hey Abe410,

From this information alone it's difficult to tell what the issue could be. It would help if you could post your full config file. One thing you could already try is to limit your training and development sets to just a few samples and see whether training runs end-to-end.
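
For example, something like this (a rough sketch reusing the conversion loop from your snippet; to_docbin is just an illustrative helper name):

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")

def to_docbin(examples):
    # Same conversion as in your snippet, wrapped in a function
    doc_bin = DocBin()
    for example in examples:
        doc = nlp.make_doc(example["text"])
        ents = []
        for start, end, label in example["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                ents.append(span)
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    return doc_bin

# Only a handful of examples, just to check that training completes
to_docbin(training_data[:10]).to_disk("train.spacy")
to_docbin(training_data[10:20]).to_disk("dev.spacy")
# then point --paths.train / --paths.dev at these two files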

OthmanMohammad commented 1 year ago

Hello,

It seems that the training process is taking a long time to execute or might be stuck for some reason. Here are a few suggestions to help you debug the issue:

Ensure you have enough memory (RAM) available for the training process. The en_core_web_lg vectors can consume a significant amount of memory, and if your system is running out of memory, it can cause the training to slow down or hang. You may consider using a smaller model like en_core_web_sm or en_core_web_md for initial experiments.

Decrease the batch size, e.g. to 32 or 16, which can help with memory constraints and might make the training process faster.

Try running the training process outside the Jupyter notebook, directly from the command line. Sometimes, Jupyter notebooks can cause issues with long-running tasks or consume more memory than necessary.

Monitor the system resources (e.g., RAM, CPU usage) during the training process to see if any bottleneck is causing the issue.

You mentioned that you changed the training epochs to 300. That is likely overkill for your dataset and can cause overfitting. Try reducing the number of epochs to something like 30 or 50 and see if the training process completes successfully (the command below shows one way to set this without editing the config).
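
Most settings can also be overridden on the command line with dot-notation flags instead of editing config.cfg; for example, assuming the standard [training] section of a spaCy 3.x config:

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy --training.max_epochs 30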

Here's your conversion code again for reference (with the imports grouped at the top), plus a note on where to point the config at en_core_web_md instead of en_core_web_lg:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans

nlp = spacy.blank("en")  # load a new spacy model
doc_bin = DocBin()

for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

!python -m spacy init fill-config base_config.cfg config.cfg

# In config.cfg, switch the static vectors from en_core_web_lg to en_core_web_md,
# i.e. change the `vectors` entry that currently reads "en_core_web_lg"
# (under [initialize], or under [paths] if your config uses vectors = ${paths.vectors}) to:
#   vectors = "en_core_web_md"

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy
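
And if you end up running this from a regular terminal instead of the notebook (as suggested above), the commands are the same minus the leading !, e.g.:

python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy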