explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Spacy NER Training ner.BiluoPushDown.set_costs produces ValueError #7649

Closed: JulianGerhard21 closed this issue 3 years ago

JulianGerhard21 commented 3 years ago

How to reproduce the behaviour

I followed the instructions on your website to train my own NER model. To do so, I:

  1. Created an NER dataset in the old format: [('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]})]
  2. Converted the dataset to the new format with the following code:
    import spacy
    from spacy.tokens import DocBin
    from tqdm import tqdm

    nlp = spacy.blank("de")  # create a blank German pipeline
    db = DocBin()  # create a DocBin object

    for text, annot in tqdm(train_data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []
        for start, end, label in annot["entities"]:  # add character indexes
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        db.add(doc)

    db.to_disk("./train.spacy")  # save the docbin object
  3. Created a config (attached) via your website (German, ner, accuracy).
  4. Filled the config with the following command: python -m spacy init fill-config base_config.cfg config.cfg
  5. Started training with the following command (using the training data as dev data as well, just as an example): python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy --gpu-id 0
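
Before converting, it can help to sanity-check the character offsets in the old-format data. The sketch below is only a generic check and not a confirmed cause of the error in this issue: the helper name `find_span_problems` and the three checks it runs (offsets out of range, leading or trailing whitespace inside a span, overlapping spans) are assumptions chosen for illustration. It uses plain Python and does not call spaCy.

```python
from typing import Dict, List, Tuple

# Old-format training item: (text, {"entities": [(start, end, label), ...]})
Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def find_span_problems(data: List[Annotation]) -> List[Tuple[int, int, int, str, str]]:
    """Return (item_index, start, end, label, reason) for each suspicious span.

    Hypothetical helper for illustration; it only inspects character offsets.
    """
    problems = []
    for i, (text, annot) in enumerate(data):
        prev_end = 0  # end offset of the furthest-reaching valid span seen so far
        for start, end, label in sorted(annot.get("entities", [])):
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label, "out of range"))
                continue
            snippet = text[start:end]
            if snippet != snippet.strip():
                problems.append((i, start, end, label, "leading/trailing whitespace"))
            if start < prev_end:
                problems.append((i, start, end, label, "overlaps previous span"))
            prev_end = max(prev_end, end)
    return problems


good = [("The F15 aircraft uses a lot of fuel", {"entities": [(4, 7, "aircraft")]})]
bad = [("The F15 aircraft", {"entities": [(3, 7, "aircraft"), (4, 7, "aircraft")]})]

for problem in find_span_problems(good + bad):
    print(problem)
```

spaCy also ships a `python -m spacy debug data config.cfg` command that reports problems in the converted training data, which may be worth running before `spacy train`.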

spaCy produced the following output:

=========================== Initializing pipeline ===========================
[2021-04-03 18:02:04,316] [INFO] Set up nlp object from config
[2021-04-03 18:02:04,323] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-04-03 18:02:04,325] [INFO] Created vocabulary
[2021-04-03 18:02:04,325] [INFO] Finished initializing nlp object
[2021-04-03 18:03:05,600] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     79.22   18.02   12.59   31.70    0.18

and then the following error:

Traceback (most recent call last):
  File "C:\Users\\anaconda3\envs\gpuenv\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\\anaconda3\envs\gpuenv\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\cli\_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\cli\train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\training\loop.py", line 114, in train
    raise e
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\training\loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\training\loop.py", line 194, in train_while_improving
    nlp.update(
  File "C:\Users\\anaconda3\envs\gpuenv\lib\site-packages\spacy\language.py", line 1107, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy\pipeline\transition_parser.pyx", line 366, in spacy.pipeline.transition_parser.Parser.update
  File "spacy\pipeline\transition_parser.pyx", line 478, in spacy.pipeline.transition_parser.Parser.get_batch_loss
  File "spacy\pipeline\_parser_internals\ner.pyx", line 310, in spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs
ValueError

config.txt

Your Environment

JulianGerhard21 commented 3 years ago

Just wanted to let you know that, in addition, I tried the data conversion with:

    import spacy
    from spacy.tokens import DocBin
    from spacy.training import Example

    examples = []
    nlp = spacy.blank("de")
    for text, annots in train_data_old:
        examples.append(Example.from_dict(nlp.make_doc(text), annots))
    db = DocBin(docs=[ex.reference for ex in examples])
    db.to_disk("train.spacy")

with the same result as mentioned above.
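
Since both conversion routes end in a `DocBin`, one way to see what actually reaches the trainer is to read the entities back out of it. This is a minimal sketch, assuming spaCy v3 is installed; the helper name `entities_in_docbin` is made up for illustration, and the example builds a small in-memory `DocBin` from the sentence in the first post instead of reading the real `train.spacy`.

```python
import spacy
from spacy.tokens import DocBin


def entities_in_docbin(db, vocab):
    """Return one list of (entity_text, label) pairs per doc stored in the DocBin.

    Hypothetical helper for illustration.
    """
    return [[(ent.text, ent.label_) for ent in doc.ents] for doc in db.get_docs(vocab)]


nlp = spacy.blank("de")

# Build a one-document DocBin the same way as the conversion code above.
doc = nlp.make_doc("The F15 aircraft uses a lot of fuel")
span = doc.char_span(4, 7, label="aircraft", alignment_mode="contract")
doc.ents = [span] if span is not None else []
db = DocBin(docs=[doc])

for ents in entities_in_docbin(db, nlp.vocab):
    print(ents)
```

For a file on disk, the same helper could be called with `DocBin().from_disk("./train.spacy")` as the first argument.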

JulianGerhard21 commented 3 years ago

Never mind, I was able to resolve the problem by reading this post carefully.
