explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.82k stars 4.37k forks

Tagger and parser disappear when NER is added to model #6140

Closed Nuccy90 closed 4 years ago

Nuccy90 commented 4 years ago

I trained a tagger and parser for Swedish on UD and then I wanted to add NER on top of the same model. I trained the NER model on different data from the command line as follows:

python -m spacy train sv models/tagger_parser_ner json-files/suc3_train.json json-files/suc3_dev.json --base-model models/tagger_parser/model-best --pipeline 'ner' -ne 5

When I went to load the model and test it, it only does NER (the tagger and parser are gone). In the folder for the new model I can see three folders called ner, tagger and parser, so it did take them from the previous model, but "pipeline" in meta.json says only ["ner"]. If I manually change "pipeline" to ["tagger, parser, ner"] I get this error message:

KeyError: "[E002] Can't find factory for 'tagger,parser,ner'. This usually happens when spaCy calls nlp.create_pipe with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to Language.factories['tagger,parser,ner'] or remove it from the model meta and add it via nlp.add_pipe instead."

I was looking at PR #4911 and it says "When the final model and best model are saved, disabled components are reenabled and the meta information is merged to include the full pipeline". This did not happen for me. Did I do something wrong or is this still not entirely fixed?


adrianeboyd commented 4 years ago

That does look like a bug with how the base model settings are included in the train CLI.

I think you can update the meta to get your current model working, but the pipeline needs to be a JSON list with separate component names, not a single comma-joined string like you wrote on the command line:

  "pipeline":[
    "tagger",
    "parser",
    "ner"
  ],

I think we need to revert #5751. The command-line meta is used as the base meta for the models, but it will contain changes based on the training options that shouldn't be overridden, at least not like this.

Probably we need to update the model meta from the command-line meta in a final step rather than using it as the base meta. Then if your command-line meta breaks parts of the pipeline config, etc., presumably you know how to handle it, but if you haven't provided one, you get something sensible.

adrianeboyd commented 4 years ago

No, I was wrong above. The meta just needs to be updated to reflect the reenabled pipeline components. Otherwise I think it's okay as it is.

svlandeg commented 4 years ago

I think you're right Adriane, just a simple update of meta["pipeline"] seems to do the trick.

When I first train a tagger, then a parser, following the steps from the original post, I get:

print(nlp.pipeline)

[('parser', <spacy.pipeline.pipes.DependencyParser object at 0x00000259D5224048>)]

With the fix in PR #6219, I get:

print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x00000259D67E0FD0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x00000259D55D7888>)]
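The fix essentially restores the full component list in the final meta instead of only the components trained in the current run. A rough, spaCy-free sketch of that merge logic (the function name and list shapes are illustrative, not the actual PR code):

```python
def merge_pipeline_meta(base_pipeline, trained_pipeline):
    """Combine the base model's components with the newly trained
    ones, keeping order and skipping duplicates (sketch only)."""
    merged = list(base_pipeline)
    for name in trained_pipeline:
        if name not in merged:
            merged.append(name)
    return merged

# Base model had tagger + parser; training added ner on top.
print(merge_pipeline_meta(["tagger", "parser"], ["ner"]))
# -> ['tagger', 'parser', 'ner']
```

With this merge, meta["pipeline"] reflects the reenabled components, so loading the model reconstructs all of them.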

adrianeboyd commented 4 years ago

The final meta isn't 100% right (the evaluation and labels for the disabled components go missing), but this is also basically the solution that I had, and the model should work, at least. I think it's good enough? Merging everything successfully with a provided meta would be tricky.

svlandeg commented 4 years ago

Yeah, and I think this will be addressed more properly in v3, right...

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.