explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

EntityRecognizer throws IndexError when used in pipeline with Transformer and custom span getter #9719

Open tomateit opened 2 years ago

tomateit commented 2 years ago

EntityRecognizer throws an IndexError during training when used in a pipeline with Transformer and a custom span getter:

  File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/language.py", line 1122, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 416, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/ml/parser_model.pyx", line 293, in spacy.ml.parser_model.ParserStepModel.finish_steps
  File "spacy/ml/parser_model.pyx", line 456, in spacy.ml.parser_model.precompute_hiddens.begin_update.backward
  File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/ml/_precomputable_affine.py", line 49, in backward
    Xf = X[ids]
IndexError: index 221 is out of bounds for axis 0 with size 221

How to reproduce the behaviour

I created a custom span_getter: https://gist.github.com/tomateit/06e53b108f764e7240ea7ae8e2e830fd It adapts the number of words in each span to the corresponding number of word pieces, so that each span fits better into the transformer window. The pipeline works with this function; the exception is thrown only for some documents.
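For readers who cannot open the gist, the general idea of sizing spans by word pieces rather than by words can be sketched roughly as follows. This is not the code from the gist: the function name `pack_words_by_pieces`, the per-word piece counts, and the `max_pieces` limit are all hypothetical, and the sketch uses plain integer ranges instead of spaCy `Span` objects so that it runs without any dependencies. In an actual span getter, each `(start, end)` pair would presumably be turned into `doc[start:end]`.

```python
from typing import List, Tuple


def pack_words_by_pieces(piece_counts: List[int], max_pieces: int) -> List[Tuple[int, int]]:
    """Group consecutive words into (start, end) ranges, end exclusive.

    piece_counts[i] is the (hypothetical) number of word pieces that word i
    is split into. Each range holds at most max_pieces word pieces, except
    that a single word longer than max_pieces still gets a range of its own.
    """
    spans: List[Tuple[int, int]] = []
    start = 0  # index of the first word in the current range
    used = 0   # word pieces accumulated in the current range
    for i, n in enumerate(piece_counts):
        # Close the current range if adding word i would exceed the budget,
        # but never emit an empty range.
        if i > start and used + n > max_pieces:
            spans.append((start, i))
            start = i
            used = 0
        used += n
    # Flush the last, possibly partial, range.
    if start < len(piece_counts):
        spans.append((start, len(piece_counts)))
    return spans


# Illustrative input: six words with made-up word-piece counts.
example_counts = [1, 3, 2, 1, 4, 1]
example_spans = pack_words_by_pieces(example_counts, max_pieces=4)
print(example_spans)
```

In this sketch the ranges are contiguous and together cover every word exactly once. Checking the same property for the spans a custom getter returns may be a useful sanity check when debugging index errors, although nothing in this report confirms that span coverage is the cause here.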

I plug it into a simple transformer + NER pipeline like this: https://github.com/tomateit/natasha-spacy/blob/transformer-pipeline/project/config_trf.cfg (in my tests I disabled everything except the transformer and NER). The error is raised at this line: https://github.com/explosion/spaCy/blob/master/spacy/ml/_precomputable_affine.py#L49

Your Environment

polm commented 2 years ago

Thanks for the report and sorry it's taken us a long time to follow up on this. Unfortunately, because the issue is happening deep in the spaCy internals and your custom code isn't very simple, it's hard to be sure what's going on here.

Can you create a small example we can run to reproduce the problem? A repo like the one you linked to with a project file would be great, but that repo's project file doesn't seem to work and doesn't use Transformers anyway.

tomateit commented 2 years ago

Thanks for your reply. I reproduced the behavior based on one of the spaCy tutorials: https://github.com/tomateit/tutorial_spacy_custom_span_getter The only changes I made are:

And the error remains. P.S. The repo I linked in my first message does use a transformer config. In the project file it is run with "train_trf" rather than "train", so that both configs can be used.