Open tomateit opened 2 years ago
Thanks for the report and sorry it's taken us a long time to follow up on this. Unfortunately, because the issue is happening deep in the spaCy internals and your custom code isn't very simple, it's hard to be sure what's going on here.
Can you create a small example we can run to reproduce the problem? A repo like the one you linked to with a project file would be great, but that repo's project file doesn't seem to work and doesn't use Transformers anyway.
Thanks for your reply. I reproduced the behavior based on one of the spaCy tutorials: https://github.com/tomateit/tutorial_spacy_custom_span_getter The only changes I made are:
And the error remains. P.S. The repo I linked in my first message does use a transformer config; in the project file it is run with "train_trf" rather than "train", so that both configs can be used.
EntityRecognizer throws IndexError when used in pipeline with Transformer and custom span getter during training:
How to reproduce the behaviour
I created a custom span_getter: https://gist.github.com/tomateit/06e53b108f764e7240ea7ae8e2e830fd It adapts the number of words in each span to the corresponding number of word pieces, so that spans fit better into the transformer window. The pipeline works with this function; the exception is thrown only for some documents.
I plug it into a simple transformer + NER pipeline like this: https://github.com/tomateit/natasha-spacy/blob/transformer-pipeline/project/config_trf.cfg (in my tests I disabled everything but the transformer and NER). The error is raised at this line: https://github.com/explosion/spaCy/blob/master/spacy/ml/_precomputable_affine.py#L49
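For readers who do not want to open the gist, the general idea of budgeting spans by word pieces rather than by words can be sketched roughly as below. This is not the code from the gist and not part of spaCy: `piece_budget_windows`, `piece_counts`, and `max_pieces` are hypothetical names, the greedy grouping is only one possible strategy, and wrapping the resulting token ranges into `Span` objects inside a registered span getter is left out.

```python
from typing import List, Tuple


def piece_budget_windows(piece_counts: List[int], max_pieces: int) -> List[Tuple[int, int]]:
    """Greedily group tokens into contiguous (start, end) windows.

    piece_counts[i] is the (assumed, precomputed) number of word pieces
    for token i. Each window's total stays within max_pieces, except that
    a single token longer than the budget gets a window of its own.
    """
    if max_pieces < 1:
        raise ValueError("max_pieces must be at least 1")

    windows: List[Tuple[int, int]] = []
    start = 0  # index of the first token in the current window
    used = 0   # word pieces consumed by the current window so far

    for i, n_pieces in enumerate(piece_counts):
        # Close the current window if adding this token would exceed the
        # budget, but never emit an empty window.
        if i > start and used + n_pieces > max_pieces:
            windows.append((start, i))
            start, used = i, 0
        used += n_pieces

    # Flush the trailing window, if any tokens remain.
    if start < len(piece_counts):
        windows.append((start, len(piece_counts)))
    return windows


def covers_every_token_once(windows: List[Tuple[int, int]], n_tokens: int) -> bool:
    """Check that the windows are non-empty, contiguous and cover 0..n_tokens."""
    expected_start = 0
    for start, end in windows:
        if start != expected_start or end <= start:
            return False
        expected_start = end
    return expected_start == n_tokens
```

A check such as `covers_every_token_once` can help narrow down which documents produce unusual spans (empty, overlapping, or with gaps). Whether any of those conditions is related to the IndexError reported here is not established.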
Your Environment