schudoku opened 1 year ago
Hello @schudoku, the issue comes from the fact that spaCy tokens and the output of HuggingFace's tokenizer do not align exactly. In particular, \r\n...\r\n and contactcontact...contact are treated as single tokens in spaCy, but will be split by HuggingFace (although the exact split will depend on the tokenizer used). In your extreme case, the document spans 9 tokens in spaCy, while the HuggingFace tokenizer outputs 4011 tokens (!).
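To see the mismatch concretely, you can count both kinds of tokens on a document with one huge repeated "word". The following is an illustrative sketch only; the example text and tokenizer name are assumptions, not the exact document or model from this issue:

```python
# Compare spaCy token counts with HuggingFace wordpiece counts.
# The example text and the tokenizer name are assumptions for illustration.
import spacy
from transformers import AutoTokenizer

text = "Hello there. " + "contact" * 500 + " Regards, Daniel."

nlp = spacy.blank("en")
print("spaCy tokens:", len(nlp(text)))  # the repeated word stays a single spaCy token

hf_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print("wordpieces:  ", len(hf_tokenizer(text)["input_ids"]))  # split into hundreds of wordpieces
```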
To cope with this, spacy-transformers uses a truncation mechanism defined here. It takes a radical approach: drop wordpieces from the end. In your case, the last tokens of the document won't be fed to the model, hence the inconsistent result.
We mitigate this risk by using a low window size in the span getter. To wit, the en_core_web_trf configuration is:
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
In most cases, the 128-token window is sufficient to avoid the kind of issues you ran into.
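For intuition, the window/stride splitting can be approximated in a few lines of plain Python. This is a simplified illustration of the idea, not the actual spacy-transformers implementation:

```python
# Simplified illustration of a window/stride span getter: slice a document of
# doc_len spaCy tokens into overlapping spans. Not the actual spacy-transformers code.
def strided_spans(doc_len: int, window: int = 128, stride: int = 96):
    spans = []
    start = 0
    while start < doc_len:
        spans.append((start, min(start + window, doc_len)))
        if start + window >= doc_len:
            break
        start += stride
    return spans

print(strided_spans(300))  # [(0, 128), (96, 224), (192, 300)]
```

With window=128 and stride=96, consecutive spans overlap by 32 tokens, so most tokens are embedded with context on both sides.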
You can see that we did think of other truncation strategies, but haven't had time to implement them.
Hi @bdura! Thank you for the insights. I guess the mismatch between the tokenizers is no easy thing to solve. But wouldn't it be possible to use the HuggingFace tokenizer first and try to align these tokens with the spaCy tokens afterwards? Then we could chunk (window) the document for the HuggingFace model correctly and avoid the warning.
Regarding the current truncation mechanism, I am not sure I understand it correctly. When the chunk is too long for the model, is only the rest of the chunk dropped, or the rest of the complete document? As I understand you, it is the latter. But I have observed that the NER performance normalizes a few sentences after the big word occurs. And if there is a mitigation mechanism in place, why is the HuggingFace warning emitted in the first place?
I think the library would benefit if this could be handled correctly. Suppose we extract text from a PDF or an email, and suppose the first page contains only two words, the title. Wouldn't the extracted text contain many repeated whitespace-like characters? I first observed this with emails, which is where the reported issue came to my attention. Applications with real data would benefit greatly from an improvement here.
Regards, Daniel
When the chunk is too long for the model, is only the rest of the chunk dropped, or the rest of the complete document?
Not exactly: this is done on a span-by-span basis.
In broad terms, the transformer component within spaCy applies the following processing to the document: it first splits the document into spans using the span_getter function. These spans may overlap, and they probably should: that way each token will get an embedding computed using more context. Each span is then tokenized by the HuggingFace tokenizer and, if it exceeds the model's wordpiece limit, truncated independently, so only the end of that particular span is dropped (see the sketch below).
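A rough sketch of that per-span truncation idea, not the actual implementation linked above:

```python
# Rough sketch: wordpieces beyond the model's limit are dropped from the end of
# *this span only*; later spans are processed normally, which is why NER tends to
# recover a few sentences after the big word. Not the actual spacy-transformers code,
# and 512 is only a typical model limit, not necessarily the one used here.
def truncate_span_wordpieces(wordpieces: list, max_wordpieces: int = 512) -> list:
    return wordpieces[:max_wordpieces]
```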
why is the HuggingFace warning emitted in the first place
The warning is an interesting indication that something unexpected is happening... We could filter it and issue a dedicated warning though.
But wouldn't it be possible to use the HuggingFace tokenizer first and try to align these tokens with the spaCy tokens afterwards?
We see the rule-based, hackable tokenization as one of the strengths of spaCy. However, you can change it altogether (and even use a BERT wordpiece tokenizer).
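For example, here is a minimal sketch of swapping spaCy's tokenizer for a HuggingFace wordpiece tokenizer, following the custom-tokenizer pattern from the spaCy docs (the model name is an assumption):

```python
# Replace spaCy's rule-based tokenizer with a wordpiece tokenizer so that spaCy
# tokens line up with the transformer's wordpieces. Sketch only; model name assumed.
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer

class WordpieceTokenizer:
    def __init__(self, vocab, hf_name="bert-base-uncased"):
        self.vocab = vocab
        self.hf_tokenizer = AutoTokenizer.from_pretrained(hf_name)

    def __call__(self, text):
        words = self.hf_tokenizer.tokenize(text)
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WordpieceTokenizer(nlp.vocab)
print([t.text for t in nlp("contactcontactcontact")])  # ['contact', '##con', ...]
```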
How to reproduce the behaviour
code:
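The original snippet was not preserved here; a minimal reproduction along the lines described in the analysis below would look roughly like this (the pipeline name and text are assumptions, not the reporter's exact input):

```python
# Reproduction sketch (not the reporter's exact script): a document containing one
# massive repeated "word", followed by ordinary sentences with named entities.
import spacy

nlp = spacy.load("en_core_web_trf")

text = (
    "Angela Merkel visited Paris last week. "
    + "contact" * 600 + " "
    + "Afterwards she met Emmanuel Macron in Berlin. "
    + "The next day she flew back to Germany."
)

doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# Per the report below: a HuggingFace length warning is emitted, and entities in the
# sentences right after the big word are missed before NER recovers further on.
```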
console output:
NER performance: The NER performance of the last sentence degenerates:
analysis: When there are words or tokens that repeat massively, the warning is emitted. Contrary to what is suggested in other sources (https://github.com/explosion/spaCy/issues/6939), I think the impact is big. Before the occurrence of the "big word", NER works normally, but it fails on the sentences directly following it. After some sentences, the NER performance usually normalizes and works again. I am aware that spacy-transformers chunks big texts, and thanks to that we can process long texts. But something is buggy, and I think it has to do with the chunking in the spacy-transformers library combined with the byte-pair encoding used by the tokenizer of the trf model.
As long as the bug exists, the warning should be emitted every time and not just once. Unfortunately, I could not change that behavior.
Your Environment