schudoku opened 1 year ago
Hello @schudoku, the issue comes from the fact that spaCy tokens and the output of HuggingFace's tokenizer do not align exactly. In particular, \r\n...\r\n and contactcontact...contact are treated as single tokens in spaCy, but will be split by HuggingFace (although the exact split will depend on the tokenizer used). In your extreme case, the document spans 9 tokens in spaCy, while the HuggingFace tokenizer outputs 4011 tokens (!).
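To see the mismatch concretely, you can count both kinds of tokens on a document with one huge repeated "word". The following is an illustrative sketch only; the example text and tokenizer name are assumptions, not the exact document or model from this issue:

```python
# Compare spaCy token counts with HuggingFace wordpiece counts.
# The example text and the tokenizer name are assumptions for illustration.
import spacy
from transformers import AutoTokenizer

text = "Hello there. " + "contact" * 500 + " Regards, Daniel."

nlp = spacy.blank("en")
print("spaCy tokens:", len(nlp(text)))  # the repeated word stays a single spaCy token

hf_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print("wordpieces:  ", len(hf_tokenizer(text)["input_ids"]))  # split into hundreds of wordpieces
```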
To cope with this, spacy-transformers uses a truncation mechanism defined here. It takes a radical approach: drop wordpieces from the end. In your case, the last tokens of the document won't be fed to the model, hence the inconsistent result.
We mitigate this risk by using a low window size in the span getter. To wit, the en_core_web_trf configuration is:
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
In most cases, the 128-token window is sufficient to avoid the kind of issues you ran into.
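For intuition, the window/stride splitting can be approximated in a few lines of plain Python. This is a simplified illustration of the idea, not the actual spacy-transformers implementation:

```python
# Simplified illustration of a window/stride span getter: slice a document of
# doc_len spaCy tokens into overlapping spans. Not the actual spacy-transformers code.
def strided_spans(doc_len: int, window: int = 128, stride: int = 96):
    spans = []
    start = 0
    while start < doc_len:
        spans.append((start, min(start + window, doc_len)))
        if start + window >= doc_len:
            break
        start += stride
    return spans

print(strided_spans(300))  # [(0, 128), (96, 224), (192, 300)]
```

With window=128 and stride=96, consecutive spans overlap by 32 tokens, so most tokens are embedded with context on both sides.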
You can see that we did think of other truncation strategies, but haven't had time to implement them.
Hi @bdura! Thank you for the insights. I guess the mismatch between the tokenizers is no easy thing to solve. But wouldn't it be possible to use the HuggingFace tokenizer first and try to align these tokens with the spaCy tokens afterwards? Then we could chunk (window) the document for the HuggingFace model correctly and avoid the warning.
Regarding the current truncation mechanism, I am not sure I understand it correctly. When the chunk is too long for the model, is only the rest of the chunk dropped, or the rest of the complete document? As I understand you, it is the latter. But I have observed that the NER performance normalizes a few sentences after the big word occurs. And if there is a mitigation mechanism in place, why is the HuggingFace warning emitted in the first place?
I think the library would benefit if this could be handled correctly. Suppose we extract text from a PDF or an email, and suppose the first page contains only two words, the title. Wouldn't the extracted text contain many repeated whitespace-like characters? I first observed this with emails, which is where the reported issue came to my attention. Applications with real data would benefit greatly from an improvement here.
Regards, Daniel
When the chunk is too long for the model, is only the rest of the chunk dropped, or the rest of the complete document?
Not exactly: this is done on a span-by-span basis.
In broad terms, the transformer component within spaCy applies the following processing to the document: it first splits the document into spans using the span_getter function. These spans may overlap, and they probably should: that way each token will get an embedding computed using more context. Each span is then tokenized by the HuggingFace tokenizer and, if it exceeds the model's wordpiece limit, truncated independently, so only the end of that particular span is dropped (see the sketch below).
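A rough sketch of that per-span truncation idea, not the actual implementation linked above:

```python
# Rough sketch: wordpieces beyond the model's limit are dropped from the end of
# *this span only*; later spans are processed normally, which is why NER tends to
# recover a few sentences after the big word. Not the actual spacy-transformers code,
# and 512 is only a typical model limit, not necessarily the one used here.
def truncate_span_wordpieces(wordpieces: list, max_wordpieces: int = 512) -> list:
    return wordpieces[:max_wordpieces]
```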
why is the HuggingFace warning emitted in the first place
The warning is an interesting indication that something unexpected is happening... We could filter it and issue a dedicated warning though.
But wouldn't it be possible to use the HuggingFace tokenizer first and try to align these tokens with the spaCy tokens afterwards?
We see the rule-based, hackable tokenization as one of the strengths of spaCy. However, you can change it altogether (and even use a BERT wordpiece tokenizer).
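For example, here is a minimal sketch of swapping spaCy's tokenizer for a HuggingFace wordpiece tokenizer, following the custom-tokenizer pattern from the spaCy docs (the model name is an assumption):

```python
# Replace spaCy's rule-based tokenizer with a wordpiece tokenizer so that spaCy
# tokens line up with the transformer's wordpieces. Sketch only; model name assumed.
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer

class WordpieceTokenizer:
    def __init__(self, vocab, hf_name="bert-base-uncased"):
        self.vocab = vocab
        self.hf_tokenizer = AutoTokenizer.from_pretrained(hf_name)

    def __call__(self, text):
        words = self.hf_tokenizer.tokenize(text)
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WordpieceTokenizer(nlp.vocab)
print([t.text for t in nlp("contactcontactcontact")])  # ['contact', '##con', ...]
```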
How to reproduce the behaviour
code:
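The original snippet was not preserved here; a minimal reproduction along the lines described in the analysis below would look roughly like this (the pipeline name and text are assumptions, not the reporter's exact input):

```python
# Reproduction sketch (not the reporter's exact script): a document containing one
# massive repeated "word", followed by ordinary sentences with named entities.
import spacy

nlp = spacy.load("en_core_web_trf")

text = (
    "Angela Merkel visited Paris last week. "
    + "contact" * 600 + " "
    + "Afterwards she met Emmanuel Macron in Berlin. "
    + "The next day she flew back to Germany."
)

doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# Per the report below: a HuggingFace length warning is emitted, and entities in the
# sentences right after the big word are missed before NER recovers further on.
```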
console output:
NER performance: The NER performance of the last sentence degenerates:
analysis: When there are words or tokens that repeat massively, the warning is emitted. Contrary to what is suggested in other sources (https://github.com/explosion/spaCy/issues/6939), I think the impact is big. Before the occurrence of the "big word", NER works normally, but it fails on the sentences directly following it. After some sentences, the NER performance usually normalizes and works again. I am aware that spacy-transformers chunks big texts, and thanks to that we can process long texts. But something is buggy, and I think it has to do with the chunking in the spacy-transformers library combined with the byte-pair encoding used by the tokenizer of the trf model.
As long as the bug exists, the warning should be emitted every time and not just once. Unfortunately, I could not change that behavior.
Your Environment