flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Text instance length limit for transformer models in Flair #1906

Closed krzysztoffiok closed 3 years ago

krzysztoffiok commented 4 years ago

Hi,

I understand that most transformer models follow BERT regarding the maximum length of the analyzed text instance, and that this value is 512 tokens.

So I deliberately started fine-tuning an ALBERT base v2 classification model on text instances over 1000 tokens long and... nothing crashed.

How is this possible? How does Flair handle this limit? Or are those models already able to handle longer text entities, and it is only my ignorance that I don't know how they do it?

Any explanations will be very appreciated!

Best,

schelv commented 4 years ago

This is what happens: Flair processes the text in 512-token blocks (strided). Each block gets its own transformer prediction, and the per-block embedding outputs are then put back together.

The model stays exactly the same, as do its limitations. The context for the embedding is still only 512 tokens wide, so using this for (longer) text-level classification has no added benefit (besides not crashing).
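
To make the idea concrete, here is a minimal sketch of block-wise embedding (illustrative only, not Flair's actual implementation; the model name and block handling are assumptions, and real striding would use overlapping windows):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example model; any BERT-style model with a 512-token limit behaves the same.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2").eval()

def embed_long_text(text: str, block_size: int = 512) -> torch.Tensor:
    # Tokenize without truncation to keep the full sequence.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    outputs = []
    for start in range(0, len(ids), block_size):
        block = torch.tensor([ids[start:start + block_size]])
        with torch.no_grad():
            # Each block is embedded independently: attention never crosses
            # block boundaries, so the effective context stays <= 512 tokens.
            outputs.append(model(input_ids=block).last_hidden_state[0])
    # Stitch the per-block outputs back together into one (num_tokens, hidden) tensor.
    return torch.cat(outputs, dim=0)
```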

You are completely right on everything you said! I think the flair documentation on this subject is lacking 😅.

Maybe the documentation should be updated, or a warning should be given when a model uses this feature.

krzysztoffiok commented 4 years ago

@schelv Thank you very much for the very quick response.

Please let me rephrase and further detail your answer so I'm sure I understand properly. For example, given a text instance of 1200 tokens:

I get 3 blocks of 512, 512 and 176 tokens? Next, for each block, the model outputs a 3072-dimensional embedding, so I end up with 3 embeddings of that length. Is the final entity_level_embedding for the whole text entity then, say, an average of those 3 block embeddings, and of course also of length 3072?
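
Purely to make the question concrete, here is a sketch of the averaging I am asking about (hypothetical, and, per the reply below, not what Flair actually does for text classification):

```python
import torch

# Suppose the model produced one 3072-dimensional embedding per block.
block_embeddings = [torch.randn(3072) for _ in range(3)]

# Mean-pool the block embeddings into a single document-level vector;
# the result keeps the same dimensionality (3072).
entity_level_embedding = torch.stack(block_embeddings).mean(dim=0)
print(entity_level_embedding.shape)  # torch.Size([3072])
```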

Thank you again.

djstrong commented 4 years ago

Striding works for TokenClassification; for TextClassification, the text is truncated to 512 tokens.
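
In other words, anything past the limit is simply dropped. A minimal illustration using the HuggingFace tokenizer that Flair wraps (standard `transformers` API, not Flair's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

long_text = "word " * 2000  # well over the 512-token limit
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # at most 512, regardless of input length
```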

krzysztoffiok commented 4 years ago

@djstrong thank you, so only the first 512 tokens end up being analyzed by the model, correct?

krzysztoffiok commented 4 years ago

@djstrong could you point me to the place in the code where I could modify this behavior?

djstrong commented 4 years ago

https://github.com/flairNLP/flair/blob/d75b82bc6a33f4655c46f0e19d09ee5f2c24c93d/flair/embeddings/document.py#L115-L119

krzysztoffiok commented 4 years ago

@djstrong thank you again.

krzysztoffiok commented 4 years ago

@djstrong I understand that if I opt for a transformer model like Longformer, Flair will adopt that model's model_max_length, i.e. whatever Longformer's maximum is (4096 tokens?)?

OK, I see it does. Thanks again.
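
For reference, the limit a given model advertises can be checked directly on its tokenizer (standard HuggingFace `transformers` API, shown here as a quick illustration):

```python
from transformers import AutoTokenizer

# Longformer-style models advertise a much larger limit than BERT-style ones.
print(AutoTokenizer.from_pretrained("allenai/longformer-base-4096").model_max_length)  # 4096
print(AutoTokenizer.from_pretrained("bert-base-uncased").model_max_length)  # 512
```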

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.