flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

roberta with long text instances #2847

Closed miwieg closed 1 year ago

miwieg commented 2 years ago

I am using FLAIR for text classification ("TextClassifier") with RoBERTa. My dataset contains only about 2000 instances, but the text instances themselves are fairly long, i.e. longer than 512 tokens. I understand that, in principle, such transformers cannot process text instances of that length, so I am wondering how FLAIR handles this. (My code using FLAIR runs and produces reasonable results.) Does FLAIR cut off the text after 512 tokens, or does it pursue a sliding-window approach?

Thank you very much.

helpmefindaname commented 2 years ago

Hi @miwieg, TransformerEmbeddings provide a parameter allow_long_sentences. If that parameter is set to True, the text is split into overlapping windows and the token embeddings are computed over those windows. For example, "This is a very very very long sentence" with a maximum window length of 6 tokens would be split into "This is a very very very" and "very very long sentence", and both parts get embedded.

For TextClassification, you can use this by setting the cls_pooling parameter to either max or mean, so that the context of all tokens is gathered. Note that the default cls won't be sufficient, as then only the first sub-sentence is used.
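
A minimal sketch of how this could look when constructing the embeddings (the model name is just an example; the two parameters are the ones discussed in this thread):

```python
from flair.embeddings import TransformerDocumentEmbeddings

# Sketch: enable overlapping windows for texts beyond the 512-token limit
# and pool over all tokens instead of only the first window's CLS token.
document_embeddings = TransformerDocumentEmbeddings(
    "roberta-base",             # example model name
    allow_long_sentences=True,  # split long texts into overlapping windows
    cls_pooling="mean",         # or "max"; default "cls" uses only the first window
)
```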

miwieg commented 2 years ago

Thank you very much for your reply.

What is the default setting of TextClassifier? Does it simply strip off any tokens following the 512th token? My instances actually comprise more than one sentence, so I guess cls is not a good choice?

miwieg commented 2 years ago

Just to clarify whether I understood you correctly:

If I follow the typical text classification example, i.e. "Training a Text Classification Model" in https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md, do I just have to add the following lines after the initialization of the TransformerDocumentEmbeddings object?

```python
document_embeddings.allow_long_sentences = True
document_embeddings.cls_pooling = "mean"
```

It would also be good to know how long texts are processed if everything is left at the defaults.

Thank you.

helpmefindaname commented 2 years ago

I don't know if that works; I would rather pass both parameters to the constructor: `document_embeddings = TransformerDocumentEmbeddings(..., allow_long_sentences=True, cls_pooling="mean")`
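
Spelled out as a sketch based on the tutorial linked above (TREC_6, the label handling, and the training call are taken from that tutorial; swap in your own corpus and model name):

```python
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Corpus and label dictionary as in the tutorial (replace with your own data)
corpus = TREC_6()
label_type = "question_class"
label_dict = corpus.make_label_dictionary(label_type=label_type)

# Pass both parameters directly to the constructor
# rather than setting attributes on the object afterwards
document_embeddings = TransformerDocumentEmbeddings(
    "roberta-base",
    allow_long_sentences=True,
    cls_pooling="mean",
)

classifier = TextClassifier(
    document_embeddings,
    label_dictionary=label_dict,
    label_type=label_type,
)

# Fine-tune as in the tutorial
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune("resources/taggers/long-text-classifier")
```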

miwieg commented 2 years ago

This is what I originally tried, but the constructor did not accept these parameters. So I ran the code as I suggested above. It ran on my data, and I did not receive any error messages. Can I conclude from this that the pooling was applied as requested?

helpmefindaname commented 2 years ago

Setting attributes on an object will never raise an error, no matter whether they existed before or not, so that doesn't tell you anything.

Are you sure that you are on the latest version (flair==0.11.4)? If not, you need to update.
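
If in doubt, the installed version can be checked with the standard library alone, e.g.:

```python
import importlib.metadata

# Print the installed flair version to confirm the upgrade took effect
print(importlib.metadata.version("flair"))
```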

miwieg commented 2 years ago

Thank you for the hint about the version number. I've updated it. However, the newest version that seems to be available is 0.11.3. Is that version already outdated, or is 0.11.3 fine?

helpmefindaname commented 2 years ago

Yes, sorry, I meant 0.11.3.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.