Hi @miwieg,
TransformerEmbeddings provide a parameter allow_long_sentences.
If that parameter is set to True, the embeddings split long inputs into overlapping chunks to compute the token embeddings. E.g. "This is a very very very long sentence" with a maximum token length of 6 would be split into
"This is a very very very" and "very very long sentence", and both chunks get embedded.
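The overlapping split described above can be sketched roughly like this. This is a toy illustration in plain Python, not Flair's actual implementation; the function name and the stride parameter are assumptions made for the example:

```python
def split_with_overlap(tokens, max_len, stride):
    """Split a token list into overlapping windows of at most max_len tokens.

    Each new window starts (max_len - stride) tokens after the previous one,
    so consecutive windows share `stride` tokens of context.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end
        start += max_len - stride
    return chunks

tokens = "This is a very very very long sentence".split()
# With max_len=6 and an overlap of 2 tokens, this reproduces the
# two chunks from the example above.
print(split_with_overlap(tokens, max_len=6, stride=2))
```

With these parameters the second window starts two tokens before the first one ends, so no token loses all of its left context at a chunk boundary.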
For TextClassification, you can use this by setting the cls_pooling parameter to either max or mean, so that the context of all tokens is gathered. Note that the default cls won't be sufficient, as it only uses the first sub-sentence.
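To illustrate why the choice matters, here is a toy sketch of the three pooling modes applied to per-chunk embedding vectors. This is plain Python made up for the example, not Flair's internal code; the function name and the vectors are assumptions:

```python
def pool_chunk_embeddings(chunk_vectors, mode="mean"):
    """Combine one embedding vector per chunk into a single document vector.

    - "mean": average each dimension across all chunks
    - "max":  take the per-dimension maximum across all chunks
    - "cls":  keep only the first chunk's vector (later chunks are ignored)
    """
    n = len(chunk_vectors)
    dim = len(chunk_vectors[0])
    if mode == "mean":
        return [sum(v[i] for v in chunk_vectors) / n for i in range(dim)]
    if mode == "max":
        return [max(v[i] for v in chunk_vectors) for i in range(dim)]
    if mode == "cls":
        return list(chunk_vectors[0])
    raise ValueError(f"unknown pooling mode: {mode}")

chunks = [[1.0, 4.0], [3.0, 2.0]]  # two chunks, 2-dim embeddings
print(pool_chunk_embeddings(chunks, "mean"))  # [2.0, 3.0]
print(pool_chunk_embeddings(chunks, "max"))   # [3.0, 4.0]
print(pool_chunk_embeddings(chunks, "cls"))   # [1.0, 4.0] -- second chunk discarded
```

The "cls" line makes the problem visible: everything after the first chunk is thrown away, which is why mean or max is recommended for long documents.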
Thank you very much for your reply.
What is the default setting of TextClassifier
? Does it simply strip off any tokens following the 512th token?
My instances actually comprise more than one sentence. So, I guess cls
is not a good choice?
Just to clarify whether I understood you correctly:
If I follow the typical text classification example, i.e. "Training a Text Classification Model" in: https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md
Do I just have to add the following lines after the initialization of the TransformerDocumentEmbeddings
object?
document_embeddings.allow_long_sentences = True
document_embeddings.cls_pooling = "mean"
It would also be good to know how the large text is processed if everything is left at default.
Thank you.
I don't know if that works, I would rather add it to the constructor:
document_embeddings = TransformerDocumentEmbeddings(..., allow_long_sentences=True, cls_pooling="mean")
This is what I originally tried but the constructor does not account for these parameters. I ran the code as I suggested above. The code could be run on my data -- I did not receive any error messages. Can I conclude from that that the pooling was implemented as requested?
You'll never receive error messages from setting attributes like that, whether or not they existed before, so the absence of errors doesn't prove anything.
Are you sure that you are on the latest version (flair==0.11.4)? If not, you need to update.
Thank you for the hint about the version number. I've updated it. However, the newest version that seems to be available is 0.11.3. Would that version already be outdated, or is 0.11.3 fine?
yes, sorry I meant 0.11.3
I am using Flair for text classification (TextClassifier) with RoBERTa. My dataset only contains about 2000 instances, but the text instances themselves are fairly long, i.e. longer than 512 tokens. I understand that, in principle, these transformers are not capable of processing such long text instances, so I am wondering how Flair solves this issue. (My code using Flair is running and producing reasonable results.) Does Flair cut off the text after 512 tokens, or does it pursue a striding-window approach?
Thank you very much.