MattGPT-ai opened this issue 2 months ago
Hi @MattGPT-ai,
can you elaborate which token limit you are referring to, or file a bug report?
The only token limit that should exist is in the `TransformerEmbeddings`, which you can work around with the `allow_long_sentences` option.
If that's not the case, I'd like to have a reproducible example of that limitation.
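For reference, a minimal sketch of that option on the transformer embeddings; the model name here is just an illustrative choice, not something specified in this thread:

```python
from flair.embeddings import TransformerWordEmbeddings

# instead of truncating at the model's 512-subtoken limit, split over-long
# inputs into overlapping windows internally and stitch the embeddings back together
embeddings = TransformerWordEmbeddings(
    "xlm-roberta-base",
    allow_long_sentences=True,
)
```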
I did just confirm that I could successfully use `allow_long_sentences=True` to train a simple model and pick up on entity classes that only appeared beyond the transformer's token limit of 512. However, our training still eventually fails with an `OutOfMemoryError: CUDA out of memory` error, and unfortunately I can't share the dataset that causes it since it contains PII.
I will try some more things to see if I can reproduce it or narrow down whether there is a particular issue, perhaps a memory leak, or maybe there is just a particularly large batch that it's failing on.
https://gist.github.com/MattGPT-ai/80327ab5854cb0d978d23f205eeae882
Linking to a gist with notebooks that demonstrate success using `allow_long_sentences`, and an OOM failure that results from increasing the sentence size a bit. So I think that while this utility isn't always necessary, it can be helpful.
Would it be possible to refactor the training script so that batching is based on the chunked inputs? It seems like it currently does not get a consistent batch size after chunking. Could you offer any insight here? I also see there is a `mini_batch_chunk_size` parameter; I'm not sure if it could help here, but I didn't quite understand from the docs what it does. I'm trying to dig more into the source code.
The problem with longer sentences is that they inevitably require more RAM, since the gradient information of all tokens is needed. That said, having a sentence in your batch that contains 10k (sub-)tokens means more than 20 transformer passes in your `TransformerEmbeddings`, and you need to fit the respective memory requirement.
With sentences that long, you will want only 1 sentence computed at a time. For this, you can use `mini_batch_chunk_size`, which allows you to compute fewer sentences in parallel while keeping the `batch_size` just as high (also known as gradient accumulation).
Notice that you can split the batch by sentences without loss of quality, but you cannot split sentences, as the token embeddings/gradients depend on each other.
In general I would recommend setting `mini_batch_chunk_size` to 1 when working with long sentences. If that doesn't work, you either have a GPU with very little RAM and should consider upgrading, or you have very long texts, in which case you can use a sentence splitter to split the long text into real sentences.
Notice that with the latter you need to adjust both the training and the inference code, as the model will only be capable of predicting shorter sentences.
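A minimal sketch of how this looks in a training script; the corpus (WNUT-17 here), model name, and hyperparameters are stand-ins, since the actual dataset in this thread is private:

```python
from flair.datasets import WNUT_17
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = WNUT_17()  # stand-in corpus; replace with your own
label_dict = corpus.make_label_dictionary(label_type="ner")

embeddings = TransformerWordEmbeddings("xlm-roberta-base", allow_long_sentences=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/long-sentence-ner",
    mini_batch_size=32,       # effective batch size per optimizer step
    mini_batch_chunk_size=1,  # forward/backward one sentence at a time (gradient accumulation)
    max_epochs=10,
)
```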
I am giving `mini_batch_chunk_size` a test; it does look like it helps to resolve the memory issue in our case.
I think that, at the very least, if the chunking function isn't useful, this could be reduced to a function that creates a labeled sentence from a text with character-indexed entities.
As for sentence chunking, I'm still a little unclear on whether there is really no use case, perhaps because I'm confused by the multiple uses of the word "sentence." Let's say that in our case we have very long texts, such as resumes, that contain many actual sentences. If some of the full resumes do not fit into memory, when is it invalid to split one into multiple `Sentence` objects so it essentially becomes multiple samples and we don't lose any of the annotated data? Are you saying that the splitting would need to be done at actual sentence boundaries to be valid?
I agree that `Sentence` is confusing; the naming is only like that for historical reasons.
I hope I can clarify what I meant. I will use "literal sentence" for a linguistic sentence and "Sentence object" for the Flair class `Sentence`:
You could split your resumes into literal sentences using the `SentenceSplitter`. That means you won't have one Sentence object per resume but multiple smaller ones. For those, the `SentenceSplitter` adds the next and previous objects as context, so you can use a FLERT transformer, `TransformerWordEmbeddings(..., use_context=True)`.
That way, you will require less memory for training, but still get predictions at the literal-sentence level while taking the surrounding literal sentences into account as context.
And yes, this way the actual sentence boundaries would always be valid, presumably making it easier for the model to learn.
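A sketch of that approach, assuming a recent Flair version where the segtok-based splitter lives in `flair.splitter`; the resume text is obviously made up:

```python
from flair.embeddings import TransformerWordEmbeddings
from flair.splitter import SegtokSentenceSplitter

# split one long document into multiple Sentence objects; the splitter
# links each Sentence to its neighbours so they can serve as context
splitter = SegtokSentenceSplitter()
resume_text = "Jane Doe worked at Acme Corp from 2015 to 2020. She led the data team there."
sentences = splitter.split(resume_text)

# FLERT-style embeddings that also look at the surrounding sentences as context
embeddings = TransformerWordEmbeddings("xlm-roberta-base", use_context=True)
```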
About the labels from char indices: I had to double-check, as I thought this already existed. Well, it kind of does in the `JsonlDataset`, but it is not accessible as a general utils function. So here I agree with you that this would be a useful contribution.
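A rough sketch of such a utility, written here only to illustrate the idea; the helper name and span-matching logic are hypothetical, not an existing Flair function, and it assumes a recent Flair version where slicing a `Sentence` returns a `Span`:

```python
from flair.data import Sentence

def sentence_from_char_spans(text, char_spans, label_type="ner"):
    """Build a labeled Sentence from (start, end, label) character offsets."""
    sentence = Sentence(text)
    for start, end, label in char_spans:
        # collect the tokens that fall entirely inside the character span
        tokens = [
            t for t in sentence
            if t.start_position >= start and t.end_position <= end
        ]
        if tokens:
            # slice the sentence to a Span (1-based token idx -> 0-based slice) and label it
            sentence[tokens[0].idx - 1 : tokens[-1].idx].add_label(label_type, label)
    return sentence

# usage: the offsets below are made up for illustration
s = sentence_from_char_spans(
    "Jane Doe joined Acme Corp in 2015.",
    [(0, 8, "PERSON"), (16, 25, "ORG")],
)
```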
Problem statement
Currently, we are not able to train `SequenceTagger` models with tagged `Sentence` objects exceeding the token limit (typically 512). There does seem to be some support for long sentences in embeddings via the `allow_long_sentences` option, but it does not appear to apply to sequence tagging, where the labels still need to be applied at the token level. We have tried doing this, but if we don't limit the sentences to the token limit, we get an out-of-memory error. I'm not sure if this is a bug specifically, or just a lack of support for this feature.
Solution
I'm not sure if there is a more ideal way, but one solution for training is to split a sentence into "chunks" of 512 tokens or fewer and apply the labels to these chunks. It is important to avoid splitting chunks across a labeled entity boundary; a rough sketch of the idea follows.
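As an illustration only (not the actual PR code), a chunker could walk the tokens and only cut where no labeled span is straddled. The function name and logic below are mine, and it assumes slicing a `Sentence` yields a labelable `Span`:

```python
from flair.data import Sentence

def chunk_sentence(sentence, max_tokens=512, label_type="ner"):
    """Split a labeled Sentence into Sentences of at most max_tokens tokens,
    never cutting inside a labeled span. Entities longer than max_tokens
    are not handled here; this is a sketch, not production code."""
    # 0-based indices of entity tokens after which we must not cut
    inside_entity = set()
    for span in sentence.get_spans(label_type):
        idxs = [t.idx - 1 for t in span.tokens]
        inside_entity.update(idxs[:-1])

    chunks, start = [], 0
    while start < len(sentence):
        end = min(start + max_tokens, len(sentence))
        # back off until the cut no longer lands inside an entity
        while end < len(sentence) and end > start + 1 and (end - 1) in inside_entity:
            end -= 1
        chunk = Sentence([t.text for t in sentence.tokens[start:end]])
        # re-attach the labels that fall entirely inside this chunk
        for span in sentence.get_spans(label_type):
            first, last = span.tokens[0].idx - 1, span.tokens[-1].idx - 1
            if start <= first and last < end:
                chunk[first - start : last - start + 1].add_label(
                    label_type, span.get_label(label_type).value
                )
        chunks.append(chunk)
        start = end
    return chunks
```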
Additional Context
We have used this approach in training successfully, so I will be introducing our specific solution in a PR.