ShaoMinLiu-Holmusk opened 2 years ago
Maybe there is a long text sequence? Transformers have a quadratic memory requirement in the text length. Try reducing max_seq_length.
Thank you for your quick response, I really appreciate it. I am mostly training with short sentences, but I will check that as well.
I did some quick analysis of the distribution of the data ingested up to the point where memory overflows. It appears that epoch 2, batch 94 contains one of the longest token counts, and the visualisation of memory use coincides with these inputs.
Just one question, if you don't mind. I was under the impression that even for short input sequences, given the specified max_seq_length, the trailing positions would be filled with [PAD] tokens and any excess tokens would be truncated. So I always imagined that all sentences are padded to the same length, and thus all tensors have the same size. For example:
```
suppose max_seq_len == 10

inputA = 'hello, nice to meet you'
  -> 'hello, nice to meet you [PAD] [PAD] [PAD] [PAD] [PAD]'  (padded to 10 tokens)
  -> tensor length == 10

inputB = 'hi, I have heard many things about you, its nice to finally meet you.'
  -> 'hi, I have heard many things about you, its nice'  (truncated to 10 tokens)
  -> tensor length == 10
```
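To make that mental model concrete, here is a minimal sketch using the Hugging Face transformers tokenizer as an illustrative stand-in (an assumption; the actual training script may tokenize differently). With `padding="max_length"`, every batch really would have identical tensor shapes:

```python
# Hedged sketch of the "pad everything to max_seq_length" mental model,
# using the Hugging Face transformers tokenizer as an illustrative stand-in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model name is an assumption

batch = tokenizer(
    [
        "hello, nice to meet you",
        "hi, I have heard many things about you, its nice to finally meet you.",
    ],
    padding="max_length",  # always pad to max_length, regardless of batch contents
    truncation=True,       # drop tokens beyond max_length
    max_length=10,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 10]); fixed shape for every batch
```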
Can you briefly explain why the input sentences matter here? Are they not supposed to use the same amount of memory, since the tensors have fixed dimensions after the tokens are converted to index representations?
I understand that memory use generally increases over one forward pass, but is it correct to say that the expected maximum memory use for each epoch (each batch) should be constant? It appears the answer to my question is no, but I cannot figure out why.
Text is padded to the shortest length possible, i.e. to the longest sequence in the current batch, which gives you much faster training than padding everything to the max length.
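In other words, padding is dynamic: tensor shapes, and therefore memory use, vary from batch to batch. A minimal sketch of the same effect with the Hugging Face tokenizer (again an illustrative assumption, not the library's exact internals):

```python
# Hedged sketch of dynamic ("longest") padding: each batch is padded only to
# its own longest sequence, so tensor shapes differ from batch to batch.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumption

short_batch = tokenizer(["hello", "hi there"],
                        padding="longest", return_tensors="pt")
long_batch = tokenizer(["hello", "a much longer sentence, " * 20],
                       padding="longest", truncation=True, max_length=512,
                       return_tensors="pt")

print(short_batch["input_ids"].shape)  # small, e.g. torch.Size([2, 4])
print(long_batch["input_ids"].shape)   # much larger, e.g. torch.Size([2, 102])
# A batch containing one unusually long sequence allocates far more memory than
# the batches before it, which is why the OOM appears only at a specific step.
```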
Memory use remains below 2 GB most of the time during training with the following configuration, but it soon hits an OOM at epoch 2, iteration 94:
```
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 7.44 GiB total capacity; 6.42 GiB already allocated; 78.31 MiB free; 6.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I repeated the same configuration twice, and the error occurs at the exact same step.
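Since the error message itself suggests max_split_size_mb, a minimal sketch of that suggestion (the value 128 is an illustrative guess, not a recommendation from the library) would be:

```python
# Hedged sketch: set the allocator option suggested by the error message.
# It must be set before the first CUDA allocation; 128 MB is an arbitrary
# illustrative value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch (and run training) only after setting the variable
```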
If I reduce the batchSize of the BinarySimilarityDataset, the error is delayed to a later step.
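Alongside a smaller batch size, the earlier suggestion to reduce max_seq_length can be applied directly on the model. A minimal sketch, where the model name and the cap of 128 are illustrative assumptions:

```python
# Hedged sketch: cap the model's max_seq_length so a single long batch cannot
# spike memory. SentenceTransformer exposes max_seq_length as a settable
# attribute; the model name and the cap of 128 are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # the model's current limit
model.max_seq_length = 128   # hard upper bound on tokens per sequence
```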
I have masked some of the details below to avoid leaking sensitive information.
Configuration file:
Training script: