Open fierceX opened 5 years ago
The purpose of Pad is to form rectangular input so that mini-batch acceleration can be achieved; the amount of padding is determined by the maximum length in the batch. The memory cost can be reduced by using bucketing, and you can also consider clipping the training sequences in the preprocessing pipeline.
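A minimal illustration of that behaviour (not code from this PR):

```python
import gluonnlp as nlp

# Pad fills every sample up to the longest sequence in the batch it is given.
pad = nlp.data.batchify.Pad(pad_val=0)
batch = pad([[1, 2, 3], [4, 5], [6]])
print(batch.shape)   # (3, 3): three samples, each padded to the batch maximum of 3
```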
Hi @fierceX, I took a quick look at the code in 7219400. The dataloader uses batchify directly, so when batches are formed randomly the samples within a batch can differ greatly in length. This leads to a lot of unnecessary padding, which may be the reason for the out-of-memory error. You can use nlp.data.sampler.FixedBucketSampler to improve this, or use gluonnlp.data.ClipSequence to clip the maximum sentence length during data preprocessing.
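A rough sketch of the FixedBucketSampler suggestion (the toy dataset and the (token_ids, label) sample layout are assumptions for illustration, not code from the PR):

```python
import gluonnlp as nlp
from mxnet.gluon.data import DataLoader, SimpleDataset

# Toy stand-in for the real preprocessed features: (token_ids, label) pairs.
train_data = SimpleDataset([([1, 2, 3, 4], 0), ([5, 6], 1), ([7, 8, 9], 0), ([1, 2], 1)])
lengths = [len(tokens) for tokens, _ in train_data]

# Pad token ids only up to the longest sample within each bucketed batch.
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(pad_val=0),
    nlp.data.batchify.Stack())

sampler = nlp.data.sampler.FixedBucketSampler(
    lengths, batch_size=2, num_buckets=2, ratio=0, shuffle=True)
print(sampler.stats())   # bucket keys, sample counts, batch sizes

loader = DataLoader(train_data, batch_sampler=sampler, batchify_fn=batchify_fn)
for token_ids, labels in loader:
    print(token_ids.shape, labels.shape)
```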
@kenjewu In commit https://github.com/dmlc/gluon-nlp/pull/493/commits/b6be61cd6d88d1e559c0360b5be35256bba7dc93 I used FixedBucketSampler, but GPU memory still grows and gradually reaches the same size as before.
Can you show sampler.stats() here? This may help to troubleshoot the problem.
batch_size=6, num_buckets=10, ratio=0, shuffle=True
FixedBucketSampler:
sample_num=88641, batch_num=14777
key=[78, 112, 146, 180, 214, 248, 282, 316, 350, 384]
cnt=[3698, 7202, 24221, 21614, 13458, 7671, 4806, 2772, 1499, 1700]
batch_size=[6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
From the key you can see the sequence length of each bucket. Is this very different from the maximum length you set before? If the difference is too large, you may need to clip the long sequences.
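For reference, a minimal sketch of clipping in preprocessing (the maximum length of 180 is just an illustrative value):

```python
import gluonnlp as nlp

clip = nlp.data.ClipSequence(180)   # keep at most the first 180 tokens
tokens = list(range(300))           # a sequence that is longer than the limit
print(len(clip(tokens)))            # 180
```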
gluonnlp.data.batchify.Pad should accept a maximum-length parameter. Variable-length input increases memory usage, which is especially noticeable when fine-tuning the BERT model on the SQuAD dataset. In commit https://github.com/dmlc/gluon-nlp/pull/493/commits/7219400545f15a7597d1bde98efd608ef96bcbd0 I used gluonnlp.data.batchify.Pad to pad each batch dynamically to its own length, but this causes a significant increase in memory usage. Specifically, when running the script with python finetune_squad.py --optimizer adam --batch_size 12 --lr 3e-5 --epochs 2 --gpu, the run previously used no more than 13 GB of GPU memory; after this change, a 16 GB V100 runs out of memory.
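Since Pad currently has no maximum-length argument, one workaround is to enforce a fixed length during preprocessing; the helper below is my own illustration of that idea, not an existing gluonnlp API.

```python
# Hypothetical helper: truncate or pad every sequence to the same fixed length,
# so per-batch memory no longer depends on the longest sample in the batch.
def pad_to_fixed(seqs, max_len, pad_val=0):
    return [list(s)[:max_len] + [pad_val] * max(0, max_len - len(s)) for s in seqs]

print(pad_to_fixed([[1, 2, 3], [4, 5, 6, 7, 8]], max_len=4))
# [[1, 2, 3, 0], [4, 5, 6, 7]]
```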