Closed — vatsalaggarwal closed this issue 6 years ago
I keep getting OOM errors on the GPU while trying to train the teacher. I've tried batch sizes 28, 24, 20, and 16 on 4 GPUs. 28 crashes immediately; 24 lasts a while; 20 even longer. 16 runs longest but still crashes at around 62000 steps.
Any ideas?
It depends on the number of GPUs you are using for training. I normally set batch_size = number_of_GPUs * 3 (each of my GPUs has 11178 MiB of memory).
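As a minimal sketch of that heuristic (variable names here are illustrative, not from the training code), assuming 4 GPUs with roughly 11 GiB of memory each:

```python
# Hypothetical batch-size calculation following the rule of thumb above:
# about 3 samples per ~11 GiB GPU, scaled by the number of GPUs.
num_gpus = 4            # assumption: 4 GPUs, as in the question
samples_per_gpu = 3     # ~3 samples fit per 11178 MiB card (heuristic)
batch_size = num_gpus * samples_per_gpu
print(batch_size)       # 12
```

So on a 4-GPU machine this heuristic suggests a total batch size of 12, well below the 16 that still ran out of memory.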