A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License
Adding --fp16 to run_language_modeling and increasing batch size gives CUDA out-of-memory error #773
Open
mahdirezaey opened 4 years ago
Hi all,
I am using Colab with a single GPU (Tesla P100-PCIE-16GB).
The code below from Hugging Face ran fine:
!python /content/transformers/examples/run_language_modeling.py --output_dir=/content/outputs --model_type=bert --model_name_or_path=bert-base-cased --num_train_epochs 1 --do_train --do_eval --per_gpu_train_batch_size 152 --train_data_file=/content/input_data/trn.txt --eval_data_file=/content/input_data/val.txt --evaluate_during_training --learning_rate 1e-4 --overwrite_output_dir --tokenizer_name /content/token/ --block_size 64 --mlm
(A batch size of 152 was the maximum I could run without a CUDA out-of-memory error.) I then installed Apex with:
%%writefile setup.sh
export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
!sh setup.sh
Then I added --fp16 to the command, but I was not able to increase the batch size at all, even a bit.
@julien-c, @ugent, @LysandreJik, @thomwolf, do you know why?
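One possible explanation can be sketched with a rough back-of-the-envelope memory estimate. This is an illustrative model only (the multipliers are assumptions about typical mixed-precision bookkeeping, not Apex's actual internals): with Adam, the fp32 master weights and fp32 optimizer moments that mixed precision keeps around can offset the savings from fp16 weights and gradients on the parameter side, so most of the expected savings come from activations, not parameters.

```python
def model_memory_bytes(n_params, fp16_weights=False, adam=True):
    """Rough per-parameter memory estimate (illustrative assumption, not measured).

    Counts only parameter-side state: weights, gradients, optimizer moments,
    and (for mixed precision) an fp32 master copy of the weights.
    Activation memory, which scales with batch size, is NOT included.
    """
    weight = n_params * (2 if fp16_weights else 4)   # fp16 = 2 bytes, fp32 = 4 bytes
    master = n_params * 4 if fp16_weights else 0     # fp32 master copy kept for the update
    grads = n_params * (2 if fp16_weights else 4)    # gradients match weight precision
    optim = n_params * 8 if adam else 0              # Adam: two fp32 moments per param
    return weight + master + grads + optim

# Hypothetical size roughly matching bert-base (~110M parameters)
n = 110_000_000
fp32_total = model_memory_bytes(n, fp16_weights=False)
mixed_total = model_memory_bytes(n, fp16_weights=True)
print(fp32_total, mixed_total)  # parameter-side totals come out equal (16 bytes/param)
```

Under these assumptions the parameter-side footprint is identical in both modes, which is consistent with fp16 freeing little headroom for a larger batch when optimizer state dominates; the real answer depends on how much of the 16 GB is going to activations at batch size 152.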