Closed 1treu1 closed 2 years ago
Hey, good afternoon. This error usually indicates that the machine is running out of CPU or GPU memory. See also the warning about the unsupported GPU version you are using. I suggest running the code on another machine; Google Colab can be an option. Closing the issue since it's not dependent on the code.
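If it helps, here is a quick way to check how much RAM and GPU memory is actually available before launching the script. This is only a minimal sketch assuming a Linux machine with NVIDIA drivers installed; adapt it to your environment:

```bash
#!/usr/bin/env bash
# Rough pre-flight check before starting the pretraining run.

echo "=== System memory ==="
# Shows total/used/available RAM and swap in human-readable units.
free -h

echo "=== GPU memory ==="
# Requires NVIDIA drivers; prints free/total memory per GPU in MiB.
nvidia-smi --query-gpu=name,memory.free,memory.total --format=csv
```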
I ran it in Google Colab and it does not save any new model. Could it by any chance be overwriting the one I gave it to do the pretraining?
I attach what appeared in the console after running the .sh script:
2022-02-03 19:49:40.675 | WARNING | main:main:226 - Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s
2022-02-03 19:49:40.676 | INFO | main:main:228 - Training/evaluation parameters %s
2022-02-03 19:49:40.677 | INFO | main:main:240 - model_name_or_path provided as: %s
2022-02-03 19:49:40.702 | INFO | main:main:268 - Tokenizer dir was specified. The maximum length (in number of tokens) for the inputs to the transformer model, model_max_length, is: %d
2022-02-03 19:49:45.101 | INFO | main:main:314 - Param check: block_size is: %d
2022-02-03 19:49:45.102 | INFO | main:main:318 - Preparing training dataset from: %s
tcmalloc: large alloc 3023241216 bytes == 0x55dccaa5e000 @ 0x7f6bd7b711e7 0x55dc97a38fa9 0x55dc97b1d308 0x55dc97a3b84b 0x55dc97a3bbef 0x55dc97b3b815 0x55dc97a3b7c7 0x55dc97a3bbef 0x55dc97a99e84 0x55dc97b48217 0x55dc97aac8e0 0x55dc97aacebd 0x55dc97ad310a 0x55dc97a18ccd 0x55dc97a1a480 0x55dc97aa0ea4 0x55dc97aabb8c 0x55dc97aacf79 0x55dc97ad3c9b 0x55dc97a18ccd 0x55dc97a4a570 0x55dc97aacde8 0x55dc97ad3c9b 0x55dc97a4a3d4 0x55dc97aacde8 0x55dc97ad3074 0x55dc97a18ccd 0x55dc97a19dd9 0x55dc97af652b 0x55dc97b595b3 0x55dc97b63617
tcmalloc: large alloc 3023241216 bytes == 0x55dd7ed8e000 @ 0x7f6bd7b711e7 0x55dc97a132f1 0x55dc97a5c27e 0x55dc97aab6a8 0x55dc97aacef8 0x55dc97ad726a 0x55dc97a18ccd 0x55dc97a1a13d 0x55dc97aa0a63 0x55dc97a1acf5 0x55dc97a9fa80 0x55dc97aa0109 0x55dc97b48245 0x55dc97aac8e0 0x55dc97aacebd 0x55dc97ad310a 0x55dc97a18ccd 0x55dc97a1a480 0x55dc97aa0ea4 0x55dc97aabb8c 0x55dc97aacf79 0x55dc97ad3c9b 0x55dc97a18ccd 0x55dc97a4a570 0x55dc97aacde8 0x55dc97ad3c9b 0x55dc97a4a3d4 0x55dc97aacde8 0x55dc97ad3074 0x55dc97a18ccd 0x55dc97a19dd9
run_language_modeling_script.sh: line 36: 1432 Killed python ../paccmann_proteomics/run_language_modeling.py --output_dir $OUTPUT_DIR --model_name_or_path $MODEL_NAME --model_type $MODEL_TYPE --tokenizer_name $TOKENIZER --train_data_file $TRAIN_FILE --eval_data_file $EVAL_FILE --logging_steps 400 --save_steps 400 --line_by_line --chunk_length 10000 --logging_dir $OUTPUT_DIR/logs --mlm --num_train_epochs $NUM_EPOCHS --learning_rate 1e-3 --per_device_train_batch_size $BATCH_SIZE --per_device_eval_batch_size $BATCH_SIZE --seed $SEED --block_size 512 --do_train --do_eval --overwrite_output_dir --chunk_length 1000000 --overwrite_cache
Unfortunately, the stack trace you pasted shows that the code is still going out of memory. I'm afraid you will need a machine with more memory to run the script.
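If you want to try squeezing the run onto the current machine first, the usual levers are a smaller per-device batch size, a smaller --chunk_length, and a smaller --block_size, optionally compensating with gradient accumulation. Below is only a rough sketch of such an invocation: I am assuming run_language_modeling.py forwards the standard HuggingFace --gradient_accumulation_steps flag to the Trainer, so drop it if the script rejects it, and treat the numbers as illustrative:

```bash
# Same invocation as run_language_modeling_script.sh, with the memory-related
# knobs turned down:
#   * per-device batch size 1 instead of 4
#   * chunk_length back to 10000 instead of 1000000
#   * block_size halved to 256
#   * gradient accumulation of 4 to keep the effective batch size at 4
export BATCH_SIZE=1

python ../paccmann_proteomics/run_language_modeling.py \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME \
    --model_type $MODEL_TYPE \
    --tokenizer_name $TOKENIZER \
    --train_data_file $TRAIN_FILE \
    --eval_data_file $EVAL_FILE \
    --line_by_line \
    --mlm \
    --chunk_length 10000 \
    --block_size 256 \
    --per_device_train_batch_size $BATCH_SIZE \
    --per_device_eval_batch_size $BATCH_SIZE \
    --gradient_accumulation_steps 4 \
    --num_train_epochs $NUM_EPOCHS \
    --learning_rate 1e-3 \
    --logging_steps 400 \
    --save_steps 400 \
    --logging_dir $OUTPUT_DIR/logs \
    --seed $SEED \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --overwrite_cache
```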
Hello, good afternoon. I start the pretraining script with bash run_language_modeling_script.sh, but it is not saving the new pretrained model in the out folder:
export OUTPUT_DIR=../trained_models/out/
export MODEL_NAME=../trained_models/exp4_longformer/
export TOKENIZER=../trained_models/exp4_longformer/
export MODEL_TYPE=roberta
export TRAIN_FILE=../paccmann_proteomics/data/pretraining/train_1seq.txt
export EVAL_FILE=../paccmann_proteomics/data/pretraining/dev_1seq.txt
export BATCH_SIZE=4
export NUM_EPOCHS=10
export SAVE_STEPS=750
export SEED=1

python ../paccmann_proteomics/run_language_modeling.py \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME \
    --model_type $MODEL_TYPE \
    --tokenizer_name $TOKENIZER \
    --train_data_file $TRAIN_FILE \
    --eval_data_file $EVAL_FILE \
    --logging_steps 400 \
    --save_steps 400 \
    --line_by_line \
    --chunk_length 10000 \
    --logging_dir $OUTPUT_DIR/logs \
    --mlm \
    --num_train_epochs $NUM_EPOCHS \
    --learning_rate 1e-3 \
    --per_device_train_batch_size $BATCH_SIZE \
    --per_device_eval_batch_size $BATCH_SIZE \
    --seed $SEED \
    --block_size 512 \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --chunk_length 1000000 \
    --overwrite_cache

I attach an image of the console: