DonghwanKIM0101 closed this issue 3 years ago
Could you let me know the session IDs?
113 Killed for kaist0015/korquad-open-ldbd3/168 session and 106 Killed for kaist0015/korquad-open-ldbd3/170 session
Also 107 Killed for KAIST/0015/korquad-open-ldbd3/171 session :(
The process was killed by the OOM (out of memory) killer. You can either reduce the batch size or increase the memory limit. The NSML default run settings are listed below.
Session[kaist0015/korquad-open-ldbd3/168] is killed by OOM killer
Session[kaist0015/korquad-open-ldbd3/170] is killed by OOM killer
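For reading the messages above: in bash's "line 1: 106 Killed python ..." output, the number before "Killed" is the process ID of the terminated command, not an error code, which is why the two sessions show different numbers (106 and 113) for the same failure. "Killed" means the process received SIGKILL, which is what the OOM killer sends. Below is a minimal sketch that pulls the PID and command out of such a line. The names `KILLED_RE` and `parse_killed_line` are hypothetical helpers for illustration and are not part of NSML.

```python
import re
import signal

# Shape of the message bash prints when a child is terminated by SIGKILL:
#   /bin/bash: line 1: <PID> Killed <command>
KILLED_RE = re.compile(r"line \d+:\s+(?P<pid>\d+)\s+Killed\s+(?P<command>.+)")


def parse_killed_line(line):
    """Return (pid, command) from a bash 'Killed' message, or None if no match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("command").strip()


# Exit status a shell reports for a child ended by SIGKILL: 128 + signal number.
SIGKILL_EXIT_STATUS = 128 + int(signal.SIGKILL)

if __name__ == "__main__":
    sample = "/bin/bash: line 1: 106 Killed python -u run_squad.py --model_type electra"
    pid, command = parse_killed_line(sample)
    print(f"pid={pid}, command={command}")
    print(f"shell exit status for SIGKILL: {SIGKILL_EXIT_STATUS}")
```

The message alone does not say who sent the SIGKILL. Here the session log lines confirm it was the OOM killer.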
NSML Default: (without options)
nsml run -d XXX -e entery.py --memory 24G --shm-size 1G
To increase the memory, pass larger values, for example:
nsml run -d XXX -e entery.py --memory 28G --shm-size 2G
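If you take the other route and reduce the batch size, one common way to keep the same effective batch size (24 in the commands from this issue) is gradient accumulation. The sketch below only does the arithmetic. `split_batch` is a hypothetical helper, and it assumes your `run_squad.py` accepts `--gradient_accumulation_steps` as the upstream Hugging Face script does, which I have not verified for this baseline.

```python
def split_batch(effective_batch_size, max_per_step):
    """Split an effective batch into (per_step_batch_size, accumulation_steps).

    Picks the largest per-step batch size that is <= max_per_step and divides
    effective_batch_size evenly, so per_step * steps == effective_batch_size.
    """
    if effective_batch_size < 1 or max_per_step < 1:
        raise ValueError("both arguments must be positive integers")
    for per_step in range(min(max_per_step, effective_batch_size), 0, -1):
        if effective_batch_size % per_step == 0:
            return per_step, effective_batch_size // per_step


if __name__ == "__main__":
    per_step, steps = split_batch(effective_batch_size=24, max_per_step=8)
    # Assumed flags: check that your run_squad.py actually defines both.
    print(f"--per_gpu_train_batch_size {per_step} --gradient_accumulation_steps {steps}")
```

How much a smaller batch lowers host memory depends on the script, so it may still be necessary to raise `--memory` as well.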
Thank you for your kindness
When we run the model, we get "Killed" messages, but we don't know why :(
/bin/bash: line 1: 106 Killed python -u run_squad.py --model_type electra --model_name_or_path monologg/koelectra-base-v2-finetuned-korquad --do_train --do_eval --data_dir train --num_train_epochs 4 --per_gpu_train_batch_size 24 --per_gpu_eval_batch_size 24 --output_dir output --verbose_logging --overwrite_output_dir --version_2_with_negative
/bin/bash: line 1: 113 Killed python -u run_squad.py --model_type electra --model_name_or_path monologg/koelectra-base-v2-finetuned-korquad --do_train --do_eval --data_dir train --num_train_epochs 1 --per_gpu_train_batch_size 24 --per_gpu_eval_batch_size 24 --output_dir output --verbose_logging --overwrite_output_dir --version_2_with_negative
Two different sessions got '106 Killed' and '113 Killed'. Can you tell us why the sessions were killed?