Naver-AI-Hackathon / cs492I


killed message #55

Closed DonghwanKIM0101 closed 3 years ago

DonghwanKIM0101 commented 3 years ago

When we run the model, we get a "Killed" message, but we don't know why :(

/bin/bash: line 1: 106 Killed python -u run_squad.py --model_type electra --model_name_or_path monologg/koelectra-base-v2-finetuned-korquad --do_train --do_eval --data_dir train --num_train_epochs 4 --per_gpu_train_batch_size 24 --per_gpu_eval_batch_size 24 --output_dir output --verbose_logging --overwrite_output_dir --version_2_with_negative
/bin/bash: line 1: 113 Killed python -u run_squad.py --model_type electra --model_name_or_path monologg/koelectra-base-v2-finetuned-korquad --do_train --do_eval --data_dir train --num_train_epochs 1 --per_gpu_train_batch_size 24 --per_gpu_eval_batch_size 24 --output_dir output --verbose_logging --overwrite_output_dir --version_2_with_negative

Two different sessions got '106 Killed' and '113 Killed'. Can you tell us why the sessions were killed?

bluebrush commented 3 years ago

Could you let me know the session IDs?

DonghwanKIM0101 commented 3 years ago

113 Killed for kaist0015/korquad-open-ldbd3/168 session and 106 Killed for kaist0015/korquad-open-ldbd3/170 session

DonghwanKIM0101 commented 3 years ago

Also 107 Killed for KAIST/0015/korquad-open-ldbd3/171 session :(

bluebrush commented 3 years ago

The processes were killed by the OOM (out-of-memory) killer. You can either reduce the batch size or increase the memory allocated to the session. The NSML default settings are shown below.

Session[kaist0015/korquad-open-ldbd3/168] is killed by OOM killer
Session[kaist0015/korquad-open-ldbd3/170] is killed by OOM killer

NSML Default: (without options)

nsml run -d XXX -e entery.py  --memory 24G --shm-size 1G

To give the session more memory, pass larger values, for example:

nsml run -d XXX -e entery.py  --memory 28G --shm-size 2G
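
To check how close a run gets to the memory limit before the OOM killer steps in, the training script can log its own peak memory use. The following is a minimal sketch using only the Python standard library, not something taken from run_squad.py. The helper names `peak_rss_gib` and `check_memory` are hypothetical, and treating the `24G` default as 24 GiB is an assumption. The figure covers only the calling process, so memory used by child processes (for example data-loading workers) and shared memory is not counted.

```python
import resource
import sys


def peak_rss_gib() -> float:
    """Return the peak resident set size of this process in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1024**3


def check_memory(limit_gib: float = 24.0, warn_ratio: float = 0.9) -> bool:
    """Print peak memory use; return True if it exceeds warn_ratio * limit_gib.

    limit_gib defaults to 24 on the assumption that the session runs with
    the default --memory 24G and that "G" means GiB.
    """
    used = peak_rss_gib()
    near_limit = used > warn_ratio * limit_gib
    note = "  <-- close to the limit" if near_limit else ""
    print(f"peak RSS: {used:.2f} GiB of {limit_gib:.0f} GiB{note}")
    return near_limit


if __name__ == "__main__":
    check_memory()
```

Calling `check_memory()` after the dataset features are loaded and again every few hundred training steps would show which phase drives memory use up, which helps decide between lowering the batch size and raising `--memory`.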

DonghwanKIM0101 commented 3 years ago

Thank you for your kindness