CUMLSec / XDA


UnboundLocalError when pretraining with provided data. #2

Open artificial-insanity opened 3 years ago

artificial-insanity commented 3 years ago

Hi, I am having a problem while trying to replicate the pretraining process of the model. I am running on an Ubuntu 18.04.5 LTS (GNU/Linux 5.9.11-3-MANJARO x86_64) machine with one GeForce RTX 3090 GPU (NVIDIA-SMI 455.45.01, Driver Version: 455.45.01, CUDA Version: 11.1).

After running ./scripts/pretrain/preprocess-pretrain-all.sh to process the data provided in your repo under data-src/pretrain_all, and then running ./scripts/pretrain/pretrain-all.sh, I got the error UnboundLocalError: local variable 'num_updates' referenced before assignment. This happened after several "overflow detected, setting loss scale to: XX" messages. The full log is in the text file below.

pretrain_log.txt
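For readers unfamiliar with this error class, here is a minimal sketch of how Python raises UnboundLocalError in general. It is not the XDA or fairseq training code: run_epoch, run_epoch_safe, and the use of None to stand in for a skipped step are hypothetical names chosen for illustration. The point is only that a local variable assigned solely inside a loop body stays unbound if every iteration is skipped.

```python
def run_epoch(step_results):
    """Hypothetical loop: a None entry stands in for a skipped step."""
    for result in step_results:
        if result is None:
            # Step skipped, so num_updates is not assigned on this iteration.
            continue
        num_updates = result
    # If every step was skipped, num_updates was never bound and this raises
    # UnboundLocalError.
    return num_updates


def run_epoch_safe(step_results, num_updates=0):
    """Same loop, but num_updates has a value before the loop starts."""
    for result in step_results:
        if result is None:
            continue
        num_updates = result
    return num_updates


if __name__ == "__main__":
    print(run_epoch_safe([None, None]))
    try:
        run_epoch([None, None])
    except UnboundLocalError as exc:
        print("raised:", type(exc).__name__)
```

Whether this pattern is what happens in the pretraining script is an assumption; the attached log would be needed to confirm it.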

Does this mean I need to tweak the parameters in scripts/pretrain/pretrain-all.sh to get it running? Or do I need to use data other than what is provided in data-src/pretrain_all to run the model?

I am a novice to this whole thing, so please allow me to apologize in advance if this was not a good question. Thank you!

peikexin9 commented 3 years ago

Hi @artificial-insanity, no problem! The "overflow detected" warning is normal, but I have not encountered this UnboundLocalError before. Some searching suggests it may be a memory issue. Do you mind sending more logs? Also, can you monitor your GPU with watch -n 1 nvidia-smi and observe its status while executing the command?
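If a record of GPU memory over time is easier to share than a live watch session, the same monitoring could be scripted. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; parse_memory_csv and poll_gpu_memory are hypothetical helper names, not part of this repository.

```python
import subprocess
import time

# nvidia-smi CSV query: one line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn lines such as '1024, 24268' into (used_mib, total_mib) tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def poll_gpu_memory(interval_s=1.0, samples=5):
    """Print memory usage for each GPU once per interval (hypothetical helper)."""
    for _ in range(samples):
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        for idx, (used, total) in enumerate(parse_memory_csv(out)):
            print(f"GPU {idx}: {used} / {total} MiB")
        time.sleep(interval_s)
```

Calling poll_gpu_memory() in a second terminal while the pretraining script runs would print one line per GPU per second; redirecting that output to a file gives a log that can be attached alongside the training log.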