facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.
MIT License

Training script stuck (using eval only) #26

Status: Open (opened by akshitdewan 3 years ago)

akshitdewan commented 3 years ago

I'm trying to run

python codegen_sources/model/train.py --eval_only True --reload_model 'TransCoder_model_2.pth,TransCoder_model_2.pth' --data_path "test_dataset" --exp_name transcoder --dump_path 'dump' --lgs 'java_sa-python_sa'  --bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa'  --ae_steps 'python_sa,java_sa'  --mt_steps 'java_sa-python_sa,python_sa-java_sa' --encoder_only False --emb_dim 1024 --n_heads 8 --n_layers 0 --n_layers_encoder 6  --n_layers_decoder 6 --eval_bleu true --eval_computation true --has_sentences_ids true

but it cannot find the following files in the test_dataset directory (downloaded from the link in the TransCoder doc):

test_dataset/train.java_sa.pth not found
test_dataset/valid.java_sa.pth not found
test_dataset/test.java_sa.pth not found
test_dataset/train.python_sa.pth not found
test_dataset/valid.python_sa.pth not found
test_dataset/test.python_sa.pth not found
test_dataset/train.java_sa-python_sa.java_sa.pth not found
test_dataset/train.java_sa-python_sa.python_sa.pth not found

and gets stuck after this log message:

SLURM job: False
0 - Number of nodes: 1
0 - Node ID        : 0
0 - Local rank     : 0
0 - Global rank    : 0
0 - World size     : 1
0 - GPUs per node  : 1
0 - Master         : True
0 - Multi-node     : False
0 - Multi-GPU      : False
0 - Hostname       : <host>

Any ideas why this might be happening?

I also tried running this translation script, which similarly seems to get stuck:

python -m codegen_sources.model.translate --src_lang python --tgt_lang java --model_path TransCoder_model_2.pth.1 --beam_size 1 < hello.py
adding to path /srv/home/akshit/CodeGen
INFO - 09/27/21 05:56:56 - 0:00:06 - ============ Model Reloading
INFO - 09/27/21 05:56:56 - 0:00:06 - Reloading encoder from TransCoder_model_2.pth.1 ...

baptisteroziere commented 3 years ago

It shouldn't cause any issues if these files are missing when you run evaluation only. I don't know why the script is getting stuck on your machine. Could you be running out of RAM and starting to swap to disk (that would slow everything down)? Can you run simple PyTorch operations in your environment without getting stuck? Can you load the models/datasets in your environment, e.g. torch.load("PATH/TransCoder_model_2.pth")?
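
The checks suggested above could be run as a small timed script, so that a slow or hanging step shows up by name. This is only a sketch: `timed` and `check_environment` are hypothetical helper names, not part of CodeGen, and the checkpoint path is whatever file you downloaded.

```python
import importlib
import time


def timed(label, fn):
    """Run fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s", flush=True)
    return result


def check_environment(checkpoint_path):
    """Time the import, a small tensor operation, and the checkpoint load."""
    # Imported here (not at module top) so a slow import is reported as its own step.
    torch = timed("import torch", lambda: importlib.import_module("torch"))

    # A small matrix product on the CPU; this should finish almost instantly.
    timed("cpu matmul", lambda: torch.ones(256, 256) @ torch.ones(256, 256))

    if torch.cuda.is_available():
        def gpu_matmul():
            a = torch.ones(256, 256, device="cuda")
            # .item() copies the result back to the host, forcing the GPU work to finish.
            return (a @ a).sum().item()

        # The first CUDA call also initialises the driver, which can be slow.
        timed("gpu matmul", gpu_matmul)

    # map_location="cpu" keeps the checkpoint tensors off the GPU while loading.
    checkpoint = timed(
        "torch.load", lambda: torch.load(checkpoint_path, map_location="cpu")
    )
    print("checkpoint type:", type(checkpoint).__name__)
```

Calling `check_environment("PATH/TransCoder_model_2.pth")` prints one line per step. If the output stops after a given label, the next step is the one that hangs. On recent PyTorch releases `torch.load` defaults to `weights_only=True`, so an older checkpoint may need `weights_only=False` passed explicitly, which should only be done for files you trust.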