Closed cdhx closed 2 years ago
Did you modify any hyper-parameters or any part of the model? As far as I remember, this command can run on a single GPU with 12 GB of memory. You can try reducing the batch_size and see how it works on your machine.
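If it helps, a minimal sketch of the idea, using the flag names from the training command posted later in this thread; the halved values below are only an illustration, not tuned settings:

```shell
# Sketch: rerun pre-training with smaller batch sizes to fit a smaller GPU.
# --batch_size / --test_batch_size are the flags main_nsm.py accepts in this repo;
# 10/20 are illustrative values (half of the defaults used in run_CWQ.sh).
CUDA_VISIBLE_DEVICES=0 python main_nsm.py --name CWQ --model_name gnn \
    --checkpoint_dir checkpoint/pretrain/ \
    --batch_size 10 --test_batch_size 20 \
    # ...keep the remaining flags from the original command unchanged
```

If it still runs out of memory, keep halving `--batch_size` (and `--test_batch_size`) until it fits.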
Thanks for your reply! I did not change any parameters; I will try it again. Another question: why does it keep running after hitting the error in the title? Does it not matter?
It matters. You should have checkpoint/CWQ_teacher/../pretrain/CWQ_nsm-final.ckpt (place this checkpoint in the checkpoint/pretrain folder) downloaded from Google Drive. Running the first command will then generate a teacher ckpt.
Here are the models in Google Drive (CWQ_report); which one should I choose?
teacher:

```
CWQ_parallel_teacher_gnn_js-f1.ckpt
CWQ_teacher_fb_gnn_js_80epoch-final.ckpt
```

student:

```
CWQ_fb_student_0.01-final.ckpt
CWQ_fb_student_0.01-h1.ckpt
CWQ_fb_student_0.01.log
CWQ_t_gnn-parallel_s_gnn_js_100epoch-final.ckpt
CWQ_t_gnn-parallel_s_gnn_js_100epoch-h1.ckpt
CWQ_t_gnn-parallel_s_gnn_js_100epoch.log
```
I checked again. You should first run the commented line:

```shell
CUDA_VISIBLE_DEVICES=0 python main_nsm.py --name CWQ --model_name gnn \
    --data_folder /home/hegaole/data/KBQA/Freebase/CWQ/ \
    --checkpoint_dir checkpoint/pretrain/ \
    --batch_size 20 --test_batch_size 40 \
    --num_step 4 --entity_dim 50 --word_dim 300 --kg_dim 100 --kge_dim 100 \
    --eval_every 2 --experiment_name CWQ_nsm --eps 0.95 --num_epoch 100 \
    --use_self_loop --lr 5e-4 --q_type seq --word_emb_file word_emb_300d.npy \
    --reason_kb --encode_type --loss_type kl
```

Then checkpoint/CWQ_teacher/../pretrain/CWQ_nsm-final.ckpt will be generated.
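To confirm the pre-training step actually produced the checkpoint before moving on to the teacher command, a quick check (path as given above):

```shell
# After pre-training finishes, the teacher pre-training checkpoint should exist here:
ls -lh checkpoint/pretrain/CWQ_nsm-final.ckpt
```

If `ls` reports "No such file or directory", the pre-training run did not complete and the later commands will fail to load the checkpoint.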
I ran run_CWQ.sh. Here is the full log; I have downloaded all the data files.
Why can it still keep running after getting the error?
How much memory does it need? My GPU has 24 GB but it still runs out of memory.
Thanks!