facebookresearch / SymbolicMathematics

Deep Learning for Symbolic Mathematics

multi-gpu #17

Closed Peng-weil closed 3 years ago

Peng-weil commented 3 years ago

Hello, I want to run your code on a dual-GPU machine, but I encountered the following problem:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "main.py", line 256, in <module>
    check_model_params(params)
  File "/public/home/pw/workspace/symbolicmathematics/SymbolicMathematics-master/src/model/__init__.py", line 27, in check_model_params
    assert os.path.isfile(params.reload_model)
AssertionError
Traceback (most recent call last):
  File "main.py", line 256, in <module>
    check_model_params(params)
  File "/public/home/pw/workspace/symbolicmathematics/SymbolicMathematics-master/src/model/__init__.py", line 27, in check_model_params
    assert os.path.isfile(params.reload_model)
AssertionError
Traceback (most recent call last):
  File "/public/home/pw/anaconda3/envs/mathematics/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/public/home/pw/anaconda3/envs/mathematics/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/public/home/pw/anaconda3/envs/mathematics/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/public/home/pw/anaconda3/envs/mathematics/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/public/home/pw/anaconda3/envs/mathematics/bin/python', '-u', 'main.py', '--local_rank=1', '--exp_name', 'first_eval', '--eval_only', 'true', '--reload_model', 'fwd.pth', '--tasks', 'prim_fwd', '--reload_data', 'prim_fwd,prim_fwd.train,prim_fwd.valid,prim_fwd.test', '--emb_dim', '1024', '--n_enc_layers', '6', '--n_dec_layers', '6', '--n_heads', '8', '--beam_eval', 'true', '--beam_size', '10', '--beam_length_penalty', '1.0', '--beam_early_stopping', '1', '--eval_verbose', '1', '--eval_verbose_print', 'false']' returned non-zero exit status 1.

I can run it to completion on a single graphics card. My complete command is as follows:

export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py --exp_name first_eval --eval_only true --reload_model "fwd.pth" --tasks "prim_fwd" --reload_data "prim_fwd,prim_fwd.train,prim_fwd.valid,prim_fwd.test" --emb_dim 1024 --n_enc_layers 6 --n_dec_layers 6 --n_heads 8 --beam_eval true --beam_size 10 --beam_length_penalty 1.0 --beam_early_stopping 1 --eval_verbose 1 --eval_verbose_print false

My Python version is 3.7.10, my PyTorch version is 1.3.0, and my torchvision version is 0.4.1. Maybe my version of PyTorch is wrong, or do I need to modify the default value of local_rank?

Thank you.

f-charton commented 3 years ago

Hello, the error comes from the path you pass to --reload_model: assert os.path.isfile(params.reload_model) is failing, which means Python cannot find the model file 'fwd.pth' in the directory you run the program from (i.e. the one where main.py is). Check the path, and try providing the absolute path (e.g. 'c:/User/me/SymbolicMaths/models/fwd.pth'). Note that the same problem might happen with --reload_data and --dump_path.
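To see where a relative path such as 'fwd.pth' actually points, a quick check along these lines can be run from the same directory as the launch command. This is only a sketch: describe_path is a hypothetical helper, not part of the repository, and it assumes relative paths are resolved against the current working directory, which is how os.path.isfile treats them.

```python
import os


def describe_path(path, base_dir=None):
    """Resolve `path` against `base_dir` (default: current working directory)
    and report whether a regular file exists there.

    Hypothetical helper for illustration; not part of SymbolicMathematics.
    """
    base_dir = base_dir or os.getcwd()
    # os.path.join leaves `path` unchanged when it is already absolute.
    resolved = os.path.abspath(os.path.join(base_dir, path))
    return resolved, os.path.isfile(resolved)


if __name__ == "__main__":
    # 'fwd.pth' is the value passed to --reload_model in the command above.
    resolved, found = describe_path("fwd.pth")
    print(f"{resolved}: {'found' if found else 'MISSING'}")
```

If the file is reported as missing, passing the printed absolute path of the real file to --reload_model should make the assertion pass. The same check can be applied to each file path given in --reload_data.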

Peng-weil commented 3 years ago

Thanks for your reply. I made some basic mistakes! @f-charton