Question about argument: --serialization-dir

juexZZ commented 6 years ago

According to the tutorial: "The serialization-dir argument specifies the directory where the model's vocabulary and checkpointed weights will be saved." So when I tried to run a BiDAF model with the default config, I ran into a configuration error:

Serialization directory (/mydir) already exists

My command is just like the one in the tutorial; I only changed the JSON file to bidaf.json and the serialization dir to /mydir. Is this supposed to be an existing directory so that the vocabulary and weights can be saved there? If so, why does this error occur? Or am I missing something due to lack of experience? How do I use this argument correctly? Thanks!

matt-gardner commented 6 years ago

The train command will create the directory that you specify, and put all of its output in that directory. We don't want to clobber anything that already exists in the directory, so we refuse to continue if the directory is already there.

We've talked about allowing for the case where the directory already exists, but is empty, and that seems like a reasonable thing to support, but it's not done yet. You just need to be sure you give a path that doesn't already exist as the --serialization-dir.
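For anyone scripting their runs, here is a minimal sketch of one way to always hand the train command a path that does not exist yet. The helper name `fresh_serialization_dir` and the `-1`, `-2` suffix scheme are illustrative assumptions, not part of allennlp. The helper deliberately does not create the directory, since the train command does that itself.

```python
import os


def fresh_serialization_dir(base: str) -> str:
    """Return `base` if nothing exists at that path yet.

    Otherwise try `base-1`, `base-2`, ... and return the first path
    that does not exist. Hypothetical helper, not an allennlp API.
    The directory is NOT created here; the train command creates it.
    """
    candidate = base
    suffix = 1
    while os.path.exists(candidate):
        candidate = f"{base}-{suffix}"
        suffix += 1
    return candidate


if __name__ == "__main__":
    # Pass the printed path as the value of --serialization-dir.
    print(fresh_serialization_dir("/mydir"))
```

If an earlier run's output is no longer needed, deleting the old directory by hand works just as well; the sketch only avoids overwriting anything by accident.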

juexZZ commented 6 years ago

@matt-gardner Thank you! Actually, I've run into another problem. When I run BiDAF, an AssertionError occurs:

File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 165, in get_parameters
    assert filter_dim_a.prod() == filter_dim_a[0]
AssertionError

The bidaf.json config file is unchanged except for the data paths, which I don't think are responsible for the problem. It seems to have something to do with cudnn, but none of my coworkers report any problem with our GPUs. My cuda version is 8.0.44 and cudnn is 6021. As for pytorch, I installed 0.3.1 (post3) from the official website, following the instructions in your tutorials. Has this been reported before? Do you know what causes this error? Could it be a version mismatch? Some people discussed a similar problem in https://github.com/pytorch/pytorch/issues/5667. And most importantly, how can I fix it? Thanks for your help!

Traceback details, if needed:

Traceback (most recent call last):
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/disk1/private/mydir/allennlp/allennlp/run.py", line 18, in <module>
    main(prog="python -m allennlp.run")
  File "/data/disk1/private/mydir/allennlp/allennlp/commands/__init__.py", line 65, in main
    args.func(args)
  File "/data/disk1/private/mydir/allennlp/allennlp/commands/train.py", line 99, in train_model_from_args
    args.recover)
  File "/data/disk1/private/mydir/allennlp/allennlp/commands/train.py", line 129, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover)
  File "/data/disk1/private/mydir/allennlp/allennlp/commands/train.py", line 285, in train_model
    trainer_params)
  File "/data/disk1/private/mydir/allennlp/allennlp/training/trainer.py", line 892, in from_params
    model = model.cuda(cuda_device)
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
    self.flatten_parameters()
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 111, in flatten_parameters
    params = rnn.get_parameters(fn, handle, fn.weight_buf)
  File "/data/disk1/private/mydir/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 165, in get_parameters
    assert filter_dim_a.prod() == filter_dim_a[0]
AssertionError

matt-gardner commented 6 years ago

I'm afraid that's not an allennlp issue, it's a pytorch / cuda issue, and I don't know how to help you resolve it. You could try just using our docker image, as that environment is known to have working package versions installed. You could also grab the docker image just to compare versions of the various libraries, so you can be sure to install ones that work.
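To make the version comparison concrete, here is a rough sketch of a script that could be run both on the host and inside the docker image, so the two outputs can be compared side by side. The function name `collect_versions` is made up for illustration. It relies on `torch.__version__`, `torch.version.cuda`, and `torch.backends.cudnn.version()`, and simply records an explanatory string if pytorch cannot be imported or the cudnn query fails.

```python
import platform


def collect_versions() -> dict:
    """Gather the library versions relevant to a pytorch / cuda mismatch.

    Hypothetical helper for comparing two environments; missing
    pieces are reported as strings instead of raising.
    """
    versions = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        versions["torch"] = "not installed"
        return versions

    versions["torch"] = torch.__version__
    # CUDA version pytorch was built against (None for CPU-only builds).
    versions["cuda (build)"] = torch.version.cuda
    try:
        # cudnn version as an integer, or None if cudnn is unavailable.
        versions["cudnn"] = torch.backends.cudnn.version()
    except Exception as error:  # e.g. a cudnn version mismatch at load time
        versions["cudnn"] = f"query failed: {error}"
    return versions


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Any line that differs between the two environments is a candidate for the mismatch; this only reports versions and does not diagnose the assertion itself.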

allenai / allennlp

Question about argument: --serialization-dir #1093