allenai / vampire

Variational Methods for Pretraining in Resource-limited Environments
Apache License 2.0

Unexpected behavior changing the vocabulary size #32

Closed AbeHandler closed 5 years ago

AbeHandler commented 5 years ago

I am excited to try out this method! I found two small issues trying to change the vocab size with the current code.

Minor issue 1:

$ git reset --hard origin/master && python -m scripts.make_reference_corpus examples/ag/dev.jsonl examples/ag/reference --vocab-size 1000

I get the error "TypeError: '>=' not supported between instances of 'str' and 'int'".

I think the issue is just that you need to specify the type of the vocab_size argument in scripts/make_reference_corpus.py.

Changing line 49 of scripts/make_reference_corpus.py to `parser.add_option('--vocab-size', dest='vocab_size', default=None, type=int)` fixes the error for me.
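To illustrate why the `type=int` fix works, here is a minimal standalone sketch (not VAMPIRE's actual script): without `type=int`, optparse leaves option values as strings, so a later numeric comparison such as a frequency threshold against `vocab_size` raises exactly this TypeError.

```python
from optparse import OptionParser

parser = OptionParser()
# Hypothetical option mirroring the one in scripts/make_reference_corpus.py;
# type=int makes optparse coerce the command-line string to an integer.
parser.add_option('--vocab-size', dest='vocab_size', default=None, type=int)

options, _ = parser.parse_args(['--vocab-size', '1000'])
print(options.vocab_size >= 500)  # comparison now works; prints True
```

Without `type=int`, `options.vocab_size` would be the string `'1000'` and the `>=` comparison would fail as in the traceback above.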

Minor issue 2:

It seems like the vocab size is hardcoded into the VAMPIRE environment. So if you run `python -m scripts.train` after preprocessing with a vocabulary size other than 30K, you will get tensor mismatch errors from torch.

https://github.com/allenai/vampire/blob/d3662bdf8971e961076d799536a05a0a9c397536/environments/environments.py#L66
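A toy illustration in plain Python (hypothetical shapes, not VAMPIRE's real modules) of why training then fails: a weight matrix sized for the hardcoded vocabulary cannot consume a bag-of-words vector built from a smaller one.

```python
def matvec(matrix, vector):
    # Stand-in for the shape check torch performs on a linear layer's input.
    if len(matrix[0]) != len(vector):
        raise ValueError(
            f"size mismatch: matrix expects {len(matrix[0])}, got {len(vector)}"
        )
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

hardcoded_vocab = 30  # stands in for the hardcoded 30K
actual_vocab = 10     # stands in for a smaller --vocab-size
weights = [[0.0] * hardcoded_vocab for _ in range(4)]
try:
    matvec(weights, [0.0] * actual_vocab)
except ValueError as e:
    print(e)  # size mismatch: matrix expects 30, got 10
```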

Happy to put in a PR if that would be helpful, but that may be overkill.

dangitstam commented 5 years ago

Thanks for bringing this to our attention! Minor issue 1 is now taken care of.

As for the second issue, that's a great point. I believe the intent is that you coordinate between preprocessed data and the different environments explicitly; perhaps we can add a way to override that value from the command line.
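One possible shape for such an override, sketched here with hypothetical names (this is an assumption, not VAMPIRE's actual API): read the vocabulary size from an environment variable and fall back to the current hardcoded default.

```python
import os

def vocab_size_from_env(default=30000):
    # Hypothetical helper: let the caller override the hardcoded default
    # (e.g. VOCAB_SIZE=1000 python -m scripts.train ...).
    return int(os.environ.get("VOCAB_SIZE", default))

os.environ["VOCAB_SIZE"] = "1000"
print(vocab_size_from_env())  # -> 1000
```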

kernelmachine commented 5 years ago

The second issue is addressed by #34, so closing this!