For the BPE, we made a few minor modifications to the dataset scripts for release, which could change the dictionary slightly. I will update the README so that the dictionary is available for download.
Added the dictionary, see 64adbbdda26cdc2c9dc30d1e3d7f212dc9d7d901
Thanks! I can now run generate.py with your trained model and the updated dictionary. However, running it skips most of the examples in the dev split:
| WARNING: 9893 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[9052, 6593, 4710, 9081, 8042, 5242, 890, 7521, 7079, 3455]
When training a model myself, I can fix this by setting
--max-source-positions 4096 --max-target-positions 4096
(this still skips a very small number of examples).
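As a rough check of how many examples a given limit would drop, the pre-binarization BPE text files can be scanned directly. This is only a sketch; the file name is an assumption based on the usual fairseq preprocessing layout, not taken from the repo.

# Sketch: count examples whose BPE-tokenized source exceeds a length limit.
# "valid.multitask_source_bpe" is an assumed name for the pre-binarization text file.
def count_too_long(path, limit):
    total = too_long = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if len(line.split()) > limit:
                too_long += 1
    return too_long, total

for limit in (1024, 4096):
    skipped, total = count_too_long("valid.multitask_source_bpe", limit)
    print(f"limit={limit}: {skipped}/{total} source sequences over the limit")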
Adding these arguments during generation seems to have no effect (with the MTL model). Is there an argument or option missing from the example call listed after "To generate from the model:" in the README? (I checked the fairseq documentation for generate.py, which does not provide much information on this.)
Hi, thanks for your comment. I forgot that fairseq's default is to skip examples that are too long; I usually do not use this behavior and did not remember that my usual setting is not the default, sorry about that. Yes, please raise --max-source-positions and --max-target-positions. I believe that with 4096 the number of skipped examples is quite small.
Ok, the results seem reasonable now. Running the MTL model on the generated dataset (test split) gives the following results:
{'rouge-1': {'f': 0.2847671357774766, 'p': 0.3017670193298072, 'r': 0.33106755140568883}, 'rouge-2': {'f': 0.05104244130744159, 'p': 0.04916310937403208, 'r': 0.0758954548011907}, 'rouge-l': {'f': 0.2302253705644382, 'p': 0.2739778629705043, 'r': 0.3009683755918875}}
which is only slightly below the scores from the paper. It was a bit more difficult to figure out the right call, though (I needed to read through some parts of the fairseq code). Maybe it is useful for others:
python generate.py <path/to/binarized/data> --path multitask_checkpoint.pt.1 --gen-subset test --nbest 1 --source-lang multitask_source_bpe --target-lang multitask_target_bpe --beam 5 --batch-size 2 --remove-bpe --no-repeat-ngram-size 3 --max-len-b 500 --min-len 200 --max-source-positions 4096 --max-target-positions 4096 --skip-invalid-size-inputs-valid-test --model-overrides "{'max_source_positions':4096, 'max_target_positions':4096}" > output-test-mtl.txt
grep -P '^T' ../output-test-mtl.txt | cut -f2- > reference-test-mtl.txt
grep -P '^H' ../output-test-mtl.txt | cut -f3- > generated-test-mtl.txt
python compute_rouge.py --hypotheses=generated-test-mtl.txt --references=reference-test-mtl.txt > rouge-test-mtl.txt
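For completeness, the scoring step itself boils down to roughly the following. This is only a sketch assuming the rouge pip package (which produces the nested f/p/r dictionaries shown above); the repo's compute_rouge.py may use different options or preprocessing.

# Sketch of the ROUGE scoring step, assuming the `rouge` pip package;
# compute_rouge.py in the repo may differ in options and preprocessing.
from rouge import Rouge

with open("generated-test-mtl.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("reference-test-mtl.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

scores = Rouge().get_scores(hyps, refs, avg=True)
print(scores)  # {'rouge-1': {'f': ..., 'p': ..., 'r': ...}, 'rouge-2': ..., 'rouge-l': ...}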
Computing PPL, though, still gives weird results (around 1500 PPL), and it is unclear how to run the evaluation for fill-1 accuracy and ROUGE-20%. For the PPL, this is the call I used:
python eval_lm.py <path/to/binarized/data> --task translation --path multitask_checkpoint.pt.1 --gen-subset test --source-lang multitask_source_bpe --target-lang multitask_target_bpe --batch-size 2 --remove-bpe --max-source-positions 4096 --max-target-positions 4096 --skip-invalid-size-inputs-valid-test --model-overrides "{'max_source_positions':4096, 'max_target_positions':4096, 'tokens_per_sample': 10000, 'add_bos_token': False}"
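For what it's worth, perplexity is just the exponentiated average per-token loss, so the reported numbers can be sanity-checked with a couple of lines; the main thing to get right is the log base of the loss that eval_lm reports for the fairseq version in use. A minimal sketch:

import math

# Sketch: recover perplexity from an average per-token loss.
# Which formula applies depends on the log base of the reported loss
# (check whether your fairseq version reports a base-2 or natural-log loss).
def ppl_from_base2_loss(avg_loss):
    return 2 ** avg_loss

def ppl_from_natural_log_loss(avg_nll):
    return math.exp(avg_nll)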
Thanks for creating this exciting new QA corpus!
I have downloaded and processed the data as described in the "Data creation" section of the README (with some obstacles, which I will report separately). Is there any way to validate that the generated dataset and the intermediate files are correct (e.g., through checksums)?
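In the absence of published checksums, something like this would at least let different people compare their intermediate files with each other. A sketch; the file names are placeholders, not the actual output names.

import hashlib

# Sketch: print a SHA-256 checksum per file so that different users can
# compare their processed/intermediate files. File names are placeholders.
def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for name in ["valid.multitask_source_bpe", "valid.multitask_target_bpe"]:
    print(name, sha256_of(name))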
Background: I wanted to generate some outputs on the validation split using the MTL model you provided. It seems that the BPE dictionary of my binarized dataset contains a different number of entries than the one the pre-trained model was trained with:
Running fairseq's interactive.py or generate.py, we get the following output after the script has loaded the dataset:
This is then followed by an error:
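For anyone hitting the same mismatch, a quick way to narrow it down is to compare the vocabulary size of the dict file produced by preprocessing with the one provided for download. A sketch; the paths are assumptions based on fairseq's dict naming, not the actual locations.

# Sketch: compare the number of entries in two fairseq dict files
# (one "<token> <count>" entry per line). Paths are assumptions.
def dict_size(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

print("local preprocessing:", dict_size("data-bin/dict.multitask_source_bpe.txt"))
print("provided download  :", dict_size("downloaded/dict.multitask_source_bpe.txt"))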