facebookresearch / ELI5

Scripts and links to recreate the ELI5 dataset.

Validating the generated dataset #10

Closed arueckle closed 5 years ago

arueckle commented 5 years ago

Thanks for creating this exciting new QA corpus!

I have downloaded and processed the data as per the description in the "Data creation" section of the readme (with some obstacles, which I will report separately). Is there any way to validate that the generated dataset and the intermediary files are correct (e.g., through checksums)?
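
For instance, if SHA-256 sums of the released and intermediary files were published, a small script along these lines could verify local copies (a sketch only; the file names below are placeholders, not the actual release names):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file so large dumps do not need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical reference checksums; the real values would have to come with the release.
expected = {
    "explainlikeimfive_train.json": "<published sha256>",
    "explainlikeimfive_valid.json": "<published sha256>",
}

for name, value in expected.items():
    actual = sha256sum(name)
    print("OK  " if actual == value else "DIFF", name, actual)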

Background: I wanted to generate some outputs on the validation split using the MTL model you provided. It seems that the BPE dictionary of my binarized dataset contains a different number of entries than the pre-trained model you provided:

Running fairseq's interactive.py or generate.py, we get this output after the script has loaded the dataset:

[multitask_source_bpe] dictionary: 53064 types
[multitask_target_bpe] dictionary: 53120 types

This is then followed by an error:

RuntimeError: Error(s) in loading state_dict for TransformerModel:
    size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([52712, 1024]) from checkpoint, the shape in current model is torch.Size([53064, 1024]).
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([52864, 1024]) from checkpoint, the shape in current model is torch.Size([53120, 1024]).
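
For reference, the two sizes can be compared directly with a small script like this (a sketch; the paths are placeholders, and it assumes the checkpoint stores its weights under fairseq's usual 'model' key):

import torch

dict_path = "data-bin/dict.multitask_source_bpe.txt"   # placeholder path
ckpt_path = "multitask_checkpoint.pt.1"

# fairseq dictionaries are plain text with one "symbol count" entry per line;
# the model adds a few special symbols (<pad>, </s>, <unk>, ...) on top of these.
with open(dict_path, encoding="utf-8") as f:
    print("dictionary entries on disk:", sum(1 for _ in f))

# The checkpoint records the embedding shapes it was trained with.
state = torch.load(ckpt_path, map_location="cpu")
model = state["model"]
print("encoder embeddings:", tuple(model["encoder.embed_tokens.weight"].shape))
print("decoder embeddings:", tuple(model["decoder.embed_tokens.weight"].shape))
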
huihuifan commented 5 years ago

For the BPE, we made a few minor modifications to the dataset scripts for release, which can change the dictionary slightly. I will update the README so the dictionary is available for download.

huihuifan commented 5 years ago

Added the dictionary, see 64adbbdda26cdc2c9dc30d1e3d7f212dc9d7d901

arueckle commented 5 years ago

Thanks! I can now run generate.py with your trained model and the updated dictionary. However, it skips most of the examples in the dev split:

| WARNING: 9893 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[9052, 6593, 4710, 9081, 8042, 5242, 890, 7521, 7079, 3455]

When training a model myself, I can fix this by setting

--max-source-positions 4096 --max-target-positions 4096

(this still skips a very small number of examples).
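
For reference, a rough way to check how many examples would be skipped at a given limit is to count whitespace tokens in the pre-binarization BPE text files (a sketch; the file name is a placeholder for whatever the data-creation scripts produced locally):

def count_too_long(path, limit):
    # Count whitespace-tokenized lines longer than the limit.
    total = too_long = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if len(line.split()) > limit:
                too_long += 1
    return too_long, total

for limit in (1024, 4096):
    skipped, total = count_too_long("valid.multitask_source_bpe", limit)  # placeholder file name
    print("limit", limit, ":", skipped, "of", total, "source sequences over the limit")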

Adding these arguments during generation seems to have no effect (with the MTL model). Is there an argument/option missing from the example call listed after "To generate from the model:" in the readme? (I checked the fairseq documentation for generate.py, which does not provide much information on this.)

huihuifan commented 5 years ago

Hi, thanks for your comment. I forgot that fairseq's default is to skip examples that are too long; I usually raise the limits myself and did not remember that this isn't the default, sorry about that. Yes, please raise --max-source-positions and --max-target-positions. I believe that with 4096, the number of skipped examples is quite small.

arueckle commented 5 years ago

Ok, the results seem reasonable now. Running the MTL model on the generated dataset (test split) gives the following results:

{'rouge-1': {'f': 0.2847671357774766, 'p': 0.3017670193298072, 'r': 0.33106755140568883},
 'rouge-2': {'f': 0.05104244130744159, 'p': 0.04916310937403208, 'r': 0.0758954548011907},
 'rouge-l': {'f': 0.2302253705644382, 'p': 0.2739778629705043, 'r': 0.3009683755918875}}

which is only slightly below the scores from the paper. It was a bit tricky to figure out the right call, though (I needed to read through some parts of the fairseq code). Maybe it is interesting for others:

python generate.py <path/to/binarized/data> --path multitask_checkpoint.pt.1 --gen-subset test --nbest 1 --source-lang multitask_source_bpe --target-lang multitask_target_bpe --beam 5 --batch-size 2 --remove-bpe --no-repeat-ngram-size 3 --max-len-b 500 --min-len 200 --max-source-positions 4096 --max-target-positions 4096 --skip-invalid-size-inputs-valid-test --model-overrides "{'max_source_positions':4096, 'max_target_positions':4096}" > output-test-mtl.txt
grep -P '^T' ../output-test-mtl.txt | cut -f2- > reference-test-mtl.txt
grep -P '^H' ../output-test-mtl.txt | cut -f3- > generated-test-mtl.txt
python compute_rouge.py --hypotheses=generated-test-mtl.txt --references=reference-test-mtl.txt > rouge-test-mtl.txt
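
As a rough Python equivalent of the grep/cut steps (a sketch based only on the T/H line layout used above, with the same file names):

# T lines carry the reference in the second tab field, H lines the hypothesis in the third.
refs, hyps = [], []
with open("output-test-mtl.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("T-"):
            refs.append(line.rstrip("\n").split("\t", 1)[1])
        elif line.startswith("H-"):
            hyps.append(line.rstrip("\n").split("\t", 2)[2])

with open("reference-test-mtl.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(refs) + "\n")
with open("generated-test-mtl.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(hyps) + "\n")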

Computing PPL, though, still gives strange results (around 1500 PPL), and it is unclear how to run the evaluation for FILL-1 accuracy and ROUGE-20%.

python eval_lm.py <path/to/binarized/data> --task translation --path multitask_checkpoint.pt.1 --gen-subset test --source-lang multitask_source_bpe --target-lang multitask_target_bpe --batch-size 2 --remove-bpe --max-source-positions 4096 --max-target-positions 4096 --skip-invalid-size-inputs-valid-test --model-overrides "{'max_source_positions':4096, 'max_target_positions':4096, 'tokens_per_sample': 10000, 'add_bos_token': False}"
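
One thing that might be worth double-checking here (just a guess, not something confirmed in this thread) is how the loss is averaged and which log base it uses: perplexity is the exponential of the average per-token negative log-likelihood, and fairseq tools typically report losses in base 2 (so ppl = 2 ** loss), while raw model scores are usually natural-log. A minimal sketch of the conversion:

import math

def perplexity(total_nll, n_tokens, log_base=math.e):
    # Perplexity = base ** (average negative log-likelihood per token).
    return log_base ** (total_nll / n_tokens)

# The same average NLL of 4.6 per token reads very differently depending on the base:
print(perplexity(4.6 * 1000, 1000))              # ~99.5 if the loss is in nats
print(perplexity(4.6 * 1000, 1000, log_base=2))  # ~24.3 if the loss is in bits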