Patch description
During data creation and model training (Q-to-A) I came across several obstacles (also described in #10). Many of them have since been fixed; the remaining ones are addressed in this pull request:
Every second example is skipped in process_data_to_source_target.py
The input file names in process_data_to_source_target.py do not match the ones from data_creation/finalize_qda.py
The readme does not mention options such as --max-source-positions and --max-target-positions, which are necessary to run the generation script. For instance, running generation.py without max source and target positions skips 9893 examples from the validation split (almost all of them)
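To illustrate the first item, here is a minimal sketch of one common way a reader loop ends up dropping every second example. This is an assumed pattern for illustration only; it is not taken from process_data_to_source_target.py, and the function names read_buggy and read_fixed are hypothetical.

```python
import io


def read_buggy(handle):
    """Hypothetical bug pattern: the iterator is advanced twice per loop pass."""
    examples = []
    for line in handle:
        examples.append(line.strip())
        # The extra advance silently consumes the following line.
        next(handle, None)
    return examples


def read_fixed(handle):
    """Read every line exactly once."""
    return [line.strip() for line in handle]


# Illustrative input, not real data from the repository.
text = "a\nb\nc\nd\n"
buggy = read_buggy(io.StringIO(text))
fixed = read_fixed(io.StringIO(text))
print(buggy)  # ['a', 'c']
print(fixed)  # ['a', 'b', 'c', 'd']
```

Comparing the number of output examples with the number of input lines is a quick way to catch this kind of silent loss.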
Testing steps
For the readme changes, run the old commands unchanged. For instance, generation without --max-source-positions 4096 --max-target-positions 4096 will skip almost all examples (see log below).
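The effect of the position limits can be sketched as a simple size filter. This is a simplified illustration, assuming that a sample is skipped when its source or target length exceeds the corresponding limit. The helper split_by_size and the sample lengths are hypothetical and are not the generation script's actual code or data.

```python
def split_by_size(sizes, max_positions):
    """Split sample ids into kept and skipped.

    sizes: list of (source_length, target_length) pairs.
    max_positions: (max_source_positions, max_target_positions).
    """
    max_src, max_tgt = max_positions
    kept, skipped = [], []
    for idx, (src_len, tgt_len) in enumerate(sizes):
        if src_len <= max_src and tgt_len <= max_tgt:
            kept.append(idx)
        else:
            skipped.append(idx)
    return kept, skipped


# Made-up lengths: long sources, as is typical for long-document inputs.
sizes = [(900, 200), (3500, 300), (4000, 1500), (5000, 100)]

kept_default, skipped_default = split_by_size(sizes, (1024, 1024))
kept_large, skipped_large = split_by_size(sizes, (4096, 4096))
print(kept_default, skipped_default)  # [0] [1, 2, 3]
print(kept_large, skipped_large)      # [0, 1, 2] [3]
```

With limits of (1024, 1024), most of these long samples are dropped, while (4096, 4096) keeps all but the longest one.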
Logs
| WARNING: 9893 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[9052, 6593, 4710, 9081, 8042, 5242, 890, 7521, 7079, 3455]
Other information