BaderLab / saber

Saber is a deep-learning-based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
https://baderlab.github.io/saber/
MIT License

Truncating should affect only the train set #166

Open · JohnGiorgi opened 5 years ago

JohnGiorgi commented 5 years ago

When batching data, Saber truncates or right-pads each sequence to a length of `saber.constants.MAX_SENT_LEN`.
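For reference, the current behaviour is roughly Keras-style padding applied uniformly to every partition (a minimal sketch; the exact call inside Saber may differ, and `MAX_SENT_LEN` here is a stand-in for `saber.constants.MAX_SENT_LEN`):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SENT_LEN = 100  # stand-in for saber.constants.MAX_SENT_LEN

# Toy token-id sequences: one short, one longer than MAX_SENT_LEN.
sequences = [[1, 2, 3], list(range(1, 151))]

# Every sequence, from every partition, is truncated or right-padded
# to exactly MAX_SENT_LEN.
padded = pad_sequences(sequences, maxlen=MAX_SENT_LEN,
                       padding='post', truncating='post')
print(padded.shape)  # (2, 100)
```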

Truncating sequences should only happen on the train set, ensuring that we never truncate examples in the evaluation partitions (`dataset_folder/valid.*` and `dataset_folder/test.*`).
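One way this could look, again assuming Keras-style padding (the helper name is hypothetical, not Saber's API):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SENT_LEN = 100  # stand-in for saber.constants.MAX_SENT_LEN

def pad_partition(sequences, train=False):
    """Right-pad sequences; truncate only the train partition."""
    if train:
        # Train set: cap lengths at MAX_SENT_LEN to bound memory and compute.
        return pad_sequences(sequences, maxlen=MAX_SENT_LEN,
                             padding='post', truncating='post')
    # Valid/test sets: pad to the longest sequence in the partition so that
    # no evaluation example is shortened.
    return pad_sequences(sequences, padding='post')
```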

Furthermore, a user should be able to specify a percentile (e.g. 0.99) that sets the maximum sequence length to the 99th percentile of training-sequence lengths, so that at most 1% of training examples are truncated. This would be a principled way to choose the value, and could yield big reductions in training time when a handful of very long sentences would otherwise inflate the padded length.
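A minimal sketch of the percentile idea using NumPy (the function name and signature are illustrative):

```python
import numpy as np

def max_len_from_percentile(train_sequences, percentile=0.99):
    """Sequence length at the given percentile of the train set.

    With percentile=0.99, at most ~1% of training sequences are longer
    than the returned value and would therefore be truncated.
    """
    lengths = [len(seq) for seq in train_sequences]
    return int(np.ceil(np.percentile(lengths, percentile * 100)))

# e.g. MAX_SENT_LEN = max_len_from_percentile(train_sequences)
```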