BaderLab / saber

Saber is a deep-learning-based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
https://baderlab.github.io/saber/
MIT License

Truncating should affect only the train set #166

Open · JohnGiorgi opened 5 years ago

JohnGiorgi commented 5 years ago

When batching data, Saber truncates or right-pads each sequence to a length of `saber.constants.MAX_SENT_LEN`.
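For reference, the current behaviour is roughly Keras-style padding applied uniformly to every partition (a minimal sketch; the exact call inside Saber may differ, and `MAX_SENT_LEN` here is a stand-in for `saber.constants.MAX_SENT_LEN`):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SENT_LEN = 100  # stand-in for saber.constants.MAX_SENT_LEN

# Toy token-id sequences: one short, one longer than MAX_SENT_LEN.
sequences = [[1, 2, 3], list(range(1, 151))]

# Every sequence, from every partition, is truncated or right-padded
# to exactly MAX_SENT_LEN.
padded = pad_sequences(sequences, maxlen=MAX_SENT_LEN,
                       padding='post', truncating='post')
print(padded.shape)  # (2, 100)
```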

Truncating sequences should only happen on the train set, ensuring that we never truncate examples in the evaluation partitions (`dataset_folder/valid.*` and `dataset_folder/test.*`).
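One way this could look, again assuming Keras-style padding (the helper name is hypothetical, not Saber's API):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SENT_LEN = 100  # stand-in for saber.constants.MAX_SENT_LEN

def pad_partition(sequences, train=False):
    """Right-pad sequences; truncate only the train partition."""
    if train:
        # Train set: cap lengths at MAX_SENT_LEN to bound memory and compute.
        return pad_sequences(sequences, maxlen=MAX_SENT_LEN,
                             padding='post', truncating='post')
    # Valid/test sets: pad to the longest sequence in the partition so that
    # no evaluation example is shortened.
    return pad_sequences(sequences, padding='post')
```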

Furthermore, a user should be able to specify a percentile (e.g. 0.99) that sets the maximum sequence length to the 99th percentile of training-sequence lengths, so that at most 1% of training examples are truncated. This would be a principled way to choose the value, and could yield big reductions in training time when a handful of very long sentences would otherwise inflate the padded length.
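A minimal sketch of the percentile idea using NumPy (the function name and signature are illustrative):

```python
import numpy as np

def max_len_from_percentile(train_sequences, percentile=0.99):
    """Sequence length at the given percentile of the train set.

    With percentile=0.99, at most ~1% of training sequences are longer
    than the returned value and would therefore be truncated.
    """
    lengths = [len(seq) for seq in train_sequences]
    return int(np.ceil(np.percentile(lengths, percentile * 100)))

# e.g. MAX_SENT_LEN = max_len_from_percentile(train_sequences)
```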