Saber is a deep-learning-based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress; many things are broken, and the codebase is not stable.
When batching data, Saber truncates or right-pads each sequence to a length of `saber.constants.MAX_SENT_LEN`.
Truncation should only happen on the train set, ensuring that we don't drop examples from the evaluation partitions (`dataset_folder/valid.*` and `dataset_folder/test.*`).
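The truncate/right-pad step can be sketched as follows. This is a minimal illustration, not Saber's actual implementation; the helper name `truncate_or_pad` and the `pad_value` parameter are hypothetical, and `max_len` stands in for `saber.constants.MAX_SENT_LEN`:

```python
def truncate_or_pad(seq, max_len, pad_value=0):
    """Return `seq` cut or right-padded to exactly `max_len` items.

    Hypothetical helper for illustration; `pad_value` stands in for
    whatever padding index the embedding layer reserves.
    """
    if len(seq) >= max_len:
        # Truncate: keep only the first max_len tokens.
        return seq[:max_len]
    # Right-pad: append pad_value until the sequence reaches max_len.
    return seq + [pad_value] * (max_len - len(seq))

# e.g. truncate_or_pad([1, 2, 3], 5) -> [1, 2, 3, 0, 0]
```

Per the note above, the truncating branch would only be taken for train-set batches; evaluation partitions would be padded to the longest sequence in the partition instead.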
Furthermore, a user should be able to specify a percentile (e.g. 0.99), which would set the max sequence length to the smallest length that truncates only 1% of all training examples. This would be a principled way to choose the value, and could lead to large reductions in training time when a handful of very long sentences would otherwise inflate the max length.
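The percentile-based selection could look something like the sketch below. The function name `choose_max_sent_len` and the `keep_fraction` parameter are assumptions for illustration, not part of Saber's API:

```python
import math

def choose_max_sent_len(train_lengths, keep_fraction=0.99):
    """Return the smallest length that leaves `keep_fraction` of the
    training sentences untruncated.

    `train_lengths` is a list of sentence lengths (in tokens) from the
    train set only. Hypothetical helper, not part of Saber's API.
    """
    ordered = sorted(train_lengths)
    # Index of the shortest length that covers keep_fraction of sentences.
    idx = math.ceil(keep_fraction * len(ordered)) - 1
    return ordered[idx]

# 100 sentences: 99 of lengths 1..99, plus one 1000-token outlier.
lengths = list(range(1, 100)) + [1000]
max_len = choose_max_sent_len(lengths, keep_fraction=0.99)  # -> 99
```

With `keep_fraction=0.99`, the single 1000-token outlier no longer dictates the padded length, so every batch shrinks from 1000 to 99 time steps.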