Saber is a deep learning-based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
Currently, when you provide Saber a file of pre-trained embeddings, only embeddings for words that appear in the training dataset are loaded into memory. This is fine for evaluation, but hurts performance in two cases:
- Transfer learning: embeddings are only loaded for words in the source dataset, which leads to poorer coverage of the target dataset's vocabulary.
- Deployment: when deploying a trained model for inference, it would be better to load all pre-trained embeddings, minimizing the number of out-of-vocabulary tokens encountered at inference time.
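For context, the current behaviour amounts to filtering the pre-trained embeddings file against the training vocabulary. A minimal sketch of that filtering, using hypothetical names (`load_embeddings`, `word_to_idx`) that are assumptions rather than Saber's actual API:

```python
import numpy as np

def load_embeddings(filepath, word_to_idx):
    """Load pre-trained word vectors, keeping only words present in `word_to_idx`."""
    embeddings = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.rstrip().split(' ')
            # Words outside the training vocabulary are silently dropped,
            # which is what limits coverage at transfer/deployment time.
            if word in word_to_idx:
                embeddings[word] = np.asarray(vector, dtype='float32')
    return embeddings
```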
To fix this problem:
- [x] Add a `load_all_embeddings` argument to `config.py`. Make sure it is added to all other files it needs to appear in.
- [x] In `embeddings.py`, allow the user to pass a `load_all` flag in order to load all the embeddings (see the sketch after this list).
- [x] Update all relevant unit tests.
- [x] In `saber.py`, figure out how to update each dataset's `type_to_idx` mappings.
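A rough sketch of how the `load_all` flag could tie these pieces together, again using hypothetical names rather than Saber's actual internals: when the flag is set, every pre-trained vector is kept and the dataset's word-to-index mapping is extended so the newly loaded words receive indices.

```python
import numpy as np

def load_embeddings(filepath, word_to_idx, load_all=False):
    """Load pre-trained word vectors; keep every vector when `load_all` is True."""
    embeddings = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.rstrip().split(' ')
            if load_all or word in word_to_idx:
                embeddings[word] = np.asarray(vector, dtype='float32')

    if load_all:
        # Extend the dataset's type-to-index mapping so words seen only in the
        # embeddings file still get an index (appended after existing entries).
        for word in embeddings:
            word_to_idx.setdefault(word, len(word_to_idx))

    return embeddings
```

Appending new words after the existing indices keeps the mapping for the original training vocabulary unchanged, which matters when the flag is used with an already-trained model.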