BaderLab / saber

Saber is a deep-learning based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
https://baderlab.github.io/saber/
MIT License

Provide option to load all pre-trained embeddings. #92

Closed by JohnGiorgi 5 years ago

JohnGiorgi commented 5 years ago

Currently, when you provide Saber a file of pre-trained embeddings, only embeddings for words that appear in the training dataset are loaded into memory. This is fine for evaluation, but hurts performance in two cases:

  1. Transfer learning: The embeddings are only loaded for words in the source dataset. This leads to less coverage for the target dataset.
  2. Deployment: When deploying a trained model for inference, it would be better if all pre-trained embeddings were loaded, minimizing the number of out-of-vocabulary tokens we have to perform inference on.

To fix this problem, provide an option to load all pre-trained embeddings from the given file, rather than only those for words that appear in the training dataset.
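A minimal sketch of the two loading modes, assuming a word2vec-style text file of embeddings (one word per line, followed by its vector). The function name `load_embeddings` and the `vocab` parameter are hypothetical, not Saber's actual API:

```python
import numpy as np

def load_embeddings(filepath, vocab=None):
    """Load word2vec-style text embeddings from `filepath`.

    If `vocab` is a set of words, keep only embeddings for words in it
    (the current behaviour, which filters to the training vocabulary).
    If `vocab` is None, load every embedding in the file (the proposed
    option, useful for transfer learning and deployment).
    """
    embeddings = {}
    with open(filepath) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if vocab is None or word in vocab:
                embeddings[word] = np.asarray(parts[1:], dtype="float32")
    return embeddings
```

With `vocab=None`, the returned dictionary covers the whole pre-trained vocabulary, so a target dataset (transfer learning) or unseen inference-time text (deployment) hits far fewer out-of-vocabulary tokens, at the cost of higher memory use.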