EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License

Time of Preprocessing #32

Closed yikuan8 closed 3 years ago

yikuan8 commented 3 years ago

Thanks for the great repo. I tested the preprocessing script. It processes 100 notes per minute, which puts the total ETA at 15 days. Any ideas for speeding this up, or did you spend a similar amount of time?

EmilyAlsentzer commented 3 years ago

You're right that the preprocessing script takes a while to run. The code could definitely be sped up (e.g. through MapReduce or multiprocessing). I haven't experimented with this myself, but it looks like spaCy has a nice example of how to use it with joblib.

If you end up speeding this up, let us know and we'll incorporate it into the repo.
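
For anyone landing here later, the multiprocessing idea mentioned above could look roughly like the sketch below. This is not the repo's code: `process_note` is a hypothetical placeholder for the per-note work, and `chunked`, `process_chunk`, `preprocess_parallel`, `n_workers`, and `chunk_size` are names made up for illustration. It uses the standard library's `ProcessPoolExecutor` rather than joblib, purely to keep the example dependency-free.

```python
import re
from concurrent.futures import ProcessPoolExecutor
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_note(note: str) -> List[str]:
    """Hypothetical placeholder for the per-note work (assumption, not the repo's logic).

    Here: a naive split after sentence-ending punctuation followed by whitespace.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def process_chunk(notes: List[str]) -> List[List[str]]:
    """Process one batch of notes inside a single worker process."""
    # In a real pipeline, any expensive model should be loaded once per worker,
    # not once per note.
    return [process_note(n) for n in notes]


def preprocess_parallel(notes: Sequence[str], n_workers: int = 4,
                        chunk_size: int = 1000) -> List[List[str]]:
    """Split `notes` into chunks, process the chunks in parallel, keep input order."""
    chunks = list(chunked(notes, chunk_size))
    if n_workers <= 1:
        # Serial fallback, handy for debugging.
        results = [process_chunk(c) for c in chunks]
    else:
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            # pool.map returns results in the same order as the input chunks.
            results = list(pool.map(process_chunk, chunks))
    # Flatten the per-chunk results back into one list with one entry per note.
    return [sentences for chunk in results for sentences in chunk]


if __name__ == "__main__":
    demo = ["Pt admitted with chest pain. Troponin negative.",
            "No acute distress."] * 5
    out = preprocess_parallel(demo, n_workers=2, chunk_size=3)
    print(len(out), out[0])
```

If the per-note work is CPU-bound, the best case is a speedup roughly proportional to the number of worker processes, minus the cost of shipping notes and results between processes. Larger chunks reduce that overhead, while smaller chunks balance load better when note lengths vary a lot.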