Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

can tokenisation be done just once for every validation item? #5

Closed: cmacdonald closed this issue 5 years ago

cmacdonald commented 5 years ago

Validation with large cutoffs and large numbers of queries can be slower than training. Are there any optimisations that could be made, e.g. tokenising just once rather than on every iteration?

seanmacavaney commented 5 years ago

Yes, pre-computing the wordpiece tokens would save time on each validation iteration. But based on measurements I've taken previously, tokenization is a very small component of the total runtime; the network itself takes considerably more time to run than all the other parts. Upcoming enhancements to cuDNN (and corresponding changes to PyTorch) should improve the performance of BERT's self-attention components, which will speed up validation in the future.
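
For reference, a minimal sketch of what pre-computing the tokens could look like. The tokenizer class and the caching helper here are assumptions for illustration, not CEDR's actual code:

```python
# Hypothetical memoization wrapper around a wordpiece tokenizer.
# The BertTokenizer import and model name are assumptions for this sketch.
from functools import lru_cache
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

@lru_cache(maxsize=None)
def tokenize_cached(text):
    # Texts seen on earlier validation iterations hit the cache instead of
    # re-running wordpiece tokenization; tuples are returned so results are hashable.
    return tuple(tokenizer.tokenize(text))
```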

In the meantime, perhaps consider using a smaller subset of the validation data.
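
For example, something along these lines (hypothetical names; `valid_qids` is assumed to hold the validation query ids, and the sample size of 50 is arbitrary):

```python
# Evaluate on a fixed random subset of validation queries to cut validation time.
import random

random.seed(42)  # fixed seed so the same subset is used across epochs
valid_subset = random.sample(sorted(valid_qids), k=min(50, len(valid_qids)))
```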

cmacdonald commented 5 years ago

Yes, I have gone that route. Thanks for your input on the measurements. Closing.