dennlinger / summaries

A toolkit for summarization analysis and aspect-based summarizers

Accelerate tokenization by batching #57

Open dennlinger opened 1 year ago

dennlinger commented 1 year ago

Currently, for token-based metrics, we potentially re-compute tokens for the same sample many times. Having the option to pre-compute tokens (in parallel) and then work on those directly would speed up the process significantly.
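
A minimal sketch of what batched pre-tokenization could look like with spaCy's `nlp.pipe`; the helper name `precompute_tokens`, the model `en_core_web_sm`, and the parameter defaults are assumptions for illustration, not the repository's actual code:

```python
# Sketch: tokenize all samples once with spaCy's batched nlp.pipe,
# then let token-based metrics reuse the cached token lists instead
# of re-tokenizing the same sample repeatedly.
import spacy

# Disable components we don't need for plain tokenization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def precompute_tokens(texts, batch_size=64, n_process=2):
    """Return one list of token strings per input text."""
    return [
        [token.text for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process)
    ]

# Metrics can then operate on the cached lists directly:
# token_lists = precompute_tokens(samples)
# score = some_token_metric(token_lists[i], token_lists[j])  # hypothetical metric
```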

dennlinger commented 1 year ago

This is not trivial: when spaCy is handed too many samples at once, it consumes a lot of main memory. To be safe, most of the processing currently runs on a single thread.
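
One way to keep memory bounded while still batching is to stream samples through `nlp.pipe` with a small batch size and a single worker process; this is a sketch under that assumption, not the approach actually used in the repository:

```python
# Sketch: bound spaCy's memory use by streaming samples through nlp.pipe
# in small batches instead of materializing Docs for the whole corpus.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize_streaming(texts, batch_size=32):
    """Yield token lists one sample at a time.

    Only one batch of Doc objects is kept in memory at a time, and
    n_process=1 avoids the extra copies created by multiprocessing workers.
    """
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=1):
        yield [token.text for token in doc]
```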