Rough performance for a few different PDFs:
- PDF 1: Get proxy: 114.056 ms; Get metadata: 2.644 ms; Get words: 756.888 ms; Get rare words and definitions: 689.292 ms
- PDF 2: Get proxy: 147.215 ms; Get metadata: 4.056 ms; Get words: 3.419 s; Get rare words and definitions: 225.678 ms
- PDF 3: Get proxy: 142.509 ms; Get metadata: 1.034 ms; Get words: 1.812 s; Get rare words and definitions: 521.86 ms
Given these figures, the first thing to optimise will be how we approach parsing words from the base PDF.
Update: Streaming is not really a viable option, due to the way we match words. There would be no guarantee that the words would be of sufficient rarity unless we added a rarity threshold that could be passed to the corpus object or one of its functions. For now, I would rather take the approach of using a faster PDF package and keep the implementation details as similar as possible.
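For illustration only, here is a minimal sketch of what that rejected streaming approach would have required: a rarity threshold supplied by the caller and checked word by word as the stream arrives. `Corpus`, `frequencyOf`, and `rarityThreshold` are assumed names for this sketch, not the project's actual API.

```ts
// Hypothetical shape of the corpus lookup needed for streaming filtering.
interface Corpus {
  // Relative frequency of a word in the corpus; lower means rarer.
  frequencyOf(word: string): number;
}

// Filter a stream of words as they arrive, keeping only those rarer than
// the caller-supplied threshold (e.g. 1e-6). Without such a threshold there
// is no way to decide rarity until the whole document has been seen.
async function* rareWords(
  words: AsyncIterable<string>,
  corpus: Corpus,
  rarityThreshold: number
): AsyncGenerator<string> {
  for await (const word of words) {
    if (corpus.frequencyOf(word) < rarityThreshold) {
      yield word;
    }
  }
}
```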
Using the Tika Python binding, it takes between 0.35 and 0.40 seconds to parse the PDF, which is around twice the speed of PDF.js. Worth looking into the Node.js binding for Tika.
Update: Decided to use multithreading to speed up PDF processing. Essentially, we use Node's worker_threads module to split the PDF across a viable number of workers.
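For reference, a rough single-file sketch of the page-splitting idea using Node's worker_threads. This is not the project's actual implementation: `extractWordsFromPages`, the page count, and the worker cap are assumptions, and it assumes the file is run as plain CommonJS JavaScript so that `__filename` points at runnable code.

```ts
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";
import { cpus } from "node:os";

interface PageRange {
  start: number; // first page in the chunk (inclusive)
  end: number;   // last page in the chunk (inclusive)
}

if (isMainThread) {
  const totalPages = 1400;                           // assumed page count
  const workerCount = Math.min(cpus().length, 8);    // assumed cap on workers
  const chunk = Math.ceil(totalPages / workerCount);

  const tasks: Promise<string[]>[] = [];
  for (let i = 0; i < workerCount; i++) {
    const range: PageRange = {
      start: i * chunk + 1,
      end: Math.min((i + 1) * chunk, totalPages),
    };
    if (range.start > range.end) break; // fewer chunks than workers

    tasks.push(
      new Promise<string[]>((resolve, reject) => {
        // Re-run this same file as the worker script, handing it one page range.
        const worker = new Worker(__filename, { workerData: range });
        worker.once("message", resolve);
        worker.once("error", reject);
      })
    );
  }

  Promise.all(tasks).then((perWorkerWords) => {
    const words = perWorkerWords.flat();
    console.log(`extracted ${words.length} words`);
  });
} else {
  const range = workerData as PageRange;

  // Hypothetical per-range extraction; the real code would call its PDF
  // library (e.g. PDF.js) for pages range.start..range.end here.
  const extractWordsFromPages = async (r: PageRange): Promise<string[]> =>
    Array.from({ length: r.end - r.start + 1 }, (_, i) => `page-${r.start + i}`);

  extractWordsFromPages(range).then((words) => parentPort!.postMessage(words));
}
```

Since each worker only parses its own page range, the speed-up grows with document size, which matches the benchmarks below.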
Very rough general benchmarks below (on local machine):

- 200-page PDF: 1.2x faster
- 400-page PDF: 2x faster
- 800-page PDF: 2.5x faster
- 1400-page PDF: 3x faster
Very rough general benchmarks below (on Railway server):

- 200-page PDF: 1.62x faster
- 400-page PDF: 2x faster
- 800-page PDF: 2.81x faster
- 1400-page PDF: 3.37x faster
There are a few possible ways to go about this.