Rough performance for a few different PDFs:
- PDF 1: Get proxy: 114.056 ms; Get metadata: 2.644 ms; Get words: 756.888 ms; Get rare words and definitions: 689.292 ms
- PDF 2: Get proxy: 147.215 ms; Get metadata: 4.056 ms; Get words: 3.419 s; Get rare words and definitions: 225.678 ms
- PDF 3: Get proxy: 142.509 ms; Get metadata: 1.034 ms; Get words: 1.812 s; Get rare words and definitions: 521.86 ms
Given these figures, the first thing to optimise will be how we approach parsing words from the base PDF.
Update: Streaming is not really a viable option, due to the way we match words. There would be no guarantee that the words would be of sufficient rarity unless we added a rarity threshold that could be passed to the corpus object or one of its functions. For now, I would rather take the approach of using a faster PDF package and keep the implementation details as similar as possible.
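For illustration only, here is a minimal sketch of what that rejected streaming approach would have required: a rarity threshold supplied by the caller and checked word by word as the stream arrives. `Corpus`, `frequencyOf`, and `rarityThreshold` are assumed names for this sketch, not the project's actual API.

```ts
// Hypothetical shape of the corpus lookup needed for streaming filtering.
interface Corpus {
  // Relative frequency of a word in the corpus; lower means rarer.
  frequencyOf(word: string): number;
}

// Filter a stream of words as they arrive, keeping only those rarer than
// the caller-supplied threshold (e.g. 1e-6). Without such a threshold there
// is no way to decide rarity until the whole document has been seen.
async function* rareWords(
  words: AsyncIterable<string>,
  corpus: Corpus,
  rarityThreshold: number
): AsyncGenerator<string> {
  for await (const word of words) {
    if (corpus.frequencyOf(word) < rarityThreshold) {
      yield word;
    }
  }
}
```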
Using the Tika Python binding, it takes between 0.35 and 0.40 seconds to parse the PDF, which is around twice the speed of PDF.js. Worth looking into the Node.js binding for Tika.
Update: Decided to use multithreading to speed up PDF processing. Essentially, we use Node's worker_threads module to split the PDF across a viable number of workers.
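For reference, a rough single-file sketch of the page-splitting idea using Node's worker_threads. This is not the project's actual implementation: `extractWordsFromPages`, the page count, and the worker cap are assumptions, and it assumes the file is run as plain CommonJS JavaScript so that `__filename` points at runnable code.

```ts
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";
import { cpus } from "node:os";

interface PageRange {
  start: number; // first page in the chunk (inclusive)
  end: number;   // last page in the chunk (inclusive)
}

if (isMainThread) {
  const totalPages = 1400;                           // assumed page count
  const workerCount = Math.min(cpus().length, 8);    // assumed cap on workers
  const chunk = Math.ceil(totalPages / workerCount);

  const tasks: Promise<string[]>[] = [];
  for (let i = 0; i < workerCount; i++) {
    const range: PageRange = {
      start: i * chunk + 1,
      end: Math.min((i + 1) * chunk, totalPages),
    };
    if (range.start > range.end) break; // fewer chunks than workers

    tasks.push(
      new Promise<string[]>((resolve, reject) => {
        // Re-run this same file as the worker script, handing it one page range.
        const worker = new Worker(__filename, { workerData: range });
        worker.once("message", resolve);
        worker.once("error", reject);
      })
    );
  }

  Promise.all(tasks).then((perWorkerWords) => {
    const words = perWorkerWords.flat();
    console.log(`extracted ${words.length} words`);
  });
} else {
  const range = workerData as PageRange;

  // Hypothetical per-range extraction; the real code would call its PDF
  // library (e.g. PDF.js) for pages range.start..range.end here.
  const extractWordsFromPages = async (r: PageRange): Promise<string[]> =>
    Array.from({ length: r.end - r.start + 1 }, (_, i) => `page-${r.start + i}`);

  extractWordsFromPages(range).then((words) => parentPort!.postMessage(words));
}
```

Since each worker only parses its own page range, the speed-up grows with document size, which matches the benchmarks below.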
Very rough general benchmarks below (on local machine):

- 200-page PDF: 1.2x faster
- 400-page PDF: 2x faster
- 800-page PDF: 2.5x faster
- 1400-page PDF: 3x faster
Very rough general benchmarks below (on Railway server):

- 200-page PDF: 1.62x faster
- 400-page PDF: 2x faster
- 800-page PDF: 2.81x faster
- 1400-page PDF: 3.37x faster
There are a few possible ways to go about this.