MattX / Milton

A searchable article database

Avoid parsing HTML twice #18

Closed MattX closed 4 years ago

MattX commented 4 years ago

I've traced the source of many, if not all, Milton save errors to processor timeouts (currently set to 15s).

Running the processor locally on Meditations on Moloch (because it's large), it takes ~1s to parse the DOM, 1.5s to sanitize, and 400ms to readabilize. Some of these operations, especially the parsing, may be benefiting from multithreading, because I see a large gap between user and real time.

I've tried to do some profiling, but didn't get great results, probably because I don't understand how half of these tools work. One thing I've realized is that we parse the DOM twice, which this PR fixes. On my machine, this brings sanitization time from 1.5s to 700ms.
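As a rough illustration of the parse-once idea (not the code in this PR), the sketch below parses the HTML a single time and hands the parsed document to both later stages. `ParsedDoc`, `parseHtml`, `sanitize`, `readabilize`, and `processArticle` are hypothetical stand-ins, and the regex-based bodies are toy placeholders rather than a real sanitizer or readability pass.

```typescript
// Hypothetical stand-in for a parsed DOM; a real processor would use
// the document type of whatever DOM library it depends on.
interface ParsedDoc {
  html: string;
}

// Counts parser invocations so the "parse once" property can be checked.
let parseCount = 0;

// Hypothetical parser stub (assumption: the real parse is the expensive step).
function parseHtml(html: string): ParsedDoc {
  parseCount += 1;
  return { html };
}

// Both stages take an already-parsed document instead of a raw HTML string,
// so neither of them needs to call the parser again.
function sanitize(doc: ParsedDoc): ParsedDoc {
  // Toy placeholder: drops <script> elements only. Not a real sanitizer.
  return { html: doc.html.replace(/<script[\s\S]*?<\/script>/gi, "") };
}

function readabilize(doc: ParsedDoc): string {
  // Toy placeholder: strips tags and surrounding whitespace.
  return doc.html.replace(/<[^>]+>/g, "").trim();
}

function processArticle(html: string): string {
  const doc = parseHtml(html); // the only parse in the pipeline
  return readabilize(sanitize(doc));
}
```

The design point is only the function signatures: once the later stages accept a parsed document, a second parse cannot creep back in by accident.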

I'm also adding manual timers, which are unfortunately gross, but we can look at them to determine what the breakdown is in the cloud function.
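A minimal sketch of what such manual timers could look like, assuming a small wrapper is acceptable. `timed` and `timings` are hypothetical names, not the ones used in this PR, and the stage bodies are placeholder work.

```typescript
// Hypothetical per-stage accumulator: label -> total wall-clock milliseconds.
const timings: Record<string, number> = {};

// Runs one stage and records how long it took, even if the stage throws.
function timed<T>(label: string, fn: () => T): T {
  const start = Date.now();
  try {
    return fn();
  } finally {
    timings[label] = (timings[label] ?? 0) + (Date.now() - start);
  }
}

// Placeholder work standing in for the real parse and sanitize stages.
const parsed = timed("parse", () => "<p>hi</p>".length);
const cleaned = timed("sanitize", () => parsed * 2);

// One log line per invocation keeps the breakdown easy to find in cloud logs.
console.log(JSON.stringify(timings));
```

`Date.now()` measures wall-clock time only, so it will not show the user versus real time gap mentioned above.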

Tested locally.