apache / incubator-annotator

Apache Annotator provides annotation enabling code for browsers, servers, and humans.
https://annotator.apache.org/
Apache License 2.0
218 stars 44 forks source link

Improve performance for large documents with many annotations #130

Open vrish88 opened 2 years ago

vrish88 commented 2 years ago

Hello and thank you for this wonderful project. It's provided some excellent shoulders to stand on.

Context

I'm extracting footnotes embedded in markdown and converting them annotations. Some of these markdown files have over 500k characters in them and have over 100 footnotes. After a quite circuitous route, I'm using mdast/hast/remark to convert the markdown into html and then loading the html into a jsdom Document.

The basic flow is like this:

The Problem

I found that extracting footnotes for some of the larger files was taking 7 - 10 minutes to process. Running a profiler, it looked like 70% of the time was spent determining if the node intersected the document/scope. image

That call is happening when the node is being converted to a chunk, which happens many times, per annotation. It is also only being used to ensure that the node is apart of the document (as far as I can tell).

The Solution

This PR removes that check. It improved the performance on my machine by 75% for the large files.

Behaviorally I think it is the same. The two things which invoke nodeToChunk appear to be already checking if those nodes are a part of the scope.

Treora commented 1 year ago

Sorry for the silence, and thanks a lot for the contribution! Glad to hear this library is useful to you; we’d love to hear in what way it helps you, for what purpose, and how it could be more helpful still!

So far we not focussed on actual performance, though we have kept it in mind while designing the APIs (hence the small functions, the generators, async functions everywhere, etc.). Profiling the code execution is a great start already.

I think I added that check myself. Within the class, you may well be right that we do not need such a check at all. I probably added for in case somebody would use the chunker in other ways, to not return an invalid chunk if the given node is outside of the scope. Perhaps we can split the method into a private method without the check, and keep the public method that does the check and then runs the private method?