dj311 opened this issue 1 year ago
Hoping @roshanshariff can weigh in on this, but will also note this is the kind of example we've previously discussed as likely better handled by #681. @aikrahguzar had a promising prototype, but there were still issues to sort out IIRC.
Sweet -- thanks for taking a look. #681 seems like a nice alternative approach. There are some rough edges in the emacs-async approach that might make it hard to justify this feature in the main citar package. I'll keep an eye on both these issues (and let me know if you need anything).
We considered implementing this, but there is a big bottleneck with the emacs-async approach. It spawns an emacs sub-process to parse the bibliography file into a hash table containing the strings needed by citar, but then the subprocess needs to serialize that hash table as text and send it over a pipe to the main emacs process, which then parses the text back into a hash table. I'm guessing this latter serialization/deserialization is taking ~30 seconds for you. So, in the end, you would still end up with a ~2 minute parse time (during which emacs is responsive and an inferior emacs process is working in the background) and then 30 seconds where it's unresponsive, receiving the parsed data back. This is an improvement, but it's still very undesirable and I'm not sure whether it's worth the extra complexity.
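To make the round trip concrete, the emacs-async flow described above looks roughly like this. This is a hedged sketch, not citar's actual code: it assumes the async.el package and a recent parsebib with `parsebib-parse`; `my/bib-file` and `my/bib-cache` are placeholder names.

```elisp
;; Sketch of the emacs-async approach and its serialisation cost.
(require 'async)

(defvar my/bib-file "~/refs.bib")   ; placeholder path
(defvar my/bib-cache nil)           ; hash table, once received

(async-start
 ;; Runs in a child Emacs process.
 `(lambda ()
    (require 'parsebib)
    (let ((entries (parsebib-parse ,my/bib-file))  ; hash table
          (result nil))
      ;; The child can only send `read'-able text back over the pipe,
      ;; so flatten the hash table before returning it.  This print +
      ;; re-read round trip is the cost discussed above.
      (maphash (lambda (k v) (push (cons k v) result)) entries)
      result))
 ;; Runs in the parent Emacs once the child finishes.
 (lambda (alist)
   (let ((table (make-hash-table :test #'equal)))
     (dolist (pair alist)
       (puthash (car pair) (cdr pair) table))
     (setq my/bib-cache table))))
```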
I think a better solution in the long run would be to use emacs' multithreading support; we could do the parsing in a separate thread (with no user interaction or buffer modification), and the resulting hash table would be immediately accessible because it's in the same memory space. But, at the time, I was put off by the apparent immaturity of emacs multithreading. I'm not sure if the situation has improved in emacs 29. I don't have the time to experiment with it again right now, but it would be good to get a status update.
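For reference, a minimal sketch of what the threading approach could look like (assuming Emacs 26+ `make-thread`; the function and variable names are hypothetical):

```elisp
;; Parse in a background thread: the resulting hash table lives in the
;; same heap, so no serialisation round trip is needed.  Caveat from
;; the discussion above: Emacs threads are cooperative, so a real
;; parser would need to yield (e.g. via `thread-yield') periodically
;; to keep the UI responsive.
(defvar my/bib-cache nil
  "Hash table of parsed entries, set by the background thread.")

(defun my/parse-bib-in-thread (file)
  "Parse FILE in a background thread and store the result."
  (make-thread
   (lambda ()
     (require 'parsebib)
     ;; `parsebib-parse' as-is runs to completion without yielding;
     ;; an interruptible version would yield inside its entry loop.
     (setq my/bib-cache (parsebib-parse file)))
   "bib-parser"))
```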
Yet another option would be to make parsebib's parsing interruptible: it could do its parsing inside a `while-no-input` loop, perhaps one iteration per entry. After parsing each entry, it could save how far into the file it has reached. If Emacs receives any input, the parsing would be paused and then resumed on idle, continuing from where it left off. I haven't looked deeply into how much work this would be, but it should be possible.
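The interruptible-parsing idea can be sketched with `while-no-input` plus an idle timer. All names here are hypothetical; real code would live inside parsebib and use its per-entry parser:

```elisp
;; Parse one entry at a time inside `while-no-input'.  If the user
;; types, `while-no-input' aborts the body and returns t; an idle
;; timer resumes from the saved position later.
(defvar my/bib-parse-pos 1
  "Buffer position where parsing should resume.")

(defun my/parse-some-entries (buffer table)
  "Parse entries from BUFFER into TABLE until input arrives.
Return non-nil when the whole buffer has been parsed."
  (with-current-buffer buffer
    (goto-char my/bib-parse-pos)
    (let ((outcome
           (while-no-input
             (while (not (eobp))
               ;; Stand-in for parsebib's per-entry parser; it should
               ;; add the entry to TABLE and move point past it.
               (my/parse-next-entry table)
               (setq my/bib-parse-pos (point)))
             'done)))
      ;; `while-no-input' returns the body's value only if it ran to
      ;; completion, so `done' means the buffer is fully parsed.
      (eq outcome 'done))))

(defun my/parse-on-idle (buffer table)
  "Keep parsing BUFFER into TABLE during idle time."
  (unless (my/parse-some-entries buffer table)
    (run-with-idle-timer 0.5 nil #'my/parse-on-idle buffer table)))
```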
Where I left off, I had a version of this: it started with a temporary buffer into which the raw entries were inserted. At each call of the completion table, the buffer was searched for the first word of the query, and a limited number of matches were parsed and removed from the buffer. The parsed entries were added to the cache, and the cache was then passed as the collection. So in the beginning the search results were not the optimal ones, and they gradually improved as more searches filled the cache; on the other hand, some matches were presented immediately, without a wait.
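If I understand the prototype correctly, its shape is something like the following. This is a loose reconstruction, not the actual code; `my/parse-and-remove-entry-at-point` is a hypothetical helper:

```elisp
(require 'subr-x)  ; for `hash-table-keys'

(defvar my/bib-raw-buffer nil
  "Temporary buffer holding the raw, not-yet-parsed .bib text.")
(defvar my/bib-cache (make-hash-table :test #'equal)
  "Entries parsed so far, keyed by citation key.")

(defun my/bib-completion-table (string pred action)
  "Programmed completion table that parses entries on demand.
Searches the raw buffer for the first word of STRING, parses a
limited number of matching entries into the cache, then completes
against the cache."
  (let ((word (car (split-string string))))
    (when (and word (buffer-live-p my/bib-raw-buffer))
      (with-current-buffer my/bib-raw-buffer
        (goto-char (point-min))
        (let ((count 0))
          (while (and (< count 50) (search-forward word nil t))
            ;; Hypothetical helper: parse the entry around point into
            ;; the cache and delete its text from the raw buffer.
            (my/parse-and-remove-entry-at-point my/bib-cache)
            (setq count (1+ count))))))
    (complete-with-action action (hash-table-keys my/bib-cache)
                          string pred)))
```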
Ah, interesting approach @aikrahguzar. Do you know whether it's possible to stream more entries in (i.e. update the collection as entries are parsed) with this approach, even without the user updating the query string? I know that consult does this for things like `consult-grep`, but I wasn't sure if that's done generically through the completing-read interface, or by hooking into the internals of particular completion front-ends.
I don't think this can be done with completing-read alone; some minibuffer hacking will be required. I think for consult this is done by `consult--completion-refresh-hook`, which has `consult-vertico--refresh` as its sole entry for me. Without relying on the completion UI, the best that can be done is probably to call `abort-minibuffers` and then call `completing-read` again with the last query as initial input, but I think that is going to cause flashing.
Thanks for looking into this!
Re. performance issues with the async prototype:
> We considered implementing this, but there is a big bottleneck with the emacs-async approach. It spawns an emacs sub-process to parse the bibliography file into a hash table containing the strings needed by citar, but then the subprocess needs to serialize that hash table as text and send it over a pipe to the main emacs process, which then parses the text back into a hash table. I'm guessing this latter serialization/deserialization is taking ~30 seconds for you. So, in the end, you would still end up with a ~2 minute parse time (during which emacs is responsive and an inferior emacs process is working in the background) and then 30 seconds where it's unresponsive, receiving the parsed data back. This is an improvement, but it's still very undesirable and I'm not sure whether it's worth the extra complexity.
I figured out what the bottleneck was: the format string (and preformatted strings) weren't surviving async.el's serialisation. This meant that when citar checked the bibliography, it needed to regenerate the preformatted strings. I've tweaked my personal config to fix that problem (and updated the gist).
With these fixes, it takes around 10 seconds before the first render after calling `citar-insert-edit` (the time is split roughly 2/3 finishing the async work, 1/3 `citar-format-candidates`). Subsequent renders take 3-5 seconds.
> I think a better solution in the long run would be to use emacs' multithreading support;
Agreed. I think the issues I've had with serialisation and deserialisation point towards Emacs multithreading being a more stable and performant option than emacs-async. And it looks like the implementation should be simpler than the current async prototype.
Re. programmed completion:
It seems like programmed completion is the way to go!
I think my performance tests above demonstrate this quite well. Even with the cache loaded and the preformatted strings populated, `citar-format-candidates` still takes around 3-5 seconds every time I initiate a command like `citar-insert-edit`. A programmed completion function that allows `citar-format-candidates` to process only the relevant candidates would help a lot with that.
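As a sketch of what "programmed completion" means here (hypothetical helper names; this is not citar's actual API beyond `completing-read` itself):

```elisp
;; Instead of formatting every candidate up front, hand
;; `completing-read' a function.  Completion only ever needs the raw
;; keys; the expensive formatting can be deferred to an
;; `affixation-function' (Emacs 28+), which is called only with the
;; candidates actually being displayed.
(defun my/cite-key-table (string pred action)
  "Complete over citation keys, formatting candidates lazily."
  (if (eq action 'metadata)
      '(metadata (affixation-function . my/affix-cite-keys))
    ;; `my/all-cite-keys' is a cheap, hypothetical keys-only lookup.
    (complete-with-action action (my/all-cite-keys) string pred)))

(defun my/affix-cite-keys (keys)
  "Return (KEY PREFIX SUFFIX) triples, formatting only KEYS.
`my/format-entry' stands in for per-entry formatting work."
  (mapcar (lambda (key) (list key "" (my/format-entry key))) keys))

;; Usage:
;; (completing-read "Cite: " #'my/cite-key-table)
```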
What do you think of this approach? Do the full parse in the background, but use programmed completion for the formatting side, so that `citar-format-candidates` need only format relevant matches. (With the caveat that I don't really know much about bibtex or citar,) I think this combination might be optimal, because if we were instead to selectively parse the bibtex file via programmed completion, that could cause issues for bibliographies that use cross-refs.
I'll defer to @roshanshariff and @aikrahguzar on the technical details, but just note: we don't only do bibtex/biblatex; we also support CSL JSON.
**Is your feature request related to a problem? Please describe.**
I use a precompiled bibliography file that is pretty hefty and tends to cause performance issues in the Emacs packages I've tried. Once citar has filled its cache, it tends to be relatively snappy. But the first time I use it in an editing session, it blocks Emacs for around 2 minutes.

**Describe the solution you'd like**
It would be sweet if citar loaded the bibliography cache in the background when a new buffer is opened. That way, the first time I try to insert/modify a citation, it should be more responsive.

**Describe alternatives you've considered**
N/A

**Additional context**
N/A
I've got a working prototype of this in my config.

It uses emacs-async to run a batch of `(citar--update-bibliography bib)` calls in the background (triggered by `latex-mode-hook`). When you next try to use citar, say by calling `citar-insert-edit`, it will check for an in-progress background task, wait for it to complete (if it hasn't already), initialise citar's cache with the preprocessed bib objects, then continue. Let me know if you want this feature for the package and I can put together a pull request.
(Note that this implementation probably needs quite a bit of work. E.g. the first call to `citar-insert-edit` is still a little slow for me: it takes around 30 seconds, compared to around 5 seconds for follow-up calls. I'm not sure of the cause.)
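For anyone wanting to try something similar, the wiring described above is roughly as follows. This is a sketch, not the actual gist: it assumes the async.el package, and `citar--update-bibliography` is an internal citar function whose exact signature may differ.

```elisp
(require 'async)

(defvar my/citar-async-future nil
  "Pending background parse handle, or nil.")

(defun my/citar-prefetch ()
  "Kick off background parsing of the configured bibliographies."
  (unless my/citar-async-future
    (setq my/citar-async-future
          (async-start
           `(lambda ()
              ;; Copy `citar-bibliography' into the child Emacs.
              ,(async-inject-variables "\\`citar-bibliography\\'")
              (require 'citar)
              ;; Parse each bibliography; the results come back over
              ;; the pipe (modulo the serialisation caveats above).
              (mapcar #'citar--update-bibliography citar-bibliography))))))

(add-hook 'latex-mode-hook #'my/citar-prefetch)

;; Before a citar command runs, one would block on the future with
;; `async-get' (if it hasn't finished) and install the results into
;; citar's cache -- details elided; see the thread and gist.
```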