emacs-citar / citar

Emacs package to quickly find and act on bibliographic references, and edit org, markdown, and latex academic documents.
GNU General Public License v3.0
523 stars 55 forks

First load is too slow #659

Closed bdarcus closed 2 years ago

bdarcus commented 2 years ago

On the latest commit (really, I think, since the change to the new cache system) I've noticed that citar takes 25-30 seconds to load when first called (and the same for any subsequent change to the bib file). I have a somewhat large .bib file (~7,200 entries), but it did not take this long to load previously. Is there a way to avoid this without splitting my bib file into multiple smaller ones (something I am not intending to do)? Thanks!

Originally posted by @mclearc in https://github.com/emacs-citar/citar/issues/653#issuecomment-1186601992

roshanshariff commented 2 years ago

@mclearc, I'll have to look into what's causing this, since in my testing a bibliography file of that size took much less time. Would you be willing to share your bibliography, privately if you prefer?

mclearc commented 2 years ago

Sure -- here it is. bibfile.bib.zip

mclearc commented 2 years ago

Oh and if it is of use here's my config:

(use-package citar
  :straight (:host github :repo "emacs-citar/citar")
  :commands (citar-open-beref
             citar-open-notes
             citar-insert-citation)
  :bind (:map citar-map
         ("b" .  #'citar-open-beref))
  :custom
  ;; Use with org citation
  (org-cite-global-bibliography `(,lem-bibliography))
  (citar-bibliography `(,lem-bibliography))
  (org-cite-insert-processor 'citar)
  (org-cite-follow-processor 'citar)
  (org-cite-activate-processor 'citar)
  :config
  ;; use embark with at-point
  (setq citar-at-point-function 'embark-act)
  (setq citar-default-action 'citar-open-beref)
  ;; add beref entry for bookends
  (setq citar-additional-fields '("doi" "url" "beref"))
  (setq citar-templates
        `((main . " ${=key= id:15} ${title:48}")
          (suffix . "${author editor:30}  ${=type=:12}  ${=beref=:12} ${tags keywords:*}")
          (preview . "${author editor} (${year issued date}) ${title}, ${journal journaltitle publisher container-title collection-title}.\n")
          (note . ,lem-citar-note)))
  (when (display-graphic-p)
    (setq citar-symbols
          `((file ,(all-the-icons-octicon "file-pdf"      :face 'error) . " ")
            (note ,(all-the-icons-octicon "file-text"     :face 'warning) . " ")
            (link ,(all-the-icons-octicon "link-external" :face 'org-link) . " "))))
  ;; edit notes
  (setq citar-notes-paths `(,lem-bib-notes))

  ;; Citar & Bookends
  (defun citar-get-beref (entry)
    (let* ((field (citar-has-a-value '(beref) entry))
           (base-url (pcase field
                       ('beref "bookends://sonnysoftware.com/"))))
      (when field
        (concat base-url (citar-get-value field entry)))))

  (defun citar-open-beref (keys-entries)
    "Open bookends link associated with the KEYS-ENTRIES in bookends.

With prefix, rebuild the cache before offering candidates."
    (interactive (list (citar-select-refs
                        :rebuild-cache current-prefix-arg)))
    (dolist (key-entry keys-entries)
      (let ((link (citar-get-beref (cdr key-entry))))
        (if link
            (browse-url-default-browser link)
          (message "No ref found for %s" key-entry))))))
bdarcus commented 2 years ago

Just putting this note from Roshan here:

I did some preliminary investigation ... and on my (fairly old) computer I don't see the problem right away (loads that bibliography in ~5 sec). I'll have to investigate a little more to see where the slowdown is.

aikrahguzar commented 2 years ago

After some profiling on a file with about 1500 entries, it seems that about half the time is spent inside parsebib-clean-TeX-markup, which does much more than the rough cleanup we had before, so it might be responsible for a significant chunk of the time. Setting parsebib-TeX-markup-replace-alist to nil makes generating the cache instant for me, while with the default value it takes a few seconds.

I think this was a recent addition to parsebib, so it might be responsible for the slowdown rather than the new caching mechanism.

Edit: For me, evaluating (benchmark-run-compiled 3 (citar-cache--get-bibliography "bibfile" t)) takes 2.2 sec with parsebib-TeX-markup-replace-alist bound to nil versus 13.1 sec with the default value.
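
For reference, the nil timing corresponds to evaluating the same form inside a let-binding, roughly like this ("bibfile" is a placeholder, as in the form above):

;; Same benchmark, with the replacement alist let-bound to nil.
(let ((parsebib-TeX-markup-replace-alist nil))
  (benchmark-run-compiled 3
    (citar-cache--get-bibliography "bibfile" t)))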

bdarcus commented 2 years ago

That might make sense @aikrahguzar.

@joostkremers added that on June 16, so actually before we merged the cache update.

Does that timing fit your experience @mclearc?

Still, that doesn't explain the discrepancy in performance between the two of them.

If, in fact, that explains it, any thoughts on what to do, @roshanshariff?

I wonder also if there's room for optimizations in that parsebib function (though I can't see how)?

joostkremers commented 2 years ago

Well, parsebib-clean-TeX-markup goes through every item in parsebib-TeX-markup-replace-alist for every field value and runs replace-regexp-in-string on it, which is of course a hell of a lot of regexp matching. I'm not surprised it takes time.

One could remove some items from parsebib-TeX-markup-replace-alist and keep only the ones that matter to oneself. The value of parsebib-TeX-markup-replace-alist isn't very transparent, but the definition in parsebib.el is. It'd be easy to copy that to one's personal init file and customise it.
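
For concreteness, something along these lines in an init file (the entries below are made up for illustration; the real ones, and their exact (REGEXP . REPLACEMENT) shape, are in parsebib.el):

(with-eval-after-load 'parsebib
  (setq parsebib-TeX-markup-replace-alist
        '(("---" . "—")                      ; em-dash
          ("\\\\emph{\\([^}]+\\)}" . "\\1")  ; drop \emph{...}, keep the text
          ("\\\\&" . "&"))))                 ; escaped ampersand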

I wonder also if there's room for optimizations in that parsebib function (though I can't see how)?

Profiling the code suggests that replace-regexp-in-string is the most time-consuming part. Perhaps building larger, more complex regexps that test a whole set of TeX markup items at once, instead of testing each one separately, might speed things up. (I'd ask on emacs-devel first whether that could really speed up processing, because it'd be quite a bit of work to implement.)

Another option that comes to mind: citar could perhaps do the loading of the .bib file in the background, as soon as the user opens a .tex, .org or Markdown file. That wouldn't speed things up, but it would make the bibliography instantly available when the user accesses it for the first time, provided it's not accessed immediately after opening the text file (which I suspect will often be the case).
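
A rough user-side approximation of that idea, just to illustrate. It leans on citar-cache--get-bibliography, the internal function used in the benchmarks in this thread, so it is not a supported interface and the call shape is assumed:

(defun my/citar-warm-cache ()
  "Start parsing the configured bibliographies once Emacs goes idle."
  (run-with-idle-timer
   1 nil
   (lambda ()
     (dolist (bib citar-bibliography)
       ;; Mirrors the call used in the benchmark earlier in this thread.
       (ignore-errors (citar-cache--get-bibliography bib t))))))

(dolist (hook '(org-mode-hook LaTeX-mode-hook markdown-mode-hook))
  (add-hook hook #'my/citar-warm-cache))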

Alternatively, perhaps the conversion can be done on the fly, only for the field values that are actually shown to the user. I really have no idea if that would even be possible to implement, though.

aikrahguzar commented 2 years ago

Still, that doesn't explain the discrepancy in performance between the two of them.

My own experience, on a laptop less than a year old, is more in line with 15-20 seconds for a bib file with 7200 entries.

aikrahguzar commented 2 years ago

Well, parsebib-clean-TeX-markup goes through every item in parsebib-TeX-markup-replace-alist for every field value and runs replace-regexp-in-string on it, which is of course a hell of a lot of regexp matching. I'm not surprised it takes time.

I think this time is well spent, and the big default is a good thing, since it would be a pain to recreate in a personal config and it does lead to much nicer-looking results. The citar documentation should make people with large bib files aware of it; then, as you said, they can customize parsebib-TeX-markup-replace-alist as they wish.

Profiling the code suggests that replace-regexp-in-string is the most time-consuming part. Perhaps building larger, more complex regexps that test a whole set of TeX markup items at once, instead of testing each one separately, might speed things up. (I'd ask on emacs-devel first whether that could really speed up processing, because it'd be quite a bit of work to implement.)

My impression is that building a huge regexp with a lot of parenthesized sub-expressions should be faster. Another thing that could help is to make the elements of parsebib-TeX-markup-replace-alist field-specific. From a cursory look, most of them are useful primarily for titles and some others for names, while things like dates, journal names, file names, etc. probably don't need any replacements. This should lead to far fewer regexps and would also allow citar to get rid of citar-display-transform-functions.

bdarcus commented 2 years ago

I suppose one simple thing we could do is add a shadow variable that we use to let-bind the parsebib defvar (rough sketch below), so we can:

  1. Maybe restrict this to just a few key fields.
  2. Allow users to customize the list.
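
Roughly, the shape would be something like this (citar-markup-replace-alist is a hypothetical option name; parsebib-parse with :display t is the call shown in the benchmarks below):

(defcustom citar-markup-replace-alist nil   ; hypothetical new user option
  "Replacements citar passes to parsebib when parsing a bibliography."
  :type 'alist
  :group 'citar)

;; ...and wherever citar parses, let-bind the parsebib variable:
(let ((parsebib-TeX-markup-replace-alist citar-markup-replace-alist))
  (parsebib-parse citar-bibliography :display t))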

One could also always use CSL-JSON instead, and/or there's #653, which I still think may be the ideal solution longer term, once we've optimized everything else.

joostkremers commented 2 years ago

My impression is that building a huge regexp with a lot of parenthesized sub-expressions should be faster.

If I find some time, I can try it out and see if it brings any improvements. My laptop is grossly underpowered, so it will be noticeable immediately. :laughing:

One possibly complicating factor is that the replacement in this case needs to be a function, which would need to look up the actual replacement string in an alist. That may slow things down again.

Another thing that could help is to make the elements of parsebib-TeX-markup-replace-alist field-specific. From a cursory look, most of them are useful primarily for titles and some others for names, while things like dates, journal names, file names, etc. probably don't need any replacements. This should lead to far fewer regexps and would also allow citar to get rid of citar-display-transform-functions.

Yeah, I added a commit a few days ago that provides an exclusion list. Currently, it only contains file, doi and url, but it may make sense to add more fields to it.
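
For anyone who wants to extend it from their own config in the meantime, something like this should work (assuming field names are given as strings; the extra fields here are just examples):

(with-eval-after-load 'parsebib
  (dolist (field '("abstract" "keywords"))
    (add-to-list 'parsebib-clean-TeX-markup-excluded-fields field)))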

In fact, given the resource-intensiveness, which I didn't think about before, it may actually make sense to make it an inclusion list. @bdarcus what do you think?

bdarcus commented 2 years ago

given the resource-intensiveness, which I didn't think about before, it may actually make sense to make it an inclusion list. @bdarcus what do you think?

I think probably yes also.

I'd be interested in input also from others on this thread though.

@roshanshariff, for example, wrote the new cache code, and has been doing all the performance tuning on our side.

aikrahguzar commented 2 years ago

One possibly complicating factor is that the replacement in this case needs to be a function, which would need to look up the actual replacement string in an alist. That may slow things down again.

Since the regexp is built from an alist and we know the order of the subexpressions, the replacement function can loop over the alist instead of doing a lookup for each match. I think that shouldn't be too bad.
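
A very rough sketch of that, just to make the idea concrete. It assumes the individual patterns contain no capture groups of their own, so the Nth top-level group corresponds to the Nth alist entry; the real patterns would need some extra group-number bookkeeping:

(defun my/clean-markup-one-pass (string alist)
  "Apply all (REGEXP . REPLACEMENT) pairs in ALIST to STRING in one scan."
  (let ((combined (mapconcat (lambda (pair) (concat "\\(" (car pair) "\\)"))
                             alist "\\|")))
    (replace-regexp-in-string
     combined
     (lambda (_match)
       ;; Walk the alist in order; the group that participated in the
       ;; match tells us which replacement to use.
       (let ((group 1) replacement)
         (dolist (pair alist replacement)
           (when (and (not replacement) (match-beginning group))
             (setq replacement (cdr pair)))
           (setq group (1+ group)))))
     string t t)))

;; (my/clean-markup-one-pass "foo --- bar" '(("---" . "—")))  => "foo — bar"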

Yeah, I added a commit a few days ago that provides an exclusion list. Currently, it only contains file, doi and url, but it may make sense to add more fields to it.

In fact, given the resource-intensiveness, which I didn't think about before, it may actually make sense to make it an inclusion list. @bdarcus what do you think?

I think it makes sense to allow regexps to be specific to a field, so something like an alist where the car is a list of fields and the cdr is an alist of regexps and replacements. I am not sure that is the best design, but I think it makes sense to be able to target regexps with some specificity.

joostkremers commented 2 years ago

One possibly complicating factor is that the replacement in this case needs to be a function, which would need to look up the actual replacement string in an alist. That may slow things down again.

Since the regexp is built from an alist and we know the order of the subexpressions, the replacement function can loop over the alist instead of doing a lookup for each match. I think that shouldn't be too bad.

Hmm, interesting idea. I'll see if I can implement it.

I think it makes sense to allow regexps to be specific to a field, so something like an alist where the car is a list of fields and the cdr is an alist of regexps and replacements. I am not sure that is the best design, but I think it makes sense to be able to target regexps with some specificity.

I'm not sure that would be very useful. I don't think there are TeX constructs that are (much) more likely to occur in some fields but not in others. Do you have any specific examples in mind?

I do think it makes sense to limit parsebib-clean-TeX-markup to a small set of fields, like author, editor, and title, and perhaps journaltitle and institution.

bdarcus commented 2 years ago

FWIW, I've removed this from the 1.0 milestone, since I don't think this should hold us up from tagging 1.0. It seems most of the code changes will happen on the parsebib side, and in any case won't impact our API.

aikrahguzar commented 2 years ago

I'm not sure that would be very useful. I don't think there are TeX constructs that are (much) more likely to occur in some fields but not in others. Do you have any specific examples in mind?

You are right about this; the only examples I can think of are calligraphic letters, which are likely to occur only in the titles of math papers, and accents, which are more likely in author names (though that is not going to be true in fields where not all papers are in English). My motivation was not actually cleaning markup but using this mechanism as a general way of getting a pretty representation of field values, since it is already very close to that. One example that has come up with regard to citar is treating first/last names in a more consistent manner, which gets harder once the markup is removed; it would make sense to have regexps matching the various possibilities for first/last names just for the author field. But that can also be done by excluding authors from cleanup and then dealing with that field separately.

roshanshariff commented 2 years ago

Sorry, I was out of the loop for a bit. Thanks @aikrahguzar for digging into this more! That would explain why I couldn't initially reproduce the slowdown; my parsebib was too old.

I won't be able to look at the code for a few days, but based on the discussion here, my initial impression is that we shouldn't perform the de-texification as part of parsebib-parse. Instead, we should store the raw strings in the cache and clean them up only in citar-get-display-value (as was the case with the old cleanup function).

This way, code that needs the raw field value (like for the file field) will get it via citar-get-value, whereas e.g. the candidate strings will use the detexified version because they use citar-get-display-value. This also means that only the fields used in the candidate template will get detexified when a bibliography is updated.

This might not be enough to completely solve the performance problem. I have some ideas for further mitigations, but I'll have to look at the code to see what's reasonable. I'll probably get to that on the weekend.

bdarcus commented 2 years ago

The scenario where @mclearc was experiencing pain is a very large file that changes frequently, which will probably never be exactly fast (I still think we need #653 for that).

Wouldn't your suggested changes here merely move where the work is done, but otherwise not improve overall performance?

Or that the improvement would be marginal?

I guess you'll only know when you have a chance to test though.

In any case, it would still help us (and other parsebib consumers) if @joostkremers can find a way to increase the performance of that function.

EDIT: I guess one other potential advantage of keeping :entries unparsed (i.e. more structured data that would then be formatted for display independently) is that it would be more consistent with the shared-cache idea we've previously discussed. But there'd be downsides to that too.

Benchmarks, etc.

BibTeX

Benchmark comparisons at the moment, with a file of fewer than 1,000 entries:

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display t))
(41.728177873 130 23.535983567999992)

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display nil))
(1.5515592139999999 2 0.36761948199999495)

FWIW, I don't see any performance difference in limiting the :fields it parses.

ELISP> (benchmark-run-compiled 10 (parsebib-parse citar-bibliography :display nil :fields (list "title")))
(1.567477618 2 0.3720656209999902)

CSL-JSON

Finally, we do need to remember that we, and parsebib, also support CSL-JSON, which is generally faster to parse than BibTeX and does not have TeX markup to strip (though it does have an informal HTML-like tag system).

Note this file has more entries than the one I use above.

ELISP> (benchmark-run-compiled 10 (parsebib-parse (list "~/org/bib/MyLibrary.json") :display nil))
(0.591864735 1 0.189613928)

ELISP> (benchmark-run-compiled 10 (parsebib-parse (list "~/org/bib/MyLibrary.json") :display t))
(1.693483346 3 0.7157699330000042)

So performance with the JSON is really good, even with :display t.

Here, BTW, is what the values look like with :display nil:

 ((id . "zonszein_2020-amazing-image")
  (abstract . "Amazing image of rally in Tel Aviv now against corrupt Netanyahu and the damage to democracy. Every protestor standing on Xs marking 6 feet apart from one another. https://t.co/Qhd9h6Za6s\" / Twitter")
  (accessed
   (date-parts .
    [[2020 4 19]]))
  (author .
   [((family . "Zonszein")
     (given . "Mairav"))])
  (container-title . "Twitter")
  (issued
   (date-parts .
    [[2020 4 19]]))
  (language . "en")
  (source . "twitter.com")
  (title . "Amazing image of rally in Tel Aviv ...")
  (title-short . "Mairav Zonszein מרב זונשיין on Twitter")
  (type . "webpage")
  (URL . "https://twitter.com/MairavZ/status/1251928834267443201"))

If we go that direction, the cache will need to track what format the raw data is under :props.
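
For reference, pointing citar at a CSL-JSON library is just a matter of listing the .json file, e.g. the one benchmarked above:

(setq citar-bibliography (list "~/org/bib/MyLibrary.json"))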

aikrahguzar commented 2 years ago

The scenario where @mclearc was experiencing pain is a very large file that changes frequently, which will probably never be exactly fast (I still think we need #653 for that).

Wouldn't your suggested changes here merely move where the work is done, but otherwise not improve overall performance?

Or that the improvement would be marginal?

Removing markup is a one-time but expensive operation, and that seems like a very good use case for caching. I think trying to do it on the fly is unlikely to work, since we can't do it lazily just for the entries that are on the screen.

And it is slow for a good reason. So for managing large bibliographies, I think we have to document it, direct people toward having less markup, and perhaps make it possible to do this just for citar by providing a variable that we let-bind.

bdarcus commented 2 years ago

I think trying to do it on the fly is unlikely to work, since we can't do it lazily just for the entries that are on the screen.

FWIW, I believe this is a current limitation of completing-read.

In Helm, and I think also Ivy, they support this via the notion of display transforms, which would be nice to have here.

But this project has always tried to find creative ways to work around completing-read limitations :-)

minad commented 2 years ago

@bdarcus

FWIW, I believe this is a current limitation of completing-read. In Helm, and I think also Ivy, they support this via the notion of display transforms, which would be nice to have here. But this project has always tried to find creative ways to work around completing-read limitations :-)

I don't think so. As soon as you present candidates to the user for completion, they must be in an appropriate form to match against (in your case the parsed bibtex entries converted to a displayable string). Display transformers only work for superficial candidate formatting/transformation, e.g., changing the colors of directories in find-file or modes in M-x.

If you used something like consult--async, you could load the candidates in the background, but I am not sure that would lead to a satisfactory UI. For consult-grep it is more expected that the candidates appear in batches.

There is probably no magic bullet here:

  1. Display the loading progress using a progress reporter.
  2. Optimize the parsing by reducing the amount of allocations and calls to replace-regexp-in-string. Note that you can pass a function as REP argument.
  3. Make creative use of caching, idle loading, etc. The bib is usually not modified that often and calling Citar functions may also not be the first thing you do, so preloading should work?
bdarcus commented 2 years ago

@minad

3. The bib is usually not modified that often and calling Citar functions may also not be the first thing you do, so preloading should work?

Yes, unless you have one very large bib file (in this case, with 7200 entries) that does change frequently. In this case, the OP was seeing load times of 25-30 seconds.

But we've always known that case can only be optimized so much.

Still, some mix of your suggestions (and maybe @aikrahguzar's) for the parsebib function and for citar should help a lot.

roshanshariff commented 2 years ago

I think caching the de-TeXed/prettified fields is reasonable. There only needs to be one universal cache for this, since the rules for transforming field values to their display form don't change. And with some clever use of weak references, you never have to clear this cache (it won't keep growing indefinitely). This should speed up updating bib files on modification, because presumably most of the strings in the bib file will remain the same.
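
For illustration, that universal cache could be as simple as a weak hash table keyed on the raw string (assuming parsebib-clean-TeX-markup takes the raw value and returns the cleaned one; the names here are hypothetical):

;; With :weakness 'key, an entry is dropped once its raw string is no longer
;; reachable, e.g. after the bibliography that produced it is itself evicted.
(defvar my/citar-display-cache (make-hash-table :test #'equal :weakness 'key)
  "Map raw field values to their prettified display form.")

(defun my/citar-display-value (raw)
  "Return the display form of RAW, computing and caching it on first use."
  (or (gethash raw my/citar-display-cache)
      (puthash raw (parsebib-clean-TeX-markup raw) my/citar-display-cache)))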

We still need the clean-up function to be fast so the initial load doesn't take too long, but I think optimizing the regexp search/replace will accomplish that.

I'll be experimenting with this over the next couple of days.

bdarcus commented 2 years ago

I'll be experimenting with this over the next couple of days.

Not to be pushy @roshanshariff, but any update?

If this looks like it may be more involved, I think I'll just do the doom module update now (which is mostly about citar-embark, which also fixes a bug there), and bump it again after.

roshanshariff commented 2 years ago

Yeah, it's going to be a little more involved; there are some niggling design details to get right. Feel free to do the update in the meantime, @bdarcus.

bdarcus commented 2 years ago

The Doom update is merged; it now includes a specific parsebib package! line, so Doom pins a specific version of it.

BTW, @roshanshariff, if you haven't noticed already, @aikrahguzar has a linked draft parsebib PR that speeds up that function pretty dramatically.

It seems it will be a couple of weeks before that can be merged over there (Joost is on vacation), and when it is, it should largely fix this problem.

~So I'm thinking again about just tagging 1.0 here now. Let me know if you have any objections.~ I tagged 1.0.

https://www.reddit.com/r/emacs/comments/wk4dur/citar_10_citarorgroam_doom_biblio_update/

bdarcus commented 2 years ago

Just to update, when I tested @aikrahguzar's draft PR code at https://github.com/joostkremers/parsebib/pull/25 about a week ago, it improved loading performance by about 50%. I'd assume that would get us back to what it was before these big changes.

joostkremers commented 2 years ago

As I mentioned before, I think it might be a good idea to use an inclusion list after all, rather than an exclusion list. For one, it wouldn't result in an unexpected increase in parsing time if a user decides they want to retrieve some additional fields from the .bib file.

So parsebib-clean-TeX-markup-excluded-fields would be replaced with parsebib-clean-TeX-markup-included-fields, and I think it would make sense to only include the fields author, editor, title, journal and journaltitle in it.

If I don't get any complaints, remarks or comments, I'll make this change some time next week.

bdarcus commented 2 years ago

Joost merged this PR, which should fix this issue.

https://github.com/joostkremers/parsebib/pull/25

Please test and report back if not.