Consider programmed completion instead of cache

bdarcus commented 1 year ago

Basic idea

Ihor (see below) suggested using completing-read programmed completion instead of a cache, to avoid performance problems with very large reference libraries.

I think the idea would be to parse and format completions incrementally.

This would obviously be a major change, but I believe it might be so simple, to start with, as allowing the citar-select-ref completion table ("collection" in the completing-read API) to be configurable, though I don't know ATM what other implications that may have, and I can't figure out how best to modify the current code.

Requirements

From a user POV, things shouldn't really change; performance of the UI should just get better and, in particular, more scalable.

Specifically, we need to maintain the current ability to:

seamlessly support multiple global and local bib files, in multiple formats
have the "has" affix live update

.... but to add:

ability to use different sources of data, such as #397, which would use an independent cache; so one should be able to mix bib files that use the org cache to parse and present completion data, with those that use parsebib with bibtex/bibltex/csl-json as now.

Example code

Minimal demos

An example of a very simple programmed completion table, that is not actually dynamic.

(defun my-completion-function (prefix)
  ;; You can just ignore the prefix
  '("Chewbacca" "C3-PO" "Calrissian"))

(completing-read "Chose one: "
                 (completion-table-dynamic #'my-completion-function))

And from the docs for completion-table-dynamic:

(completing-read
 "> "
 (completion-table-dynamic
  (lambda (s)
    (list (concat s "123")
          (concat s "456")))))

Full-featured, highly-performant, example

See org-ql-completing-read, and org-ql-find.

From Ihor

This (caching) is not good enough for really large Org files. Caching everything will be slow no matter what you do given sufficiently large file.

A much faster approach is when search matches are built dynamically: (1) Use re-search-forward of the search term to collect the potentially matching headlines; (2) Limit the number of matches to few hundreds. This is from my experience adapting org-ql to give real-time search for 20Mb Org file. Note that helm-org that is using cache-everything approach is unusable in such scenarios.

Originally posted by @yantar92 in https://github.com/emacs-citar/citar/issues/397#issuecomment-1226953115

That one is just 7.5k bibtex entries. The regexp search + limit matches approach I suggested gives real-time responsiveness on 31k Org headings in my personal notes. So, you can, in fact, get to the snappy performance (with extremely large files).

Originally posted by @yantar92 in https://github.com/emacs-citar/citar/issues/397#issuecomment-1227102683

aikrahguzar commented 1 year ago

I though this was an interesting idea so I whipped up a prototype here https://github.com/aikrahguzar/citar/blob/main/citar-no-cache.el

It works well enough, not as snappy as using the cache but is responsive even for the 6000 entry bibliography. It has its oddities due to searching before parsing. For example if you search for aut you are not going to get any matches because all the matches are going to be for author and these are going to get filtered. Similarly matches from fields that are not in the template also get filtered out.

I am not going to pursue it much farther than this but I will leave it here in case someone wants to use it as a starting point.

bdarcus commented 1 year ago

I thought this was an interesting idea so I whipped up a prototype here https://github.com/aikrahguzar/citar/blob/main/citar-no-cache.el

It works well enough, not as snappy as using the cache but is responsive even for the 6000 entry bibliography. It has its oddities due to searching before parsing. For example if you search for aut you are not going to get any matches because all the matches are going to be for author and these are going to get filtered.

From playing with other implementations, that seems the unavoidable downside of this approach.

We've previously discussed possible customization options relating to performance; perhaps this ends up being one of them.

I am not going to pursue it much farther than this but I will leave it here in case someone wants to use it as a starting point.

Thanks.

If you prefer, I can also open a branch here to put it. I'm agnostic.

Hey - can you review #683 if you have a bit of time?

yantar92 commented 1 year ago

It works well enough, not as snappy as using the cache but is responsive even for the 6000 entry bibliography

It won't be snappy unless you make use of QUERY during search.

aikrahguzar commented 1 year ago

If you prefer, I can also open a branch here to put it. I'm agnostic.

Up to you, I don't have any git preferences usually except not wanting to interact too much with it.

Hey - can you review #683 if you have a bit of time?

I will take a look a little later.

aikrahguzar commented 1 year ago

It won't be snappy unless you make use of QUERY during search.

That is pretty difficult to do well because of the completion styles. They don't give us a regex to run and any assumptions on the completion style can lead to bad results. For example running a regexp assuming basic when the actual style is orderless will lead to most results not appearing. A split query like consult does for async completions can work well, i.e there is an initial segment which is used to generate matches using some simple regex and another one that is filtered using full completion styles.

Edit: Actually the situation is pretty bad even now. I don't really know how to handle a query like "^foo" since what is at the start depends on the template and is not going to be at the start in the bibtex entry. Probably a split query is the most sensible thing to do.

yantar92 commented 1 year ago

That is pretty difficult to do well because of the completion styles.

If you allow arbitrary completion styles then you indeed cannot do much about search optimizations. The best is probably completion-table-with-cache. Note that I suggested the approach that is used by org-ql, which is using a specific custom completion style syntax.

Actually the situation is pretty bad even now. I don't really know how to handle a query like "^foo" since what is at the start depends on the template and is not going to be at the start in the bibtex entry

Sure. I do not think that using default completion style makes much sense here. The completion style generally works with the completion table entries; their format is probably arbitrary and optimized for display, not for the completion. I'd just go with a custom completion style here will separate customization to change it.

aikrahguzar commented 1 year ago

Sure. I do not think that using default completion style makes much sense here. The completion style generally works with the completion table entries; their format is probably arbitrary and optimized for display, not for the completion. I'd just go with a custom completion style here will separate customization to change it.

I tired something like this and now it is faster. Basically the first word of the query is used to find matches and I put checks in place to make sure the keys like "author" don't generate matches, only the values. Completions styles are used for the whole query string to further narrow the matches. This works pretty well enough even though the fields aren't going to be parsed can still generate matches but that isn't as big a problem I think.

bdarcus commented 1 year ago

This is more a polish issue not important ATM, but might be nice if it loaded, say, the last 100 selected references from history initially. I find it a little disconcerting that running citar-no-cache-select-ref doesn't do anything before I start typing.

@jdtsmith - you might want to play with this from @aikrahguzar?

https://github.com/aikrahguzar/citar/blob/main/citar-no-cache.el

Seems there's an issue with unicode?

aikrahguzar commented 1 year ago

This is more a polish issue not important ATM, but might be nice if it loaded, say, the last 100 selected references from history initially. I find it a little disconcerting that running citar-no-cache-select-ref doesn't do anything before I start typing.

To me it shows the first entries it finds. I did push a change that caused an infinite loop but that is resolved now. Maybe try again and see if you see some entries initially.

The problem with getting entries from history would be that there won't be an actual entry attached to them.

aikrahguzar commented 1 year ago

Seems there's an issue with unicode?

Very strange, since the actual parsing of the entry is still by parsebib.

Edit, I have no exactly matched the arguments to parsebib-parse-bib-buffer with those passed by parsebib-parse. Maybe that helps? I don't see any real reason to but it at least fixed the problem with quoted strings, I saw.

jdtsmith commented 1 year ago

Took a quick peek: this seems harder to implement and more error-prone for org-bibtex, since that can have "arbitrary data" outside the :PROPERTIES: drawer.

I guess you could also describe using org-element-cache as "cacheless citar" (for org-bibtex), since it's just re-using the cache org was asked to keep and maintain. If final candidate formatting is the limiting step as you mentioned, using an approach like marginalia would be good: just-in-time annotation-function with a single-session cache of candidate -> formatted candidate. Possibly could even add a 'bibrecord category to marginalia for this purpose.

yantar92 commented 1 year ago

Took a quick peek: this seems harder to implement and more error-prone for org-bibtex, since that can have "arbitrary data" outside the :PROPERTIES: drawer.

You can limit regexp matching to node-properties. This will cut off most of the matches inside headline conents.

jdtsmith commented 1 year ago

Good point. The other problem I see with this approach is it "bakes in" the completion style into the type of regexp in-buffer search being used to find candidates. All the completion styles rely on having the results of all-completions available to do partial-completion, orderless, flex, etc. Orderless in particular is blazingly fast for huge candidate lists because it leverages completion-regexp-list to build complex regexp-based selections in C code.

Here's an approach that seems sensible for org-bibtex files:

Open them if they are not yet opened.
In idle time, pre-warm the org-element cache by doing
(org-element-cache-map 'org-bibtex-extra-parse :next-re org-bibtex-extra-pname :from-pos start :to-pos end) across each org-bibtex buffer. It's ok if things are being edited in the midst of this. The cache is self-repairing.
When citar is invoked, quickly build the candidate list from the pre-warmed org-element-cache. Recall this will be fast even after buffer edits.
Give the list to completing-read, with caching annotation support ala marginalia.

jdtsmith commented 1 year ago

One thing that isn't clear to me: how does citar currently handle merging all the information for a given bib-record, like title, keywords, author, etc., so that all that info can be searched together as if it was just a single candidate?

bdarcus commented 1 year ago

See citar-cache--entries:

https://github.com/emacs-citar/citar/blob/731c0ae81aecca94f7dcc4f2dad51a0cd1c3de68/citar-cache.el#L108-L113

It's one of those details I'm unclear how would be handled in this approach.

bdarcus commented 1 year ago

with caching annotation support ala marginalia

Not following this part. We don't use annotation; we use affixation, and only for the "has" prefix.

And that has to be real-time.

Everything else related to candidate formatting is cached in the :preformatted hash-tables already.

But it's why the first load is slower.

jdtsmith commented 1 year ago

Sorry, now I'm not following, and figure I better before I answer. In a completing-read situation with some references showing like this:

is it the case that the "candidate strings" passed to completing-read include all of the formatted author+year+title+bibcode+type+keywords text? And only the flags on the LHS (none here) are added as a prefix? If that's the case, how do you then recover the bib record from the strings that completing-read returns, which might look like:

Sandage                            1961     The Hubble Atlas of Galaxies                              1961hag..book..    BOOK

By parsing them for their bibcodes?

Update: by tracing completing-read, I see your trick: you place the bibcode at the beginning of the string then make it invisible.

bdarcus commented 1 year ago

is it the case that the "candidate strings" passed to completing-read include all of the formatted author+year+title+bibcode+type+keywords text?

Yes, along with some hidden text.

And only the flags on the LHS (none here) are added as a prefix?

Correct.

If that's the case, how do you then recover the bib record from the strings that completing-read returns.

By looking up the candidate string in the hash-table returned by citar--format-candidates, which returns the citation key used everywhere else in the API to identify a bib record/reference.

https://github.com/emacs-citar/citar/blob/731c0ae81aecca94f7dcc4f2dad51a0cd1c3de68/citar.el#L550

Does that answer your question?

jdtsmith commented 1 year ago

Thanks, I understand better now. I guess the tricky bit here is that unlike say dabbrev, or file-completion, or command-completion, "what the completion text should be" is a bit harder to pin down for a bib record. Just the bibcode? Author(s) + Title? Author+title+year? Keywords? So I do understand this choice.

In terms of caching annotations, that would be relevant if most of the text in the mini-buffer were annotation data, not candidate strings. For example, the candidate string could be (hidden bibcode +) author + title, then everything else is just nicely formatted window-dressing, that can be computed just-in-time for the visible entries, and temporarily cached for good speed scrolling through the list. This is how most consult-* + marginalia stuff works. This would save you the time pre-computing all the formatting, but at the cost of less "match surface". Quite distinct from the current citar approach though.

bdarcus commented 1 year ago

Yes, but the problem with doing that is users then reasonably complain they can't see what data is matching.

Ivy and helm have display transforms (look at what the candidate strings bibtex-completion look like; it's very different than how they're displayed), where you can do this without losing that. But not so in completing-read. And there's debate about whether it even should.

Annotations are purely window dressing; their content can't be completed against.

jdtsmith commented 1 year ago

Yes that's definitely a limitation of the window-dressing approach. One major advantage (in my view) of org-bibtex, is you have much richer searching apparatus than just mini buffer-completion at your finger tips, e.g. with org-agenda style matching language (stuff like +YEAR>2010+AUTHOR={Belv.*e}-TITLE={stellar}).

emacs-citar / citar