Limit re-fetches of the Zotero library

retorquere commented 4 years ago

It looks like the number of times citr requests the full library from BBT can be optimized. For large libraries this should yield a performance improvement.

I'm open to adding an endpoint in BBT that would allow testing whether the library has changed since the last fetch, but to do this effectively, I must understand what triggers a re-read of the BBT-produced bib file, and whether it's cached on the citr end.

crsh commented 4 years ago

That would be great; the performance of that connection does indeed leave room for improvement. ;)

Currently, the bib-file exposed by BBT is only read on-demand if the user connects to BBT for the first time or subsequently requests to reload the library (e.g., because of modified or added references). This gif may give you a rough idea (notice the Reload libraries action link). In between manual requests the library is cached in R and accessed directly. Does that help?

Is it possible that the multiple requests you are seeing target different libraries (main library and group libraries)?

Full disclosure, I'm currently exploring whether a different approach to searching Zotero may be better (in short, BBT CAYW search, zotxt, and pandoc-zotxt). I'm not sure whether this is feasible, though.

retorquere commented 4 years ago

The behavior described here sounds reasonable and then I'd see no reason to change anything, but @jrennstich describes clicking connect once, and I see 3 requests for the full library, all for the main user library. I've looked in his DB and there are no groups set up in the copy I have.

The problem for him is exacerbated by a yet-unfixed problem that full library requests take unreasonably long -- this is on me to fix, but his computer took an unfortunate moment (always unfortunate for @jrennstich of course, but unfortunate in the sense that I don't like having open unsolved problems on my plate) to demand repairs.

WRT speeding up bib access -- I've done some recent (5.2.X) performance work that should make fetches substantially less painful, but perhaps not enough for your use-case. BBT exports are relatively heavyweight, and even with a fully filled cache, 24k items take 10-15 seconds to lay out on disk.

pandoc-zotxt should work I think. I can't see why I'd object to this -- BBT is good at solving some problems, not others, and I hold no illusions on how speedy it is 🙄 .

Another option would be to expose an endpoint where citr tests whether an auto-export has been set up for a specific path, and sets one up if not. That would fully decouple the two while keeping the cooperation in place; the potential problem is that you would have to detect when the file on disk changes. The write to the file by BBT is atomic (I write to a temp file and once done it is renamed to the target) so you'd not get partial results, but still. OTOH, in that connect screen it shouldn't be too hard to detect that the file time has been updated since the last check.
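As a rough illustration of the auto-export idea above, here is a minimal Python sketch of both halves: a writer that replaces the file atomically via a temp file and rename, and a reader that re-reads the file only when its modification time has changed. The names `atomic_write` and `CachedBibFile` are hypothetical and come from neither BBT nor citr; the sketch also assumes the temp file lives on the same filesystem as the target, which is what makes the rename atomic.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


class CachedBibFile:
    """Hypothetical reader-side cache: hit the disk only when mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._text = None
        self.reads = 0  # number of actual disk reads, for illustration only

    def get(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as handle:
                self._text = handle.read()
            self._mtime_ns = mtime_ns
            self.reads += 1
        return self._text
```

One caveat with this approach: on filesystems with coarse timestamps, two writes in quick succession can share a modification time, so comparing the file size as well may be a sensible addition.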

crsh commented 4 years ago

> The behavior described here sounds reasonable and then I'd see no reason to change anything, but @jrennstich describes clicking connect once, and I see 3 requests for the full library, all for the main user library. I've looked in his DB and there are no groups set up in the copy I have.

Hmm, I'll have to dig into this. Unfortunately, I'm completely swamped right now and won't get around to it before April.

> Another option would be to expose an endpoint where citr tests whether an auto-export has been set up for a specific path, and sets one up if not. That would fully decouple the two while keeping the cooperation in place; the potential problem is that you would have to detect when the file on disk changes. The write to the file by BBT is atomic (I write to a temp file and once done it is renamed to the target) so you'd not get partial results, but still. OTOH, in that connect screen it shouldn't be too hard to detect that the file time has been updated since the last check.

This also sounds like a useful solution to decouple reloading the bibliography from the addin. Checking when the file changed on disk should be easy enough. Do you think an additional speed-up could be gained from supporting CSL JSON rather than relying on BibTeX, as suggested in https://github.com/crsh/citr/issues/59?

retorquere commented 4 years ago

I thought CSL JSON was going to easily beat the TeX export formats on speed, but that turns out to be false at the moment. For context, my CSL exporters do barely anything but re-use the existing Zotero CSL converters, but the combination looks to be slower than BBT TeX, which is strange, because the cold-cache version does a lot less than the TeX formats, and the hot-cache scenario should simply be the same, roughly. In any case, there are still benefits to using CSL:

- much easier, more reliable, and faster to parse for stuff like citekeys, titles, etc.
- if you can use CSL in your pipeline then you're probably using either citeproc or pandoc, and in both cases using .bib as a format is actually undesirable -- most likely you're translating zotero -> bibtex -> csl -> bibliography anyhow, and each step before the bibliography is lossy. Much better to just go zotero -> csl -> bibliography
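To make the parsing point concrete, here is a small Python sketch that pulls keys and titles out of a CSL JSON export with nothing but the standard library. It assumes the export is a JSON array of items and that each item's `id` field holds the citation key; the sample data and the helper name `index_csl_items` are made up for illustration.

```python
import json


def index_csl_items(csl_json_text):
    """Map each item's id to its title, given a CSL JSON array as text."""
    items = json.loads(csl_json_text)
    # "id" is required by CSL JSON; "title" may be absent on some items.
    return {str(item["id"]): item.get("title", "") for item in items}


# Made-up sample data, not taken from any real library.
sample = """[
  {"id": "doe2020", "type": "article-journal", "title": "An example article"},
  {"id": "roe2019", "type": "book", "title": "An example book"}
]"""

index = index_csl_items(sample)
print(sorted(index))
```

Doing the same for a .bib file would need a BibTeX parser that copes with nested braces, string macros, and TeX escapes, which is where most of the fragility comes from.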

Simple, non-scientific test: export of 24k items:

Better BibTeX, cold cache: 120s
Better BibTeX, hot cache: 17s
Better CSL JSON, cold cache: 278s
Better CSL JSON, hot cache: 41s
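
Put as ratios, the figures above say CSL JSON took roughly 2.3 to 2.4 times as long as BibTeX in both the cold and hot case, and a hot cache was about 7 times faster than a cold one for either format. A short Python sketch of that arithmetic follows, treating "24k" as exactly 24,000 items, which is an assumption:

```python
# Timings in seconds, copied from the test above.
timings = {
    "Better BibTeX": {"cold": 120, "hot": 17},
    "Better CSL JSON": {"cold": 278, "hot": 41},
}
ITEMS = 24_000  # assumption: "24k" taken as exactly 24,000


def ratio(a, b):
    """Return a / b rounded to two decimals."""
    return round(a / b, 2)


# How much slower CSL JSON is than BibTeX, per cache state.
csl_vs_bibtex = {
    cache: ratio(timings["Better CSL JSON"][cache], timings["Better BibTeX"][cache])
    for cache in ("cold", "hot")
}

# How much a hot cache helps, per format.
cold_vs_hot = {name: ratio(t["cold"], t["hot"]) for name, t in timings.items()}

# Hot-cache cost per item in milliseconds.
hot_ms_per_item = {
    name: round(1000 * t["hot"] / ITEMS, 2) for name, t in timings.items()
}

print(csl_vs_bibtex, cold_vs_hot, hot_ms_per_item)
```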

I'm going to look into the performance problem with CSL. This should not be the case.

crsh commented 4 years ago

Thanks for the benchmark, that's interesting.

I agree using JSON would avoid lossy conversion between formats. I currently use BibTeX because it works with pandoc-citeproc but also with biblatex or natbib, which some users prefer. In this sense, it's a format that's applicable to a wider set of use cases that I have come across.

retorquere commented 4 years ago

We've been able to implement some substantial speedups in https://github.com/retorquere/zotero-better-bibtex/issues/1389; I'm doing some tidying up, and then I'll cut a new release in the next few days. But I'm still open to creating an endpoint that citr can talk to in order to set up an auto-export automatically.

It'd also be possible to create an endpoint to query for collections so that the entire library doesn't need to be fetched, which would net a performance benefit but would make the UI on the citr side more involved.

retorquere commented 4 years ago

> I agree using JSON would avoid lossy conversion between formats. I currently use BibTeX because it works with pandoc-citeproc but also with biblatex or natbib, which some users prefer. In this sense, it's a format that's applicable to a wider set of use cases that I have come across.

I don't mean to be (too) pedantic about this, but that's a flexibility win at the cost of a quality loss.

retorquere commented 4 years ago

The CSL performance issue has been fixed in 5.2.16.

crsh commented 4 years ago

Thanks, I'll take a look the next chance I get!

crsh / citr

Limit re-fetches of the Zotero library #58