Better approaches to importing of the Zotero database

Robinlovelace commented 4 years ago

It's frustrating when citr freezes your session, so I thought I'd have a play with the future package. Results seem promising so far, so I thought I'd report back, having alluded several months ago to the potential utility of running the initial bib read in the background. The basic concept is demonstrated in the reprex below. Thoughts welcome!


# exclude things to make reprex faster
exclude = c("My Library", "energy-and-transport")

# no future
tictoc::tic()
b = citr::load_betterbiblatex_bib(encoding = "UTF-8", exclude_betterbiblatex_library = exclude)
#> Importing 'LIDA-leeds'...
#> Importing 'tds'...
plot(1:9)
tictoc::toc()
#> 0.58 sec elapsed
tictoc::tic()
# do some other work
class(b)
#> [1] "BibEntry" "bibentry"
tictoc::toc()
#> 0.002 sec elapsed

# with future
tictoc::tic()
future::plan("multiprocess")
b = future::future(citr::load_betterbiblatex_bib(encoding = "UTF-8", exclude_betterbiblatex_library = exclude))
plot(1:9)

tictoc::toc()
#> 0.085 sec elapsed
tictoc::tic()
# do some other work
b = future::value(b)
#> Importing 'LIDA-leeds'...
#> Importing 'tds'...
class(b)
#> [1] "BibEntry" "bibentry"
tictoc::toc()
#> 0.322 sec elapsed

Created on 2019-10-16 by the reprex package (v0.3.0)
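To make the background-loading idea in the reprex above easier to try without a running Zotero instance, here is a minimal sketch. It assumes a hypothetical stand-in function, `slow_load()`, in place of `citr::load_betterbiblatex_bib()`, and it uses the `multisession` plan rather than `multiprocess`, which later releases of future deprecated. `future::resolved()` lets the session check on the import without blocking, while `future::value()` blocks until the result is available.

```r
library(future)

# Run futures in background R sessions
plan(multisession, workers = 2)

# Hypothetical stand-in for the slow call to citr::load_betterbiblatex_bib()
slow_load <- function() {
  Sys.sleep(1)
  c("entry1", "entry2")
}

# Start the import in the background; this call returns straight away
f <- future(slow_load())

# Non-blocking check: TRUE once the background import has finished
is_ready_now <- resolved(f)

# Blocking retrieval: waits for the import if it is still running
bib <- value(f)

# Shut down the background workers
plan(sequential)
```

An addin could poll `resolved()` and only call `value()` once it returns `TRUE`, so the session never freezes on the import.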

Robinlovelace commented 4 years ago

As a follow-on point, I've just tested out parsing files with the bib2df package and it seems fast.

Timings below on a .bib file with 2000+ entries, FYI.

> system.time({b = bib2df::bib2df("allrefs.bib")})
Some BibTeX entries may have been dropped.
            The result could be malformed.
            Review the .bib file and make sure every single entry starts
            with a '@'.
Column `YEAR` contains character strings.
              No coercion to numeric applied.
   user  system elapsed 
  2.098   0.003   2.112 
Warning message:
In bib2df_tidy(bib, separate_names) : NAs introduced by coercion
> nrow(b)
[1] 2755
> system.time({b2 = citr:::read_bib_catch_error("allrefs.bib")})
<simpleError in RefManageR::ReadBib(x, check = FALSE, .Encoding = encoding): argument "encoding" is missing, with no default>
   user  system elapsed 
  0.108   0.000   0.108 
> system.time({b2 = citr:::read_bib_catch_error("~/uaf/allrefs.bib", "UTF-8")})
   user  system elapsed 
  7.179   0.093   7.272

Robinlovelace commented 4 years ago

Update: FYI, I think the output from bib2df is not production-ready yet. Just food for thought...

crsh commented 4 years ago

Hi Robin, thanks for sharing your results. This is actually one of the top two issues I want to tackle next. This looks promising.

Here are some of my thoughts on this. I think there are two major options here to speed up reading from Zotero:

1. Improve the current approach: speed up the reading of the bibliography file exposed by BBT by trying bib2df, and use future or promises to load the database in the background.

   Have you, by chance, looked at promises? They seem to be an alternative to future, but I haven't fully understood the strengths of each approach to decide which way to go on this. bib2df also looks like a promising alternative to RefManageR and bibtex!

2. Search the Zotero database directly by using the BBT CAYW search (see below) and require users to use the pandoc-zotxt Lua filter with their R Markdown document format (e.g., using rmdfiltr). However, if I understand correctly, this would require installation of zotxt, another Zotero plugin.

I haven't tried zotxt and pandoc-zotxt, but if the bibliography export is fast(er than BBT), this could be the easiest and fastest way to address slow loading of the Zotero database. Hence, I'm leaning towards the second option. This would require some testing and some user interface considerations (would this be a separate addin or could it be integrated with the existing one?).
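For the first option, the difference between future and promises may be easier to judge from a small sketch. With promises, the result is delivered to a callback instead of being fetched with `value()`. The example below is only an illustration: `slow_load()` is a hypothetical stand-in for the Zotero import, and the manual `later::run_now()` loop is needed only in a non-interactive script, on the assumption that an idle interactive session runs pending callbacks by itself.

```r
library(future)
library(promises)

plan(multisession, workers = 2)

# Hypothetical stand-in for the slow Zotero import
slow_load <- function() {
  Sys.sleep(0.5)
  c("entry1", "entry2")
}

bib <- NULL
failure <- NULL

# Start the import in the background and wrap it in a promise
p <- future_promise(slow_load())

# Register callbacks that run once the import succeeds or fails
handled <- then(
  p,
  onFulfilled = function(value) bib <<- value,
  onRejected = function(err) failure <<- err
)

# In a plain script, run pending callbacks by hand until one has fired
deadline <- Sys.time() + 30
while (is.null(bib) && is.null(failure) && Sys.time() < deadline) {
  later::run_now(timeoutSecs = 0.1)
}

plan(sequential)
```

The callback style fits an addin that should update its reference list whenever the import finishes, whereas a bare future suits code that asks for the result at a known point.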

Just to link to the previous issue on background loading of the Zotero database: https://github.com/crsh/citr/issues/36

Robinlovelace commented 4 years ago

I've not tried promises. In my experience bib2df is buggy. All approaches sound good; I'm excited for this new behaviour and happy to test anything you come up with. Many thanks.

crsh commented 4 years ago

After playing around with pandoc-zotxt a little, I've come to understand that it requires the global pandoc variable PANDOC_STATE, which was introduced in pandoc 2.4. RStudio currently ships version 2.3.1, so I'll wait until they ship a newer version before starting to implement and test this.
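A quick way to see whether a given setup already meets that requirement is to ask rmarkdown which pandoc it finds. This is a small sketch rather than anything citr does; the variable names are illustrative, and the 2.4 threshold is taken from the comment above.

```r
# Minimum pandoc version that provides PANDOC_STATE, per the comment above
min_pandoc <- "2.4"

# TRUE only if rmarkdown finds pandoc and it is at least this version
has_new_pandoc <- rmarkdown::pandoc_available(min_pandoc)

if (has_new_pandoc) {
  message("pandoc ", as.character(rmarkdown::pandoc_version()), " found")
} else {
  message("pandoc >= ", min_pandoc, " not found")
}
```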

crsh / citr

Better approaches to importing of the Zotero database #55