Revise citar-file-parser-triplet to allow commas in filenames

bdarcus commented 2 years ago

Apparently commas do occur in filenames, and when they do, this parser breaks.

So we should figure out how to make it more robust.

roshanshariff commented 2 years ago

I'm looking deeper into the file field parsing as part of the new cache design (#634). I've looked at the file fields produced by different programs:

- Zotero with Better BibTeX produces semicolon-separated file fields with absolute file paths.
- Recent Zotero with its built-in BibTeX exporter (#578) produces ;-separated file fields; each component is a :-separated triplet containing:
  1. File "title" (as shown in the Zotero UI). Any : or ; characters are backslash-escaped.
  2. Either an absolute file path or a path relative to the bib file, depending on whether the "include files" option is enabled when exporting.
  3. The file MIME type (like application/pdf or html).
- Calibre's bib catalog export produces ,-separated file fields (not semicolon!); each component is a :-separated triplet with:
  1. Empty string for the "title".
  2. Absolute file path.
  3. Calibre format string like PDF or EPUB.
- Ebib uses a list of files separated by a customizable separator, ; by default (#599).
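
As a rough illustration of the Zotero-style triplet format described above, here is a Python sketch of the splitting logic only (it is not Citar's Emacs Lisp code). The function names and the sample field are hypothetical, and the sketch assumes a backslash escapes whichever character follows it:

```python
import re


def split_unescaped(text, sep):
    """Split text at each sep that is not part of a backslash escape."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape pair intact for now; unescaping happens later.
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Remove the backslash in front of any escaped character (assumed convention)."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_triplet_field(field):
    """Parse ';'-separated entries, each a 'title:path:type' triplet."""
    entries = []
    for component in split_unescaped(field, ";"):
        pieces = split_unescaped(component, ":")
        if len(pieces) != 3:
            continue  # not a triplet; leave it for some other parser
        title, path, mime = (unescape(p) for p in pieces)
        entries.append({"title": title, "path": path, "mime": mime})
    return entries


# Hypothetical field: an escaped colon in the title, a comma in the filename.
sample = (r"Full Text\: Part 1:files/12/Smith, 2020.pdf:application/pdf"
          r";Snapshot:files/13/page.html:text/html")
entries = parse_triplet_field(sample)
```

Because only ; and : carry structural meaning here, a comma inside a filename passes through untouched.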

I don't have Mendeley or PaperPile (etc.), but it would be good to know what conventions they use.

It seems like a tricky problem to try to handle all these automatically. Calibre, especially, is an outlier with its ,-separated fields. Perhaps that can be handled by a different parser that needs to be enabled?

bdarcus commented 2 years ago

> It seems like a tricky problem to try to handle all these automatically. Calibre, especially, is an outlier with its ,-separated fields. Perhaps that can be handled by a different parser that needs to be enabled?

Makes sense.

We already have citar-file-parser-functions. We could just set those to more robust functions that cover a larger share of users, and require user configuration for anything else.

Or force users to choose, with a simpler defcustom?

roshanshariff commented 2 years ago

I like the idea of making the default parser more robust, but requiring users to add more citar-file-parser-functions to handle oddball formats.
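
To make the idea concrete, here is a Python sketch (not Citar's actual Emacs Lisp implementation) of a list of parser functions in which a semicolon-based default is always enabled and a Calibre-style comma parser is opt-in. All names and sample fields are hypothetical, and the sketch assumes the results of every enabled parser are simply merged:

```python
import re


def _split(text, sep):
    # Split at separators not preceded by a backslash. Simplification:
    # an escaped backslash directly before a separator is not handled.
    return re.split(r"(?<!\\)" + re.escape(sep), text)


def _unescape(text):
    # Assumed convention: a backslash escapes the character after it.
    return re.sub(r"\\(.)", r"\1", text)


def _parse_triplets(field, sep):
    """Return the path of each 'title:path:type' triplet found in field."""
    paths = []
    for entry in _split(field, sep):
        pieces = _split(entry, ":")
        if len(pieces) == 3:
            paths.append(_unescape(pieces[1]))
    return paths


def parse_semicolon_triplets(field):
    """Default-style parser: entries separated by ';'."""
    return _parse_triplets(field, ";")


def parse_calibre(field):
    """Opt-in parser: Calibre-style entries separated by ','."""
    return _parse_triplets(field, ",")


def parse_file_field(field, parsers):
    """Run every enabled parser and collect the distinct paths in order."""
    found = []
    for parser in parsers:
        for path in parser(field):
            if path not in found:
                found.append(path)
    return found


zotero_field = "Title:/z/Smith, 2020.pdf:application/pdf"
calibre_field = ":/books/Jane Doe/Title.pdf:PDF,:/books/Jane Doe/Title.epub:EPUB"

default_only = [parse_semicolon_triplets]
with_calibre = [parse_semicolon_triplets, parse_calibre]
```

With only the default parser, the Calibre field yields nothing because it does not split into triplets; enabling parse_calibre recovers both paths. The Zotero field with a comma in the filename still parses correctly in both configurations, because the comma-based split produces no valid triplets.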

enbrown commented 1 year ago

Just for reference, Paperpile's files are stored without colons in the filenames. Here are some representative examples:

"All Papers/A/Aalen et al. 2010 - History of applications of martingales in survival analysis.pdf"
"All Papers/A/Ahmed et al. 2008 - Time-Varying Networks - Recovering Temporally Rewiring Genetic Networks During the Life Cycle of Drosophila melanogaster.pdf"
"All Papers/A/Ammari et al. 2014 - Mathematical modeling in full-field optical coherence elastography.pdf"
"All Papers/A/Arlot and Massart 2008 - Data-driven calibration of penalties for least-squares regression.pdf"
"All Papers/A/Armstrong 2012 - Non-detection of the Tooth Fairy at Optical Wavelengths.pdf"
"All Papers/B/Baayen et al. 2015 - Out of the Cage of Shadows.pdf"
"All Papers/B/Baghaie et al. 2014 - Sparse And Low Rank Decomposition Based Batch Image Alignment for Speckle Reduction of retinal OCT Images.pdf"
"All Papers/B/Baghaie et al. 2014 - State-of-the-Art in Retinal Optical Coherence Tomography Image Analysis.pdf"
"All Papers/B/Baillie 2008 - Summing the curious series of Kempner and Irwin.pdf"
"All Papers/B/Betancourt 2013 - Generalizing the No-U-Turn Sampler to Riemannian Manifolds.pdf"

When multiple files are associated with a reference, they are separated by a ; and all have the same directory prefix. I did find a few references that included semicolons in the article titles which were then included in the filenames. One was a 2018 paper whose title included a ;-delimited list: the file was All Papers/K/Kato et al. 2018 - Agreement among Goldmann applanation tonometer, iCare ... rs; non-contact tonometer; and Tonopen XL in healthy elderly subjects.pdf. Yikes!

So it seems reasonable that someone could write a function for citar-file-parser-functions that understands the Paperpile directory structure and uses it to split the file fields. (A minor complication is that Paperpile will, in the future, allow users to customize the organization and directory structure of their libraries.)

roshanshariff commented 1 year ago

@enbrown, thanks for looking into this! If I understand correctly, PaperPile uses semicolons to separate the filenames in the file fields, but the filenames themselves may also contain semicolons? You're suggesting using the directory prefix to distinguish the field-separator semicolons from the in-filename semicolons, but this strikes me as very brittle...

If you're a PaperPile user, perhaps you could consider filing a bug to get the developers to escape semicolons and other special characters in filenames? So, when writing the file field in a BibTeX bibliography, special characters (at least semicolons and possibly also colons) should be preceded by a backslash. This is what Zotero and Calibre do, for example, and it makes everyone's life much easier. It's also already supported by the Citar file parser code. We only split apart file fields at unescaped semicolons, and automatically unescape the constituent filenames.
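
A small Python sketch of why escaping helps, covering both the writing and the reading side. This is illustrative only: the function names and filenames are hypothetical, and it assumes the exporter escapes backslashes, colons, and semicolons with a leading backslash:

```python
import re


def escape_path(path):
    """Backslash-escape the characters with structural meaning in a file field."""
    return re.sub(r"([\\:;])", r"\\\1", path)


def join_file_field(paths):
    """Writer side: escape each path, then join with ';'."""
    return ";".join(escape_path(p) for p in paths)


def split_file_field(field):
    """Reader side: split at unescaped ';' and unescape in a single pass."""
    paths, current = [], []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            # Take the escaped character literally (empty if the field ends here).
            current.append(next(chars, ""))
        elif ch == ";":
            paths.append("".join(current))
            current = []
        else:
            current.append(ch)
    paths.append("".join(current))
    return paths


# Hypothetical filenames; the first contains semicolons, like the title mentioned earlier.
paths = [
    "All Papers/X/Example 2018 - first; second; and third.pdf",
    "All Papers/X/Example 2018 - Supplement.pdf",
]
field = join_file_field(paths)
round_trip = split_file_field(field)
naive_parts = ";".join(paths).split(";")
```

With escaping, the round trip returns the original two paths; without it, a plain split on ; cuts the first filename into three pieces.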

enbrown commented 1 year ago

@roshanshariff Yes, filenames can contain semicolons. Given that users can now control the filenames (and how the files are organized in Google Drive storage; see below), escaping the semicolon is probably a good idea. I've commented on the Paperpile forum. We'll see what comes of it.

If someone doesn't stray too far from the defaults, splitting on ; All Papers/ should work well. Perhaps this fix/hack can be put in the wiki once it's tested.
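
For anyone who wants to try this, here is an untested-in-practice Python sketch of the heuristic (the logic would need porting to Emacs Lisp for Citar). The function name and sample filenames are hypothetical; it assumes every file path starts with the default All Papers/ prefix and allows optional whitespace after the separator:

```python
import re


def split_paperpile_field(field, prefix="All Papers/"):
    """Split only at ';' that is directly followed by the library prefix."""
    # The lookahead keeps the prefix as part of the next path.
    pattern = r";\s*(?=" + re.escape(prefix) + ")"
    return [part for part in re.split(pattern, field) if part]


field = ("All Papers/X/Example 2018 - first; second; and third.pdf"
         ";All Papers/X/Example 2018 - Supplement.pdf")
files = split_paperpile_field(field)
```

The semicolons inside the first filename are left alone because they are not followed by the prefix. The heuristic breaks if a title itself contains the prefix after a semicolon, or once users customize their directory structure.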

emacs-citar / citar

Revise citar-file-parser-triplet to allow commas in filenames #454