emacs-citar / citar

Emacs package to quickly find and act on bibliographic references, and edit org, markdown, and latex academic documents.
GNU General Public License v3.0

Revise citar-file-parser-triplet to allow commas in filenames #454

Open bdarcus opened 2 years ago

bdarcus commented 2 years ago

Apparently commas do turn up in filenames, and this parser breaks when they do.

So we should figure out how to fix it to be more robust.

roshanshariff commented 2 years ago

I'm looking deeper into the file field parsing as part of the new cache design (#634). I've looked at the file fields produced by different programs:

I don't have Mendeley or PaperPile (etc.), but it would be good to know what conventions they use.

It seems like a tricky problem to try to handle all these automatically. Calibre, especially, is an outlier with its ,-separated fields. Perhaps that can be handled by a different parser that needs to be enabled?

bdarcus commented 2 years ago

> It seems like a tricky problem to try to handle all these automatically. Calibre, especially, is an outlier with its ,-separated fields. Perhaps that can be handled by a different parser that needs to be enabled?

Makes sense.

We already have citar-file-parser-functions. We could just set it to more robust functions that cover a larger share of users, and require user configuration for anything else.

Or force users to choose, with a simpler defcustom?

roshanshariff commented 2 years ago

I like the idea of making the default parser more robust, but requiring users to add more citar-file-parser-functions to handle oddball formats.
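To make the proposal concrete, here is a rough Python sketch of the "robust default plus opt-in extras" idea. It is not Citar's implementation: the function names (parse_semicolon, parse_comma, parse_file_field) are hypothetical, and how Citar actually combines the results of citar-file-parser-functions is not described in this thread. The sketch simply collects candidate paths from every enabled parser.

```python
from typing import Callable, List

# A parser takes the raw file field and returns candidate file paths.
Parser = Callable[[str], List[str]]


def parse_semicolon(field: str) -> List[str]:
    """Hypothetical default parser: split the field on semicolons."""
    return [p.strip() for p in field.split(";") if p.strip()]


def parse_comma(field: str) -> List[str]:
    """Hypothetical opt-in parser for comma-separated fields."""
    return [p.strip() for p in field.split(",") if p.strip()]


# Only the default parser is enabled out of the box.
DEFAULT_PARSERS: List[Parser] = [parse_semicolon]


def parse_file_field(field: str, parsers: List[Parser]) -> List[str]:
    """Collect candidates from each parser in order, dropping duplicates."""
    candidates: List[str] = []
    for parser in parsers:
        for path in parser(field):
            if path not in candidates:
                candidates.append(path)
    return candidates


# A user with an oddball format would opt in to the extra parser.
user_parsers = DEFAULT_PARSERS + [parse_comma]
```

With the extra parser enabled, a comma-separated field yields both the unsplit string and the split pieces as candidates, so a later step (assumed here, not shown) would have to keep only the candidates that exist on disk.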

enbrown commented 1 year ago

Just for reference, Paperpile's files are stored without colons in the filenames. Here are some representative examples:

When multiple files are associated with a reference, they are separated by a ; and all have the same directory prefix. I did find a few references that included semicolons in the article titles which were then included in the filenames. One was a 2018 paper whose title included a ;-delimited list: the file was All Papers/K/Kato et al. 2018 - Agreement among Goldmann applanation tonometer, iCare ... rs; non-contact tonometer; and Tonopen XL in healthy elderly subjects.pdf. Yikes!

So it seems reasonable that someone could write a parser function for citar-file-parser-functions that understands the Paperpile directory structure and uses it to split the file fields. (A minor complication is that Paperpile will, in the future, allow users to customize the organization and directory structure of their libraries.)

roshanshariff commented 1 year ago

@enbrown, thanks for looking into this! If I understand correctly, PaperPile uses semicolons to separate the filenames in the file fields, but then also has potential semicolons in the filename? You're suggesting using the directory prefix to disambiguate the field-separator semicolons from the in-filename semicolons, but this strikes me as very brittle...

If you're a PaperPile user, perhaps you could consider filing a bug to get the developers to escape semicolons and other special characters in filenames? So, when writing the file field in a BibTeX bibliography, special characters (at least semicolons and possibly also colons) should be preceded by a backslash. This is what Zotero and Calibre do, for example, and it makes everyone's life much easier. It's also already supported by the Citar file parser code. We only split apart file fields at unescaped semicolons, and automatically unescape the constituent filenames.
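As an illustration of the escaping convention described above, here is a minimal Python sketch of splitting a file field at unescaped semicolons and then unescaping each piece. It is not the Citar code, which is written in Emacs Lisp; the function name split_unescaped and the sample paths are made up for the example.

```python
from typing import List


def split_unescaped(field: str, sep: str = ";") -> List[str]:
    """Split `field` at unescaped `sep`, removing one level of backslash escaping."""
    parts: List[str] = []
    current: List[str] = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            # A backslash escapes the next character: keep that character
            # literally. A trailing lone backslash is kept as-is.
            current.append(next(chars, "\\"))
        elif ch == sep:
            # An unescaped separator ends the current filename.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    # Drop empty pieces, e.g. from a trailing separator.
    return [p.strip() for p in parts if p.strip()]


# Hypothetical field: the first filename contains an escaped semicolon.
example = split_unescaped(r"/lib/one\; two.pdf;/lib/three.pdf")
```

Because the escaped semicolon is consumed together with its backslash, it never counts as a separator, and the filename comes out with the plain semicolon restored.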

enbrown commented 1 year ago

@roshanshariff Yes, there are potential semicolons in the filenames. Given that the user can now control the filenames (and how the files are organized in Google Drive storage, see below), escaping the semicolon is probably a good idea. I've commented on the Paperpile forum. We'll see what comes of it.

If someone doesn't stray too far from the defaults, splitting on ; All Papers/ should work well. Perhaps this fix/hack can be put in the wiki once it's tested.
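A sketch of that hack in Python, assuming the default "All Papers/" prefix and that every file in the field starts with it. The function name split_paperpile and the sample filenames are hypothetical, and this is untested against real Paperpile exports.

```python
import re
from typing import List

PREFIX = "All Papers/"  # default directory prefix mentioned in this thread


def split_paperpile(field: str, prefix: str = PREFIX) -> List[str]:
    """Split only at semicolons that are followed by the directory prefix."""
    # The lookahead leaves the prefix in place, so each piece keeps its full
    # path. Semicolons inside a title are not followed by the prefix and
    # therefore stay part of the filename.
    pattern = r";\s*(?=" + re.escape(prefix) + r")"
    return [p.strip() for p in re.split(pattern, field) if p.strip()]


# Hypothetical field with semicolons inside the first filename.
field = "All Papers/K/Kato 2018 - a; b; and c.pdf; All Papers/S/Smith 2020 - x.pdf"
files = split_paperpile(field)
```

This remains as brittle as noted earlier in the thread: it fails if a title happens to contain "; All Papers/", or if the user changes the directory layout.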

image