Open bdarcus opened 2 years ago
I'm looking deeper into the file field parsing as part of the new cache design (#634). I've looked at the file fields produced by different programs:
- `;`-separated file fields; each component is a `:`-separated triplet in which `:` or `;` characters are backslash-escaped, with a type entry such as `application/pdf` or `html`.
- `,`-separated file fields (not semicolon!); each component is a `:`-separated triplet with a type entry such as `PDF` or `EPUB`.
- `;` by default (#599).
I don't have Mendeley or PaperPile (etc.), but it would be good to know what conventions they use.
It seems like a tricky problem to try to handle all these automatically. Calibre, especially, is an outlier with its `,`-separated fields. Perhaps that can be handled by a different parser that needs to be enabled?
Makes sense.
We already have `citar-file-parser-functions`. We could just set those to more robust functions that cover a larger share of users, and require user config for anything else.
Or force users to choose, with a simpler defcustom?
I like the idea of making the default parser more robust, but requiring users to add more `citar-file-parser-functions` to handle oddball formats.
Just for reference, Paperpile's files are stored without colons in the filenames. Here are some representative examples:
When multiple files are associated with a reference, they are separated by a `;` and all have the same directory prefix. I did find a few references that included semicolons in the article titles which were then included in the filenames. One was a 2018 paper whose title included a `;`-delimited list: the file was `All Papers/K/Kato et al. 2018 - Agreement among Goldmann applanation tonometer, iCare ... rs; non-contact tonometer; and Tonopen XL in healthy elderly subjects.pdf`. Yikes!
So it seems reasonable that someone would be able to write a parser function for `citar-file-parser-functions` that understands the Paperpile directory structure to split the file fields. (A minor complication is that Paperpile will, in the future, allow users to customize the organization and directory structure of their libraries.)
@enbrown, thanks for looking into this! If I understand correctly, PaperPile uses semicolons to separate the filenames in the file fields, but then also has potential semicolons in the filename? You're suggesting using the directory prefix to disambiguate the field-separator semicolons from the in-filename semicolons, but this strikes me as very brittle...
If you're a PaperPile user, perhaps you could consider filing a bug to get the developers to escape semicolons and other special characters in filenames? So, when writing the `file` field in a BibTeX bibliography, special characters (at least semicolons and possibly also colons) should be preceded by a backslash. This is what Zotero and Calibre do, for example, and it makes everyone's life much easier. It's also already supported by the Citar file parser code. We only split apart file fields at unescaped semicolons, and automatically unescape the constituent filenames.
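To make the escaping convention concrete, here is a minimal sketch in Python of splitting at unescaped separators and then unescaping each filename. It is not the Citar implementation (which is Emacs Lisp); the function name `split_file_field` is hypothetical, and the sketch assumes the field is a plain list of filenames in which a backslash makes the following character literal.

```python
def split_file_field(field: str, sep: str = ";") -> list[str]:
    """Split `field` at unescaped `sep` characters and unescape each part.

    Assumption: a backslash makes the next character literal, so `\\;`
    yields a literal semicolon. Paths that use backslashes as directory
    separators would need different handling.
    """
    parts: list[str] = []
    current: list[str] = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            # Keep only the escaped character; a trailing lone
            # backslash is kept as-is.
            current.append(next(chars, "\\"))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    # Drop surrounding whitespace and empty components.
    return [p.strip() for p in parts if p.strip()]


print(split_file_field(r"papers/a\; b.pdf;papers/c.pdf"))
# ['papers/a; b.pdf', 'papers/c.pdf']
```

Passing `sep=","` would cover a comma-separated variant in the same way, provided commas inside filenames are escaped too.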
@roshanshariff Yes, there are potential semicolons in the filenames. Given that the user now can control the filenames (and for how the files are organized into the Google Drive storage, see below), escaping the semicolon is probably a good idea. I've commented on the Paperpile forum. We'll see what comes of it.
If someone doesn't stray too far from the defaults, splitting on `; All Papers/` should work well. Perhaps this fix/hack can be put in the wiki once it's tested.
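A rough sketch of that prefix-based split, in Python rather than Emacs Lisp and untested against real Paperpile exports: it splits only at a semicolon that is followed by the directory prefix. The function name `split_paperpile_field` and the second filename in the example are hypothetical, and the default prefix `All Papers/` is assumed from the examples in this thread.

```python
import re


def split_paperpile_field(field: str, prefix: str = "All Papers/") -> list[str]:
    """Split a file field at semicolons that directly precede `prefix`.

    Semicolons inside a filename are left alone unless they happen to be
    followed by the prefix, which is why this remains a heuristic.
    """
    # Lookahead keeps the prefix as part of the following filename.
    pattern = r";\s*(?=" + re.escape(prefix) + ")"
    return [part for part in re.split(pattern, field) if part]


field = (
    "All Papers/K/Kato et al. 2018 - Agreement among Goldmann applanation "
    "tonometer, iCare ... rs; non-contact tonometer; and Tonopen XL in "
    "healthy elderly subjects.pdf"
    ";All Papers/S/Smith 2020 - Example.pdf"  # hypothetical second file
)
for name in split_paperpile_field(field):
    print(name)
```

With this input the Kato filename stays in one piece, semicolons included, and the second file is split off. A user who changes the library's directory layout would need to pass a different `prefix`.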
Apparently this happens, and will cause this parser to break when it does.
So we should figure out how to fix it to be more robust.