emacs-citar / citar

Emacs package to quickly find and act on bibliographic references, and edit org, markdown, and latex academic documents.
GNU General Public License v3.0

Files referred in file-field with ':' in their name and/or multiple files are parsed incorrectly #599

Open hpgisler opened 2 years ago

hpgisler commented 2 years ago

Describe the bug

An ebib entry referencing two files, named e.g. a:a.pdf and b.pdf, is parsed incorrectly by the two citar functions:

There are 2 problems:

  1. a:a.pdf is interpreted exclusively as 'Calibre / Mendeley' format; however, this is not the case here
  2. the file separator ; in the ebib entry has a trailing space character, which is not removed by the split-string function
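
The second problem can be reproduced in the scratch buffer. This is only a minimal sketch using the built-in split-string; the variable name my/file-field is hypothetical, and the snippet is not citar's actual parser code.

```elisp
;; Hypothetical sample value mirroring the file field from this report.
(defvar my/file-field "a:a.pdf; b.pdf")

;; Splitting on ";" alone keeps the space that follows the separator.
(split-string my/file-field ";")
;; => ("a:a.pdf" " b.pdf")

;; Passing a TRIM regexp (fourth argument) strips surrounding whitespace.
(split-string my/file-field ";" t "[ \t]+")
;; => ("a:a.pdf" "b.pdf")
```

With the leading space left in place, the second name would presumably be looked up as " b.pdf", which does not exist on disk.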

To Reproduce

  1. create an ebib entry as follows:
    @Article{atakishiyev21:explain,
    file = {a:a.pdf; b.pdf},
    author = {Atakishiyev, Shahin and Salameh, Mohammad and Yao, Hengshuai and Goebel, Randy},
    journal = {arXiv e-prints},
    title = {{E}xplainable artificial intelligence for autonomous driving: {A}n overview  and guide for future research directions},
    }
  2. create the two files in your corresponding papers folder
  3. insert citation in some org doc.
  4. open/follow the link
  5. the two files are not shown for selection

Expected behavior

The two files should be shown / presented for selection

Emacs version:

"GNU Emacs 28.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.33, cairo version 1.17.6) of 2022-04-28"

bdarcus commented 2 years ago

I'll take a closer look later, but I think this is related to #578/#579.

cc @roryk

hpgisler commented 2 years ago

The following quick-hack fix works for my use case:

    (defun hgi/citar-file-parser-arxiv (dirs file-field)
      "Return a list of files from DIRS and FILE-FIELD.
    Works for
    - ebib entries with multiple file entries;
    - file names containing ':'
    This is a 'quick-hack-fix' for citar bug:
    https://github.com/bdarcus/citar/issues/599"
      ;; Split on ';' only, dropping empty strings and trimming a space
      ;; from either end of each file name.
      (let ((files (split-string file-field "[;]" 'omit-nulls " ")))
        ;; Expand every file name against every directory.
        (delete-dups
         (seq-mapcat
          (lambda (dir)
            (mapcar
             (lambda (file)
               (expand-file-name file dir))
             files))
          dirs))))

Activated with: (setq citar-file-parser-functions '(hgi/citar-file-parser-arxiv))

bdarcus commented 2 years ago

What's the significance of arxiv with respect to the colon?

I wonder also if my suggestion to @roryk to not split the parsers was a mistake: https://github.com/bdarcus/citar/issues/578#issuecomment-1107474787.

E.g. in your case, if it only split on the semi-colon, you would have never run into the problem.

hpgisler commented 2 years ago

Regarding the colon's significance

The ELPA package arxiv-citation (which I'm using) downloads papers from arXiv. It creates a local PDF whose filename corresponds to the paper's title.

Now, if the title contains a colon, I run into said problem. E.g. https://arxiv.org/abs/2112.11561 creates the following filename: atakishiyev-salameh-yao-goebel_explainable-artificial-intelligence-for-autonomous-driving:-an-overview--and-guide-for-future-research-directions.pdf

hpgisler commented 2 years ago

Regarding your suggestion to not split the parsers (https://github.com/bdarcus/citar/issues/578#issuecomment-1107474787)

One argument for not splitting might be that traversing all the parsers could get expensive when there are a lot of folders with papers to search through, on the order of p_parsers * f_folders.

From a software-engineering perspective (decoupling), it would perhaps be better to have one parser per job.

bdarcus commented 2 years ago

> Now, if the title contains a colon, I run into said problem.

I wonder if it's worth a bug report to that package? Seems to me arxiv-citation-pdf-name should split the title on the colon (or on a question mark etc.), and only use the main title.
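
As a rough illustration of that suggestion, the main title could be extracted along these lines. This is a sketch only: my/main-title is a hypothetical helper, not part of arxiv-citation or citar, and the choice of delimiters (colon and question mark) is an assumption taken from the comment above.

```elisp
(require 'subr-x) ; for `string-trim' on older Emacs versions

(defun my/main-title (title)
  "Return TITLE up to the first colon or question mark, trimmed.
Hypothetical helper; not part of arxiv-citation or citar."
  (string-trim (car (split-string title "[:?]"))))

(my/main-title "Explainable artificial intelligence for autonomous driving: An overview and guide for future research directions")
;; => "Explainable artificial intelligence for autonomous driving"
```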

hpgisler commented 2 years ago

Hmm, as far as I understand, in this specific example, the parts before and after the colon together form the actual title of the paper. So I would assume that building the filename from both parts, including the colon, makes sense?

On the other hand, if a title also included e.g. a semicolon, then it would definitely be problematic, as biblatex also uses the semicolon to separate multiple files in the file tag.

Perhaps replacing all those special characters in a title would be the way to go?
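
A minimal sketch of what such a replacement might look like, assuming the problematic characters are the colon, semicolon, and question mark. The helper my/sanitize-file-name is hypothetical and does not exist in arxiv-citation or citar.

```elisp
(defun my/sanitize-file-name (name)
  "Replace runs of colons, semicolons, and question marks in NAME with \"-\".
Hypothetical helper; the character set is an assumption."
  (replace-regexp-in-string "[:;?]+" "-" name))

(my/sanitize-file-name "a:b;c.pdf")
;; => "a-b-c.pdf"
```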

bdarcus commented 2 years ago

> in this specific example, the part before and after the colon form the actual title of the paper.

No; the colon delimits title and subtitle. The title is just "Explainable artificial intelligence for autonomous driving".

Since that function already includes the author names to help disambiguate, I see no point in including the subtitle.

> Perhaps replacing all those special characters in a title would be the way to go?

That's another option, but the file name in this example still ends up ridiculously and unnecessarily long.

In any case, the colon in the file name itself is arguably the bug.

hpgisler commented 2 years ago

Sorry to be obstinate about this one, but why do you think that the string past the colon is the subtitle?

In this specific case, arXiv's 'Export Bibtex Citation' yields:

@misc{https://doi.org/10.48550/arxiv.2112.11561,
  doi = {10.48550/ARXIV.2112.11561},
  url = {https://arxiv.org/abs/2112.11561},
  author = {Atakishiyev, Shahin and Salameh, Mohammad and Yao, Hengshuai and Goebel, Randy},
  keywords = {Artificial Intelligence (cs.AI), Computers and Society (cs.CY), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Explainable artificial intelligence for autonomous driving: An overview and guide for future research directions},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}

(no subtitle nor titleaddon tags here)

However, I've now also opened an issue over at arxiv-citation

bdarcus commented 2 years ago

> Sorry to be obstinate about this one, but why do you think that the string past the colon is the subtitle?

It's a convention so widely understood, at least in English-language scholarship, that I didn't think I needed to explain ;-)

https://style.mla.org/punctuation-with-titles/#:~:text=Titles%20and%20Subtitles&text=1%20of%20the%20eighth%20edition,of%20the%20title%20or%20subtitle.%E2%80%9D

> no subtitle nor titleaddon tags here

There's lots of non-ideal bibliographic data.

bdarcus commented 2 years ago

To be clear, though, that string represents the full title, which is main title + subtitle.

bdarcus commented 2 years ago

Also related to #454

enbrown commented 1 year ago

> It's a convention so widely understood, at least in English-language scholarship, that I didn't think I needed to explain ;-)

I respectfully disagree: at least in English-language medical and biochemistry literature (that I'm familiar with) the use of a colon in a title (where journal articles don't really have a concept of a subtitle) is terribly common. So a convention that might work well for books or other fields doesn't work everywhere. In my BibTeX file of over 5k references, over 1k have a title with a colon in it.

Thankfully my reference manager (Paperpile) didn't include any of them in the filename.

bdarcus commented 12 months ago

Should we close this, or is there some change we should make?