andras-simonyi / citeproc-el

A CSL 1.0.2 Citation Processor for Emacs.
GNU General Public License v3.0
85 stars 9 forks source link

Add links on doi, pmid, etc. #22

Closed bdarcus closed 3 years ago

bdarcus commented 3 years ago

Another non-essential-nice-to-have feature to consider.

See this commit.

The option would just allow this in HTML export:

https://doi.org/10.1525/aa.2001.103.1.45

... or this:

doi:10.1525/aa.2001.103.1.45

.... to be:

<a href="https://doi.org/10.1525/aa.2001.103.1.45" class="biblink">https://doi.org/10.1525/aa.2001.103.1.45</a>

... or:

<a href="https://doi.org/10.1525/aa.2001.103.1.45" class="biblink">doi:10.1525/aa.2001.103.1.45</a>
andras-simonyi commented 3 years ago

In theory, citproc-el automatically links the content of URL fields this way in its HTML and LaTeX output. Have you encountered different behaviour?

bdarcus commented 3 years ago

OK, great. But it doesn't work for dois currently; right? I get this in HTML export, without a link.

https://doi.org/10.1525/aa.2001.103.1.45
andras-simonyi commented 3 years ago

No, dois are not linked in HTML output.

bdarcus commented 3 years ago

OK, revised the title and description to narrow it.

The doi issue in general threw me for a loop, and I just noticed this while trying to figure that out.

bdarcus commented 3 years ago

What's odd is you have a doi in regular text, org-mode will export it linked in html.

bdarcus commented 3 years ago

BTW, I noticed there's this, but no equivalent for CSL.

(defcustom org-cite-biblatex-options nil
titaniumbones commented 3 years ago

EDIT:

I'd love to submit a PR to change this behaviour but haven't been able to understand the code well enough to instrument it in an efficient way. Here's what I've tried so far:

titaniumbones commented 3 years ago

also: not sure if this deserves its own issue, but I would love to move the URL from visible text in my style to a simple <a> link around the title, as described in this comment on citeproc's issue 76, referenced above. Presumably this would be accomplished with some kind of (rendered-var . title) lambda in bibtex-formatters, but I'm not quite sure how to do it in the existing citeproc.el structure, which uses a bunch of more advanced lisp stuff that makes my head hurt.

bdarcus commented 3 years ago

As in #25? See, in particular: https://github.com/andras-simonyi/citeproc-el/issues/25#issuecomment-855225435.

titaniumbones commented 3 years ago

As in #25? See, in particular: #25 (comment).

huh, actually no, though that is super cool. I want instead something like

McNeil, Donald. 2011. "<a` href="https://foo.org/bar">Fine Grain, Global City...</a>." etc. etc. http://dx.doi.org/1023812.123.3423

which is e.g. how Zotpress manages linking, if I'm understanding that citeproc issue properly.

titaniumbones commented 3 years ago

@andras-simonyi any pointers on how I might try to amend the code to make DOI linkable? And is this something you would support?

andras-simonyi commented 3 years ago

Could you specify the required behavior? If the values of the DOI and PMID fields are [doi] and [pmid] then what should the link targets be exactly? https://doi.org/[doi] for DOI? And what about PMIDs? Or should we make these mappings configurable?

bdarcus commented 3 years ago

Here's the PR for the appendix we added to the CSL spec. Is this helpful?

https://github.com/citation-style-language/documentation/pull/86

titaniumbones commented 3 years ago

I think there is some complexity here as the style itself may add the http:// prefix. e.g., here is apa.csl:

  <macro name="access">
    <choose>
      <if variable="DOI" match="any">
        <text variable="DOI" prefix="https://doi.org/"/>
      </if>  
     .......

    </choose>
  </macro>

I personally would prefer to have a defcustom for this, e.g.

(defcustom citeproc-doi-prefix "https://doi.org/"
  "URL prefix for DOI fields in exported bibliography entries")

and then on export replace the DOI with

<a href="{citeproc-doi-prefix}{DOI}">{DOI}</a>

or similar.

Does that sound right to you, Bruce? Do you know if this will screw things up with styles such as APA?

titaniumbones commented 3 years ago

(also, I don't really work with PMID so have nothing to say about that unfortunately).

bdarcus commented 3 years ago

I think there is some complexity here as the style itself may add the http:// prefix. e.g., here is apa.csl:

Isn't that orthogonal to actual linking though?

That code determines what will be rendered; not what the link target is.

The PR, OTOH, mostly focuses on what the target should be.

Does that sound right to you, Bruce? Do you know if this will screw things up with styles such as APA?

I'm not really sure.

The idea of that PR was let's see how far we can get with some basic rules, implemented in processors, with the expectation it would cover most use cases.

I'd like see what @andras-simonyi thinks about this. There's always room to iterate those recommendations.

titaniumbones commented 3 years ago

I think there is some complexity here as the style itself may add the http:// prefix. e.g., here is apa.csl:

Isn't that orthogonal to actual linking though?

That code determines what will be rendered; not what the link target is.

I don't really know enough about where citeproc.el interfaces with the stylesheet. From what I can tell, the lambda functions in citeproc-fmt--html-alist are responsible for that, and I imagine they would have to transform the field before the stylesheet acts on it. If that's the case, then the instruction

<text variable="DOI" prefix="https://doi.org/"/>

would potentially end up producing a string like https://doi.org<a href="https://doi.org/XXX.XXX.XX>https://doi.org/XXX.XXX.XX<a/>

Is this a legitimate worry?

I would test it out but there's something I clearly don't understand, as the modifications I have tried to make to citeproc-formatters.el seem to have no effect.

bdarcus commented 3 years ago

I don't know this code, but from a CSL perspective, the snippet you show above just says to output DOIs as:

https://doi.org/XXX.XXX.XX

The new spec recommendation suggests doing this in HTML:

<a href="https://doi.org/XXX.XXX.XX">https://doi.org/XXX.XXX.XX</a>

So for sake of argument, if the above CSL fragment did not have a prefix, the result would be:

<a href="https://doi.org/XXX.XXX.XX">XXX.XXX.XX</a>
andras-simonyi commented 3 years ago

Thanks for all the comments, they are very useful! I've created a new branch 22-add_links for this feature and already added linking rendered DOIs, PMIDs and PMCIDs. Could you perhaps test whether the code does what it should? (It might also help in understanding the formatter "framework".) What is missing is to link the title if no explicit URL etc is present -- I intend to add that in the next 1-2 days.

bdarcus commented 3 years ago

Could you perhaps test whether the code does what it should?

As you know, I've had some issues getting these feature branches working together as I expect (for example, dates still go missing with json).

I tried to test this branch, but the output isn't including any dois, which maybe means I have something wrong with my data files, that I don't see ATM. Or some issue with the setup.

The code looks right though.

WDYT?

bdarcus commented 3 years ago

Oh wait, oddly, it works if I use bib files.

Which turns up an issue. This is what I get now:

https://doi.org/
<a href="https://doi.org/10.1111/j.1468-2427.2010.01027.x">10.1111/j.1468-2427.2010.01027.x</a>

I think it should be:

<a href="https://doi.org/10.1111/j.1468-2427.2010.01027.x">https://doi.org/10.1111/j.1468-2427.2010.01027.x</a>

Right? You're putting the link-prefix before the link ATM.

Edit: now I'm not sure sure. Does it depend where that prefix is coming from?

titaniumbones commented 3 years ago

I also had a little trouble testing -- I'm not sure exactly why, but I think maybe because the lexical binding of citeproc-fmt--formatters-alist means that its value doesn't actually update when the library is reloaded.? In any case, I wish that it were possible to create the formatters dynamically in case the value of the citeproc-fmt--html-alist or citeproc-fmt--latex-alist changes after the library is loaded, e.g., perhaps in a let-binding during export. .

Anyway, this is what i get (similar to Bruce):

<div class="csl-entry"><a name="citeproc_bib_item_4"></a>Zubaida, S. (2009). The idea of “Indian food”, between the colonial and the global. <i>Food &amp; History</i>, <i>7</i>(1), 191–209. https://doi.org/<a href="10.1484/j.food.1.100643">10.1484/j.food.1.100643</a></div>

Bruce, I don't think the issue here is with citeproc-el, but with CSL itself -- the style returns a text string with https://doi.org/ prepended to the DOI. citeproc-el can only modify the value of the DOI itself, if I understand correctly. So to get the output you wnat (and which is obviously preferable) you'd either have to modify the style or do a regexp search-and-replace on the rendered citation itself. I suck at regex so that would likely take me many hours; a partial solution that works for doi links at least is:

(defun citeproc-fmt--html-bib-formatter (items _bib-format)
  "Return a html bibliography from already formatted ITEMS."
  (concat "<div class=\"csl-bib-body\">\n"
      (mapconcat (lambda (i)
               (concat "  <div class=\"csl-entry\">"
                               (replace-regexp-in-string
                                "\\(https://doi.org/\\)\\(<a href=\".+?\">\\)\\(.+?\\)\\(</a>\\)" 
                                "\\2\\1\\3\\4"
                                i )
                               "</div>\n"))
             items
             "")
      "</div>"))

Is there a better solution though?

titaniumbones commented 3 years ago

freaking emacs regexp backslash tax is killing me, gotta say.

titaniumbones commented 3 years ago

how about this, actually:

(defun citeproc-fmt--html-bib-formatter (items _bib-format)
  "Return a html bibliography from already formatted ITEMS."
  (concat "<div class=\"csl-bib-body\">\n"
      (mapconcat (lambda (i)
               (concat "  <div class=\"csl-entry\">"
                               (replace-regexp-in-string
                                "\\(https://doi.org/\\|http://www.ncbi.nlm.nih.gov/pubmed/\\|http://www.ncbi.nlm.nih.gov/pmc/articles/\\)\\(<a href=\".+?\">\\)\\(.+?\\)\\(</a>\\)"
                                ;;"\\(https://doi.org/\\)\\(<a href=\".+?\">\\)\\(.+?\\)\\(</a>\\)" 
                                "\\2\\1\\3\\4"
                                i )
                               "</div>\n"))
             items
             "")
      "</div>"))

It covers all the URL prefixes I found in the csl styles repository. It doesn't link the PMID:, or PMCID prefixes in some journal formats. I think this may be a weakness of the new CSL recommendation, or anyway an indication that it's incompatible with some styles? What do you think, Bruce?

Edit: still feels pretty ad hoc, though.

bdarcus commented 3 years ago

Bruce, I don't think the issue here is with citeproc-el, but with CSL itself -- the style returns a text string with https://doi.org/ prepended to the DOI.

I'm actually still not sure on what is responsible how; only that the output it not ideal.

titaniumbones commented 3 years ago

I'm actually still not sure on what is responsible how; only that the output it not ideal.

Did you try my fix above?

bdarcus commented 3 years ago

Did you try my fix above?

No, I didn't.

It covers all the URL prefixes I found in the csl styles repository. It doesn't link the PMID:, or PMCID prefixes in some journal formats. I think this may be a weakness of the new CSL recommendation, or anyway an indication that it's incompatible with some styles?

Well, like I said, it was a first cut at a recommendation. So it could be that this experience suggests revisions.

But to go back to the basic CSL -> output logic.

Let's assume this CSL:

Case 1

<text variable="DOI" prefix="doi: "/>

Case 2

<text variable="DOI" prefix="https://doi.org/"/>

So we have one of two choices:

  1. say the prefix is included in the link, which would mean in Case 1 <a href="https://doi.org/XXX.XXX.XX">doi: XXX.XXX.XX</a>.
  2. say the prefix precedes the link, unless the prefix preceding the link is in fact a base URL, in which case do 1

Right?

titaniumbones commented 3 years ago

So we have one of two choices:

1. say the prefix is included in the link, which would mean in Case 1 `<a href="https://doi.org/XXX.XXX.XX">doi: XXX.XXX.XX</a>`.

2. say the prefix precedes the link, unless the prefix preceding the link is in fact a base URL, in which case do 1

Right?

Well, I think both you and Andras will understand this a lot better than me, but as I understand it, the various functions in citeproc-fmt--html-alist are only able to act directly on the underlying values in the dataset. The CSL transformations themselves rae out of scope; when we have something like

<text value="doi:"/>
<text variable="DOI"/>

citeproc-el has no way to know about the "doi:" in advance. So, the only way to fix it is some post-hoc manipulation, I think.

though maybe I misunderstand. Working my way thorugh citeproc-rt right now.

titaniumbones commented 3 years ago

It's a bit beyond me but it seems like the solution might be in citeproc-rt-render-affixes (citeproc-rt, line 413). I'm not sure it's entirely simple to adjust. Lines 437-442 or so look particularly relevant.

I wish the lambda functions in citeproc-fmt--html-alist had access to the whole citation object somehow. But maybe they do and I just don't know.

andras-simonyi commented 3 years ago

Thanks again for the comments! @titaniumbones: just a short explanation of what is going on code-wise: as you have seen, the library uses its own internal rich text representation, and, ideally, formatters should act only as (relatively) simple converters to produce the actual outputs in the required formats. The formatter API has two layers: a low-level one, which is very general: the only assumption made about the RT component of a citeproc-formatter struct is that it maps rich texts to their properly formatted version, and this function is called with the whole rich text formatted bibliography entry. The other, "convenience layer" of the API provides some functions that -- hopefully -- make creating this function easier, because they require only a mapping from rich text attribute-value pairs to corresponding text manipulation functions.

Because of the above setup, it seems to me that, ideally, linking should happen already at the internal rich text representation level, so I will try to change the code this way.

As for the trouble with prefixes that add URI components to some variable values: I see one potential problem with the recommendation as it is. If the URI resulting from adding the prefix is different from the recommended target then we end up linking an URI anchor to a different target which is rather confusing IMHO.

titaniumbones commented 3 years ago

I'll look forward to seeing what you produce -- hopefully once you have an elegant solution I will be able to study it to understand a little better what is going on.

As I have worked on this I realize that one end result I would very much like to be able to produce is an informational tooltip or title attribute that can be accessed by hovering over the citation, much as oc-basic's activate echoes a semi-formatted citation to the minibuffer. It seems clear to me that that should not be citeproc's job, but I wonder if there's anything in the architecture that can be leveraged by a processor (ox-csl or something derived from it) to make such features easier to implement. For instance,is there a finishing stage at which the citation can get access to the final bibliography entr[y|ies] and potentially build a tooltip out of them; or could the parsed bibliography entry text just be joined by spaces to create a title attribute on the citatlion link? Etc.

I am shocked that we have finally gotten to this stage after so many years of half-baked solutions. I think it will revolutionize my writing, actually.

bdarcus commented 3 years ago

As I have worked on this I realize that one end result I would very much like to be able to produce is an informational tooltip or title attribute that can be accessed by hovering over the citation, much as oc-basic's activate echoes a semi-formatted citation to the minibuffer.

https://github.com/andras-simonyi/org-cite-csl-activate

If you hover on a citation, you will get a pop-up with a mini-bibliography.

There was talk of including that functionality in oc-csl; not sure what the status of that is.

bdarcus commented 3 years ago

I see one potential problem with the recommendation as it is. If the URI resulting from adding the prefix is different from the recommended target then we end up linking an URI anchor to a different target which is rather confusing IMHO.

If you have suggestions for amendments for that language based on this work, please let me know. Maybe best would be a PR, or an issue here:

https://github.com/citation-style-language/documentation

andras-simonyi commented 3 years ago

I think I've implemented (still in the feature branch) everything required by Appendix VI -- I'd appreciate if you could try it out and comment. Thanks!

bdarcus commented 3 years ago

I did a quick test with chicago-note and apa and looks good to me @andras-simonyi.

Thanks.

So do you conclude we need to amend that recommendation?

titaniumbones commented 3 years ago

it looks awesome to me.

I haven't looked through the code yet, but am I right that there is supposed to be a case in which the URL gets wrapped around the title? Do you know of a style I should try out in order to test this?

On 7/7/21 11:22 AM, bdarcus wrote:

EXTERNAL EMAIL:

I did a quick test with chicago-note and apa and looks good to me @andras-simonyi https://github.com/andras-simonyi.

Thanks.

So do you conclude we need to amend that recommendation?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/andras-simonyi/citeproc-el/issues/22#issuecomment-875697296, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACY6NHIC626WRQHH5T5ENDTWRWL3ANCNFSM4522CSLA.

andras-simonyi commented 3 years ago

I haven't looked through the code yet, but am I right that there is supposed to be a case in which the URL gets wrapped around the title? Do you know of a style I should try out in order to test this?

Yes, I managed to test it with the apa-no-doi-no-issue style in the repository.

andras-simonyi commented 3 years ago

@bdarcus wrote:

So do you conclude we need to amend that recommendation?

Well, looking at some of the styles it's difficult to tell. I find it problematic to link a well formed URL string as anchor to a different URL but it appears to be the safest option. Some of the styles I've seen add URL prefixes to DOIs which don't make much sense IIRC.

andras-simonyi commented 3 years ago

Meanwhile I've merged the associated feature-branch into master, so planning to close this issue soonish.