Juris-M / citeproc-js

A JavaScript implementation of the Citation Style Language (CSL) https://citeproc-js.readthedocs.io

Markup in names prevent initialization #173

Closed bwiernik closed 3 years ago

bwiernik commented 3 years ago

I am trying to add in-line markup of names (to indicate student collaborators on my CV). Doing this with given names breaks initialization.

>>===== MODE =====>>
bibliography
<<===== MODE =====<<

>>===== RESULT =====>>
<div class="csl-bib-body">
  <div class="csl-entry"><b>Doe</b>, <b>J. Q. </b></div>
</div>
<<===== RESULT =====<<

>>===== CSL =====>>
<style 
      xmlns="http://purl.org/net/xbiblio/csl"
      class="in-text"
      version="1.0">
  <info>
    <id />
    <title />
    <updated>2009-08-10T04:49:00+09:00</updated>
  </info>
  <macro name="author">
     <names variable="author">
        <name name-as-sort-order="all" and="symbol" sort-separator=", " initialize-with=". " delimiter=", " delimiter-precedes-last="always"/>
     </names>
  </macro>
  <citation>
    <layout>
      <text macro="author"/>
    </layout>
  </citation>
  <bibliography>
    <layout>
      <text macro="author"/>
    </layout>
  </bibliography>
</style>

<<===== CSL =====<<

>>===== INPUT =====>>
[
    {
        "author": [
            {
                "family": "<b>Doe</b>",
                "given": "<b>John Quiggly</b>"
            }
        ],
        "id": "ITEM-1",
        "type": "book"
    }
]
<<===== INPUT =====<<

>>===== VERSION =====>>
1.0
<<===== VERSION =====<<

fbennett commented 3 years ago

Well, that's an interesting one. It should be permitted, I'll see what can be done to fix it.

fbennett commented 3 years ago

Looks a bit tricky to fix. The processor has a method for separating markup from a string, but it would return the string content as an array. If the markup does not cover the full span (i.e. if it's applied to only one element of the name), the array will have multiple elements. If those are concatenated for evaluation, we lose the information needed to re-apply the markup to the mangled string. I'll think about this, but I don't see an obvious solution. Suggestions welcome!

bwiernik commented 3 years ago

I’m having a little trouble wrapping my head around the cases. Could you give some examples?

fbennett commented 3 years ago

It's not so much actual use cases, but rather that, when parsing free text, any possible input will eventually be received as input, and should have known behavior. Finished teaching for the day this morning, and I have some ideas about it. We may be able to apply initialization to individual elements, when markup makes them discrete. Will play with it a bit and post again later.

bwiernik commented 3 years ago

I didn’t mean use cases, but rather cases where this applies:

If the markup does not cover the full span (i.e. if it's applied to only one element of the name), the array will have multiple elements. If those are concatenated for evaluation, we lose the information needed to re-apply the markup to the mangled string.

fbennett commented 3 years ago

Sorry. Things like this:

<b>John</b> Paul

Which the citeproc-js parser will split into:

strings: ["", "John", "Paul"]

Other implementations may have smarter ways of handling it, but the current citeproc-js magic for applying initials to a string assumes ... a string. We can apply the same function to the individual elements, but that breaks 32 of the current test fixtures. Making it smarter isn't easy. If it's a high priority feature, it might be worth floating the use case on CSL Discuss and running it past Cormac to be sure it's recognized by everyone.
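
To make the per-element idea concrete, here is a minimal sketch. It is not the citeproc-js parser or its initialization code: `splitMarkup`, `initialize`, and `initializePerElement` are hypothetical names, only `<b>` and `<i>` are recognized, and unlike the output quoted above this split keeps the space before "Paul" in the third element. It shows that initializing each element separately works when a tag boundary falls between words, and goes wrong when a boundary falls inside a word.

```javascript
// Hypothetical helpers for illustration; not citeproc-js internals.

// Split a name on <b>/<i> tags. Text lands at even indices, tags at odd ones.
function splitMarkup(input) {
  const parts = input.split(/(<\/?(?:b|i)>)/);
  return {
    strings: parts.filter((_, i) => i % 2 === 0),
    tags: parts.filter((_, i) => i % 2 === 1),
  };
}

// Naive initialization: keep the first letter of each word and add a period.
function initialize(text) {
  return text.replace(/(\p{L})\p{L}*\.?/gu, "$1.");
}

// Initialize each text element on its own, then interleave the tags again.
function initializePerElement(input) {
  const { strings, tags } = splitMarkup(input);
  return strings
    .map((s, i) => initialize(s) + (tags[i] ?? ""))
    .join("");
}

console.log(initializePerElement("<b>John</b> Paul")); // <b>J.</b> P.
console.log(initializePerElement("Jo<b>hn</b> Paul")); // J.<b>h.</b> P.
```

The second call treats "Jo" and "hn" as two separate words, which is the kind of input a string-based abbreviation routine never has to consider.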

fbennett commented 3 years ago

I may have a solution. If we recompose the string after splitting out tags, but save a list of offsets for tag-insert points, we can then adjust the offsets in the code that performs abbreviation. The citeproc-js name formatting code isn't clean, and there may be complications that will foil the approach, but in theory ... we'll see how it works out tomorrow.
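
The offset idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that went into citeproc-js: `stripTags`, `abbreviateWithOffsets`, and `initializeWithMarkup` are hypothetical names, only `<b>` and `<i>` are handled, and abbreviation is reduced to "first letter plus period". The tags are removed while their insert offsets are recorded, the plain string is abbreviated, and a position map built during abbreviation moves each offset to its new location.

```javascript
// Hypothetical sketch; not the citeproc-js implementation.

// Remove tags, remembering where each one must be re-inserted in the plain string.
function stripTags(input) {
  const inserts = [];
  let plain = "";
  input.split(/(<\/?(?:b|i)>)/).forEach((part, i) => {
    if (i % 2 === 1) inserts.push({ offset: plain.length, tag: part });
    else plain += part;
  });
  return { plain, inserts };
}

// Abbreviate each word to an initial while building map[i]: the position in the
// abbreviated string that corresponds to position i in the plain string.
function abbreviateWithOffsets(plain, inserts) {
  const map = new Array(plain.length + 1);
  let out = "";
  let last = 0;
  for (const m of plain.matchAll(/(\p{L})\p{L}*\.?/gu)) {
    const start = m.index;
    const end = start + m[0].length;
    for (let i = last; i < start; i++) map[i] = out.length + (i - last);
    out += plain.slice(last, start); // separators are copied unchanged
    map[start] = out.length;
    out += m[1] + ".";
    // Any offset inside or at the end of the word moves to just after the initial.
    for (let i = start + 1; i <= end; i++) map[i] = out.length;
    last = end;
  }
  for (let i = last; i <= plain.length; i++) map[i] = out.length + (i - last);
  out += plain.slice(last);

  // Re-insert the tags at their adjusted offsets.
  let result = "";
  let pos = 0;
  for (const { offset, tag } of inserts) {
    result += out.slice(pos, map[offset]) + tag;
    pos = map[offset];
  }
  return result + out.slice(pos);
}

function initializeWithMarkup(input) {
  const { plain, inserts } = stripTags(input);
  return abbreviateWithOffsets(plain, inserts);
}

console.log(initializeWithMarkup("<b>John Quiggly</b>")); // <b>J. Q.</b>
console.log(initializeWithMarkup("<b>John</b> Paul"));    // <b>J.</b> P.
```

The design choice that matters is what to do with an offset that lands inside a word being shortened. This sketch pushes it to the end of the initial, which is one arbitrary policy among several.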

bwiernik commented 3 years ago

Okay, that makes sense. Let me know how it works out, otherwise I will post on Discourse and ask Cormac what he thinks.

fbennett commented 3 years ago

Seem to have it working. It's quite a tangle, and there are some small glitches. From <b>John</b>-Quiggly the initialization comes out as <b>J.-</b>Q. (capturing the hyphen), but maybe we can assume that use case away. I'll do some cleanup to reduce the impact of this on ordinary processing, then put up some tests for review.

fbennett commented 3 years ago

Unfortunately the final details are proving to be quite painful. In addition to abbreviations from full names, we also need to handle normalization of existing abbreviations (i.e. H.L.A. Hart -> HLA Hart, or E M Forster -> E.M. Forster, depending on style settings). Something is slightly out of whack in the markup engine I've built for it, and it's not working in both modes. I'll have to give this a rest and come back to it sometime later. It's probably a simple thing, operations in the wrong sequence, or misuse of a counter: but at the moment I just can't see it. Time to back off.

In any case, this is a tough problem, and floating the decision to support or to not support markup in names input to the list and to Cormac would be a good idea.
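
For readers unfamiliar with the two modes mentioned above, here is a small sketch of what normalizing existing abbreviations means when markup is left out of the picture. `normalizeInitials` is a hypothetical function, not a citeproc-js API, and its second argument stands in for the value of the CSL `initialize-with` attribute, which is appended after each initial. The hard part described in this thread, keeping tag positions correct while doing this, is deliberately not attempted here.

```javascript
// Hypothetical sketch; ignores markup entirely.
// `initializeWith` plays the role of the CSL initialize-with attribute value.
function normalizeInitials(given, initializeWith) {
  return given
    .split(/[.\s]+/)   // break on periods and whitespace
    .filter(Boolean)   // drop empty pieces left by trailing periods
    .map((token) => Array.from(token)[0] + initializeWith)
    .join("")
    .trim();
}

console.log(normalizeInitials("H.L.A.", "") + " Hart");   // HLA Hart
console.log(normalizeInitials("E M", ".") + " Forster");  // E.M. Forster
console.log(normalizeInitials("John Quiggly", ". "));     // J. Q.
```

The same function covers both full names and existing initials, but the amount of text removed or added differs between the two, so any saved tag offsets have to shift by a different amount in each mode.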

bwiernik commented 3 years ago

Let me also see if pandoc has similar issues.

fbennett commented 3 years ago

A little more time spent on the issue convinced me that, at least in the current implementation of name abbreviation in citeproc-js, a robust implementation of in-field markup for names is not feasible. It would be possible to do something less ambitious, like applying one or more formatting elements to the entire name, but I'll just leave things as they are for now.
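
The less ambitious option could look something like the sketch below. It is only an illustration with hypothetical names (`initialize`, `initializeWholeNameMarkup`), limited to `<b>` and `<i>`: wrappers that enclose the entire field are peeled off, the bare text is initialized, and the wrappers are put back. Anything with markup on only part of the name is reported as unsupported by returning `null`.

```javascript
// Hypothetical sketch; not citeproc-js code.

// Naive initialization: keep the first letter of each word and add a period.
function initialize(text) {
  return text.replace(/(\p{L})\p{L}*\.?/gu, "$1.");
}

function initializeWholeNameMarkup(input) {
  const wrappers = [];
  let text = input;
  for (;;) {
    const m = /^<(b|i)>(.*)<\/\1>$/s.exec(text);
    // Stop if there is no outer pair, or if the "pair" is really two
    // separate spans such as <b>John</b> <b>Paul</b>.
    if (m === null || m[2].includes(`<${m[1]}>`) || m[2].includes(`</${m[1]}>`)) break;
    wrappers.push(m[1]);
    text = m[2];
  }
  // Markup that covers only part of the name is out of scope for this approach.
  if (/<\/?(?:b|i)>/.test(text)) return null;
  let out = initialize(text);
  for (const tag of wrappers.reverse()) out = `<${tag}>${out}</${tag}>`;
  return out;
}

console.log(initializeWholeNameMarkup("<b>John Quiggly</b>")); // <b>J. Q.</b>
console.log(initializeWholeNameMarkup("<b>John</b> Paul"));    // null
```

Because the abbreviation step only ever sees a plain string, this variant leaves the existing string-based logic untouched.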

bwiernik commented 3 years ago

I think that applying the formatting to the entire name would cover >95% of use cases. Would that be feasible to implement?

Beyond that, I can only imagine sub parts of a name having markup for institutional, not personal names.

fbennett commented 3 years ago

Stubbornness may win out. I have an idea for another approach. More later ...

fbennett commented 3 years ago

Cracked it! Code is now passing all existing tests, and is confirmed to work for edge cases with markup. Will clean this up and push soon.

How did pandoc fare on this? Had it already been solved there?

fbennett commented 3 years ago

And done, at https://github.com/Juris-M/citeproc-js/commit/ce5ba68148ff7f8a86fb622c6dbbd2b4f6129ada, under release tag 1.4.54.

bwiernik commented 3 years ago

You are amazing! Didn't get a chance to investigate pandoc. Working on midtenure review packet

fbennett commented 3 years ago

Grabbed a static copy of pandoc and did a check. It escapes the markup, so this is displayed in the browser (i.e. the tags themselves show, and boldface is not applied):

denismaier commented 3 years ago

Have you tried using markdown instead?

bwiernik commented 3 years ago

With CSL YAML, pandoc silently drops both HTML and Markdown markup from names. With CSL JSON, it passes both through as literal text. Markup on regular fields is parsed if in the matching syntax (HTML-JSON / Markdown-YAML) and passed through otherwise.

fbennett commented 3 years ago

A Jurism release (5.0.93m15) with support for markup in name fields is now available.