brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

JATS multiple authors #1648

Open rgieseke opened 3 years ago

rgieseke commented 3 years ago

I wonder whether author handling could be improved by splitting on the comma, as sketched below, instead of splitting on the space " ".

Assuming in LaTeX:

\author{Smith, Joe \and De Los Reyes, Carlos}

Would then give

<contrib-group>
  <contrib contrib-type="author">
    <name>
      <given-names>Joe</given-names>
      <surname>Smith</surname>
    </name>
  </contrib>
  <contrib contrib-type="author">
    <name>
      <given-names>Carlos</given-names>
      <surname>De Los Reyes</surname>
    </name>
  </contrib>
</contrib-group>

Not sure whether the LaTeX example is general enough, though (found this recommended on TeX Stack Exchange). The current setup would instead swap the name parts and run the multi-word surname together:

<contrib-group>
  <contrib contrib-type="author">
    <name>
      <surname>Joe</surname>
      <given-names>Smith</given-names>
    </name>
  </contrib>
  <contrib contrib-type="author">
    <name>
      <surname>Carlos</surname>
      <given-names>DeLosReyes</given-names>
    </name>
  </contrib>
</contrib-group>

Proposed xsl:

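<!-- Assumes the EXSLT strings namespace is declared on the stylesheet:
     xmlns:str="http://exslt.org/strings" -->
<xsl:template match="ltx:personname">
  <name>
    <given-names>
      <!-- Last comma-separated token; trim the space left after the comma -->
      <xsl:for-each select="str:tokenize(./text(),',')">
        <xsl:if test="position()=last()">
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:if>
      </xsl:for-each>
    </given-names>
    <surname>
      <!-- Every token before the last one -->
      <xsl:for-each select="str:tokenize(./text(),',')">
        <xsl:if test="position()!=last()">
          <xsl:value-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </surname>
  </name>
</xsl:template>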
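
For comparison, the same split can be sketched outside XSLT. This is a minimal Python sketch, not LaTeXML code: `split_personname` and `contrib_group` are hypothetical helper names, and it assumes each `\and`-separated author already arrives as its own personname string. It splits at the last comma, trims whitespace, and writes `<surname>` before `<given-names>`, which is the order the JATS `<name>` content model uses.

```python
import xml.etree.ElementTree as ET


def split_personname(text):
    """Split 'Surname, Given' at the last comma and trim whitespace.

    Returns (surname, given_names). Without a comma the whole string is
    returned as given names, as in the proposed template.
    """
    surname, _, given = text.rpartition(",")
    return surname.strip(), given.strip()


def contrib_group(personnames):
    """Build a <contrib-group> with one <contrib> per personname string."""
    group = ET.Element("contrib-group")
    for text in personnames:
        surname, given = split_personname(text)
        contrib = ET.SubElement(group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        # JATS puts <surname> before <given-names> inside <name>.
        if surname:
            ET.SubElement(name, "surname").text = surname
        if given:
            ET.SubElement(name, "given-names").text = given
    return group


# Assumption: \and has already produced one string per author.
group = contrib_group(["Smith, Joe", "De Los Reyes, Carlos"])
print(ET.tostring(group, encoding="unicode"))
```

Splitting at the last comma keeps a multi-word surname such as "De Los Reyes" intact, but it still guesses wrong for names with a trailing suffix like "Jr.".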

rgieseke commented 2 years ago

I found this proposal here: https://tex.stackexchange.com/questions/4805/whats-the-correct-use-of-author-when-multiple-authors

However, with LaTeXML 0.8.6 and only a single author present, as in:

\documentclass{article}

\author{Smith, Joe F.}

\title{Author test}

\begin{document}
\maketitle
A Test.
\end{document}

The trailing dot is removed; I believe this was introduced here: https://github.com/brucemiller/LaTeXML/pull/1628

<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/robert/Work/ems/tex-json"?>
<?latexml class="article"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Author test</title>
  <creator role="author">
    <personname>Smith, Joe F</personname>
  </creator>
  <para xml:id="p1">
    <p>A Test.</p>
  </para>
</document>

That's probably a separate issue, caused by overly strict sanitization?

rgieseke commented 2 years ago

Thinking a bit more about the original proposal above, maybe it's not possible to accommodate all the ways of putting an author into TeX? Maybe using <string-name> is a safer way?

https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/string-name.html

With names like "Abernathy, the Honorable Sir Edward Sammy Davis, Jr.", there will probably always be edge cases.
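
One way to picture the <string-name> fallback is the small Python sketch below. It is not LaTeXML code: `name_element` is a hypothetical helper, and the rule it applies (exactly one comma with text on both sides counts as a plain "Surname, Given" name) is only an assumption made for illustration. Anything else is kept verbatim in <string-name> rather than guessed at.

```python
import xml.etree.ElementTree as ET


def name_element(personname):
    """Return <name> for a plain 'Surname, Given' string, else <string-name>.

    Heuristic (an assumption for this sketch): exactly one comma with
    non-empty text on both sides counts as 'plain'.
    """
    parts = [p.strip() for p in personname.split(",")]
    if len(parts) == 2 and all(parts):
        name = ET.Element("name")
        ET.SubElement(name, "surname").text = parts[0]
        ET.SubElement(name, "given-names").text = parts[1]
        return name
    # Anything else is preserved as written instead of being split.
    fallback = ET.Element("string-name")
    fallback.text = personname.strip()
    return fallback


for author in ["Smith, Joe F.",
               "Abernathy, the Honorable Sir Edward Sammy Davis, Jr."]:
    print(ET.tostring(name_element(author), encoding="unicode"))
```

The first author comes out as a structured <name>, while the second keeps its full text inside <string-name>.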

brucemiller commented 2 years ago

Yeah, doing it in XSLT is already too late. JATS should really be working with a more BibTeX-close form, rather than trying to reverse engineer the formatted bibitem. That form does exist, if only momentarily, within LaTeXML's MakeBibliography, but (as does BibTeX) it conflates the extraction of needed bib entries with their formatting, so they never get exposed to the JATS stylesheet.

There's a complex PR #1231 which I'll be working on soon(!) and I hope that I can address preserving both the formatted & semantic forms in the process. This should allow improving the JATS bibliographies.

rgieseke commented 2 years ago

Oh yes, with the bibitems it's probably even more complex; I was just thinking about the \author entries.

brucemiller commented 2 years ago

Well, the authors are the most egregiously wrong part of LaTeXML's output :> But I assume that in the long run, JATS wants as much of the semantic metadata as we can supply.