acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Correctly convert LaTeX to HTML #135

Closed davidweichiang closed 5 years ago

davidweichiang commented 5 years ago

Currently there are a few titles with LaTeX commands and special characters, which don't display correctly on the anthology website (for example, here.) This problem happens a lot more in abstracts, but as far as I can tell, those aren't displayed on the website.

@danielgildea has added latex.py presumably for this purpose, so I'm just creating an issue for it.

Some wrinkles to this issue:

danielgildea commented 5 years ago

One comment: unicode in bibtex files is generally not recommended because it breaks the sorting of author names.

Another gotcha is that in bibtex you have to say author = "J{\"u}rgen" or author = {J\"urgen} but not author = "J\"urgen"

That is why I used "latex.py" rather than the "latexcodec" library.
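
The bracing rule above can be sketched in a few lines of Python. This is only an illustration, not the code in latex.py: the function name `to_bibtex` and the small accent table are assumptions made for the example, and only a handful of accents are covered. It emits the brace-protected form (`{\"u}`), which is safe inside both quote-delimited and brace-delimited BibTeX fields.

```python
import unicodedata

# Combining marks -> LaTeX accent commands (a small, illustrative subset).
ACCENTS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis / umlaut
}


def to_bibtex(text):
    # Replace each accented letter with a brace-protected LaTeX escape.
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENTS:
            base, mark = decomposed
            out.append("{\\" + ACCENTS[mark] + base + "}")
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


print(to_bibtex("J\u00fcrgen"))  # J{\"u}rgen
```

Characters outside the table pass through unchanged, so a real converter would need a much fuller mapping.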

davidweichiang commented 5 years ago

It looks like JATS (https://jats.nlm.nih.gov/archiving/) uses <tex-math> for TeX formulas. Or <mml:math> for MathML, which actually doesn’t look so bad (Pandoc can convert to/from it, for instance).

mjpost commented 5 years ago

I'm happy to consider other proposals (since I could easily be missing something), but my vote is in favor of hosting properly-escaped raw BibTeX in the titles. While it makes sense to use UTF8 in the XML file, it might create some tricky conversion issues, especially when we encounter math, as you point out above. We are using XML as a more robust and extensible archival format for authoritative data, but we ingest from BibTeX, and BibTeX is our principal export format, so using raw BibTeX would avoid an error-prone double-conversion on titles. We would then need only worry about one conversion, when converting to MODS XML, which is used to generate the other citation formats.

mbollmann commented 5 years ago

I'm inclined to agree with Matt. If we interpret LaTeX in titles etc., that raises the need to constantly update the XML schema whenever new elements are introduced. For example, right now the schema allows <sup> because of a single paper that uses superscript "TM". And that's a simple case compared to, say, math formulas.

It would be easy to add BibTeX -> Unicode/HTML conversion to my Python wrapper in #133, and if we generated all non-BibTeX formats (website + MODS XML) from that wrapper, we could ensure consistency across all formats, and have a single place to fix any potential errors.

davidweichiang commented 5 years ago

My own inclination was to use Unicode outside of math formulas and TeX inside math formulas. For example, the superscript TM could have been represented by ™ (U+2122). That would have a simple schema. @mjpost's point about a pointless round-trip conversion stands, but on the other hand, consider that a minority of papers were never in LaTeX/BibTeX to begin with, and consider that arguably the main output format is HTML, not BibTeX.

But I agree that going all BibTeX is totally reasonable as well. It would require converting titles/abstracts that were not written in LaTeX/BibTeX into that format, though.

knmnyn commented 5 years ago

I tend to agree with David on this. Going UTF-8/16 will allow characters outside of what LaTeX handles to be rendered faithfully, without needing to worry about escapes in any meta markup. We now have papers that have emojis in their titles, and we already handle HanZi for Chinese titles from ROCLING using UTF-8, so from a standards point of view we are already using it.

mjpost commented 5 years ago

It does make sense to use the encoding most natural to the format; I hadn't thought of emojis in titles, for example. If we can have good library calls for the LaTeX ↔︎ XML conversions, that will reduce the problem to a single big effort (which may be mostly done) and maybe occasional handling of new items as they arise. UTF-8 and structured XML titles will also help us avoid situations where titles contain curly braces that aren't intended to protect case, for example.

I'm surprised there is only a single instance of <sup>: I'd expect to see exponents, for example, so this may just reflect the fact that most titles are still some mix of LaTeX and plain text. I think we could limit the set of tags we support in titles to a few small ones (emphasis, bold, fixed-case, sup, latex-equation), potentially expanding this here or there as the situations require, or otherwise just rendering them to plain text.

If we go this way, the TODO item then is to extend our library calls for the following to be aware of the new structured format:

(Edited for completeness based on conversation below, and to link to projects:)

davidweichiang commented 5 years ago

FWIW, before this conversation got going, I wrote a Python 3 script to convert XML+TeX to XML+Unicode. It uses the xml.etree module, so is likely to be more robust than anthologize.pl. It converts $...$ to the corresponding XML element, but doesn't yet convert {...}, \emph{...}, etc. to theirs.
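
A minimal sketch of the $...$ step, assuming `<tex-math>` as the target tag (the name used for math elsewhere in this thread). It is not the script mentioned above: the function name `convert_math` is hypothetical, and only the element's leading text is handled, not the tails of existing children.

```python
import re
import xml.etree.ElementTree as ET

MATH = re.compile(r"\$([^$]+)\$")


def convert_math(elem):
    """Wrap each $...$ span in elem.text in a <tex-math> child element."""
    parts = MATH.split(elem.text or "")  # [text, formula, text, formula, ..., text]
    if len(parts) == 1:
        return  # no math found
    elem.text = parts[0]
    new_children = []
    for formula, tail in zip(parts[1::2], parts[2::2]):
        child = ET.Element("tex-math")
        child.text = formula
        child.tail = tail
        new_children.append(child)
    # The new elements come from the leading text, so they precede existing children.
    for index, child in enumerate(new_children):
        elem.insert(index, child)


title = ET.fromstring("<title>Parsing in $O(n^3)$ time</title>")
convert_math(title)
print(ET.tostring(title, encoding="unicode"))
# <title>Parsing in <tex-math>O(n^3)</tex-math> time</title>
```

Note that the regular expression treats any pair of dollar signs as math delimiters, so a title containing two currency amounts would be misread as a formula.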

In addition to the three bullets above, don't you need one for Anthology XML to HTML, or is that done via MODS XML?

mjpost commented 5 years ago

Yes, you're right: but this will be done by an XML → YAML script, which is then fed into Hugo to populate HTML template pages. @mbollmann has already written that.

mjpost commented 5 years ago

I am updating my comment above to reflect the consensus view of the TODOs here. I have also linked to the projects and issues because I am having trouble keeping all these threads and tabs in my head.

jtrmal commented 5 years ago

-off-topic- (should be in #121, not #135)

mjpost commented 5 years ago

Hi Yenda—thanks for jumping in! Can you elaborate on what you mean by ingest? (ingest into what format)? I couldn't quite tell from your commit.

It will be worthwhile for you to sync up with @mbollmann. He is heading up the static rewrite. The current status is that he should have the project in a state more amenable to divvying out pieces in a week or so.

mbollmann commented 5 years ago

Re: the math rendering

I played around a bit with the <tex-math> instances and was not quite satisfied with any solution I could find. Pandoc mostly worked ok, but still failed to transform some instances that (in my opinion) have a reasonable representation in HTML (mostly involving \sqrt and \frac). Also, the contents of <tex-math> are mostly so simple that I was tempted to find out if I couldn't just write my own converter for them.

This is the result.

It takes care of subscripts/superscripts and converts math symbols to Unicode using this handy lookup table. If it can't interpret something, it is always preserved as a literal expression in the output, along with a warning during YAML generation. If you can access the beta site, you can see examples of generated output in C16-1261 or W17-1912.

Now, I'm not sure what your opinion is on the maintainability of a homebrew solution like this, and I'm totally willing to switch over to pandoc instead if you'd feel more comfortable with that. But I think it works pretty well and should be pretty robust. Opinions wanted. :)
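
The approach described above (subscripts/superscripts, a symbol lookup table, and literal preservation with a warning) might look roughly like the following. This is an independent sketch, not the converter linked in this comment: the five-entry `SYMBOLS` table and all function names are assumptions made for the example.

```python
import html
import re
import warnings

# Tiny illustrative lookup table (hypothetical); a real one would be much larger.
SYMBOLS = {r"\alpha": "α", r"\beta": "β", r"\lambda": "λ", r"\times": "×", r"\leq": "≤"}

# _x, ^x, _{...}, ^{...}, or a script whose body is a single command such as ^\lambda.
SCRIPT = re.compile(r"([_^])(?:\{([^{}]*)\}|(\\[A-Za-z]+|[^\\{}&]))")
COMMAND = re.compile(r"\\[A-Za-z]+")


def _script(match):
    tag = "sub" if match.group(1) == "_" else "sup"
    body = match.group(2) if match.group(2) is not None else match.group(3)
    return f"<{tag}>{body}</{tag}>"


def _symbol(match):
    command = match.group(0)
    if command not in SYMBOLS:
        # Unknown input is preserved literally, with a warning.
        warnings.warn(f"unrecognized TeX command left as-is: {command}")
        return command
    return SYMBOLS[command]


def tex_math_to_html(tex):
    escaped = html.escape(tex, quote=False)  # protect &, <, > before adding tags
    with_scripts = SCRIPT.sub(_script, escaped)
    return COMMAND.sub(_symbol, with_scripts)


print(tex_math_to_html(r"x_i^2"))                    # x<sub>i</sub><sup>2</sup>
print(tex_math_to_html(r"\alpha \leq \beta^{n+1}"))  # α ≤ β<sup>n+1</sup>
print(tex_math_to_html(r"\sqrt{n}"))                 # \sqrt{n}, plus a warning
```

Nested braces and commands that take arguments are deliberately left alone here, which is where a library such as pandoc would earn its keep.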

davidweichiang commented 5 years ago

Did you look at tex4ht? Is that better or worse than pandoc? How about using pandoc to convert to MathML?

Can we fall back to MathJax instead of raw LaTeX?

mbollmann commented 5 years ago

tex4ht falls back to generating images when it can't represent a formula in pure HTML. However, it already does this for simple cases like \frac{m}{n}, which I believe can reasonably be represented as m⁄n. Also, invoking it and parsing the resulting HTML for the formula is quite involved for the trivially simple formulas we need to process. But yes, due to the image fallback I would actually prefer it to pandoc, I think.

MathML is not widely supported yet in browsers (https://caniuse.com/#feat=mathml) as far as I understand it, and therefore should always be combined with MathJax.

We could fall back to MathJax, but I really feel it isn't necessary at this point. Since we're only dealing with math inside titles and abstracts, the expected complexity of formulas is really minimal. For reference, here's the full list of formulas currently in the XML files (https://gist.github.com/mbollmann/52a7aa9f2392b6008d75bbb0f24f817a).

davidweichiang commented 5 years ago

Ouch, line 17. I had been thinking about classifying dollar signs as math delimiters vs. currency symbols and thought, surely no string would use a currency symbol twice? I am quickly learning that such assumptions are always wrong.
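
The \frac{m}{n} case raised in this exchange is simple enough to handle with a regular expression. The sketch below is an assumption-laden illustration (the names `frac_to_text` and `_wrap` are hypothetical): it joins numerator and denominator with U+2044 FRACTION SLASH and parenthesizes anything that is not a single token. Nested fractions are not handled.

```python
import re

FRAC = re.compile(r"\\frac\{([^{}]+)\}\{([^{}]+)\}")
FRACTION_SLASH = "\u2044"


def _wrap(arg):
    # Parenthesize anything that is not a single word-like token.
    return arg if re.fullmatch(r"\w+", arg) else f"({arg})"


def frac_to_text(tex):
    return FRAC.sub(lambda m: _wrap(m.group(1)) + FRACTION_SLASH + _wrap(m.group(2)), tex)


print(frac_to_text(r"\frac{m}{n}"))    # m⁄n
print(frac_to_text(r"\frac{a+b}{2}"))  # (a+b)⁄2
```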

knmnyn commented 5 years ago

Hi all:

It's great to try to automate everything, but it may also be good to have a human in the loop to check and correct some cases by hand. It may be better for the scripts to handle the 90-99% of normal cases and flag any suspicious ones to STDERR. Just a thought -- of course, that is what got the Anthology into the messy state it was in before Matt's directorship.
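
As a rough sketch of that flagging idea: the pattern below treats any leftover backslash command, brace, or dollar sign as a sign of unconverted LaTeX. The pattern, the function name `flag_suspicious`, and the paper IDs in the example are all made up for illustration.

```python
import re
import sys

# Hypothetical heuristic: leftover TeX commands, braces, or dollar signs.
SUSPICIOUS = re.compile(r"\\[A-Za-z]+|[{}$]")


def flag_suspicious(titles):
    """Return the IDs of titles that need a manual look, reporting each to STDERR."""
    flagged = []
    for paper_id, title in titles.items():
        match = SUSPICIOUS.search(title)
        if match:
            print(f"{paper_id}: suspicious {match.group(0)!r} in {title!r}", file=sys.stderr)
            flagged.append(paper_id)
    return flagged


# Made-up IDs and titles, for illustration only.
example = {
    "X00-0001": "A Plain Title",
    "X00-0002": r"A \textbf{Bold} Claim",
    "X00-0003": "From $5 to $10: Pricing Models",
}
print(flag_suspicious(example))  # ['X00-0002', 'X00-0003']
```

The third title is a false positive for math but a true positive for "needs a human", which is the point of routing such cases to a person rather than converting them automatically.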

davidweichiang commented 5 years ago

I'm closing this since it was decided to move LaTeX->HTML to ACLPUB.