acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

tex-math build warnings #1244

Open mjpost opened 3 years ago

mjpost commented 3 years ago

Builds complain with the following errors. Can we:

  1. Manually fix them
  2. (Time-permitting) update the ingestion code to handle them better? (This is less important and I don't want it to be a blocker, but maybe some of these are easy)

WARNING  Unknown TeX-math command: \sim9k
WARNING  Unknown TeX-math command: \cal
WARNING  Unknown TeX-math command: \#Turki
WARNING  Unknown TeX-math command: \Pr
WARNING  Unknown TeX-math command: \rm{P}
WARNING  Unknown TeX-math command: \rm{Y}
WARNING  Unknown TeX-math command: \text{Petersen}
WARNING  Unknown TeX-math command: \text{CAPTURE}
WARNING  Unknown TeX-math command: \text{Schnabel}

davidweichiang commented 3 years ago

Do we know which papers are producing the errors? (If not, better error reporting should be a to-do as well.)

davidweichiang commented 3 years ago

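To do:

* [ ]  tex-math errors should have line numbers
* [ ]  parsing latex should know that numbers aren't part of control sequences, e.g., `\sim9k`.
* [ ]  Allow `\text`, `\textrm`, etc.
* [ ]  Allow `\rm`, `\cal`, etc. (but nothing we can do about it if used incorrectly, which is usually)
* [ ]  `a_\alpha` doesn't subscript the alpha correctly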

mjpost commented 3 years ago

No, I just grepped in the XML dir a bit.

davidweichiang commented 3 years ago

I can't find where the actual error is coming from (grep TeX-math turns up nothing) -- does anyone know?

mjpost commented 3 years ago

Here I think

davidweichiang commented 3 years ago

Thanks. It's suboptimal that we have two homegrown LaTeX parsers in the codebase (anthology/texmath.py and latex_to_unicode.py), and also that one of them depends on anthology/latexcodec.py and the other depends on the public latexcodec library.

davidweichiang commented 3 years ago

@mbollmann I think I need your help with some of the to-dos above.

mbollmann commented 3 years ago

* [ ]  tex-math errors should have line numbers

Adding line numbers is tricky with the current implementation. I didn't think they were necessary, as I expected us to add support in the code for unsupported expressions as they come up, rather than change the TeX in the actual abstracts (is that really a good idea?).
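
Until the parser itself reports positions, the offending papers could be located with a small standalone scan. The following is only a sketch, not the Anthology's code: it assumes math is stored as single-line `<tex-math>...</tex-math>` elements, and `KNOWN` and `find_unknown_commands` are hypothetical names with a placeholder whitelist.

```python
import re

# Assumption: each math span sits on one line inside <tex-math>...</tex-math>.
TEX_MATH = re.compile(r"<tex-math>(.*?)</tex-math>")
# A control word is a backslash followed by letters only.
CONTROL_WORD = re.compile(r"\\[A-Za-z]+")

# Hypothetical whitelist; the parser's real set of supported commands
# is not shown in this thread.
KNOWN = {"\\alpha", "\\sim", "\\frac"}


def find_unknown_commands(xml_text, known=KNOWN):
    """Return (line_number, command) pairs for control words not in `known`."""
    hits = []
    for lineno, line in enumerate(xml_text.splitlines(), start=1):
        for math in TEX_MATH.findall(line):
            for cmd in CONTROL_WORD.findall(math):
                if cmd not in known:
                    hits.append((lineno, cmd))
    return hits
```

Running this over each XML file and printing the file name next to every hit would give roughly the file-and-line report asked for above, without touching the parser.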

* [ ]  parsing latex should know that numbers aren't part of control sequences, e.g., `\sim9k`.

That looks like a bug in TexSoup, though it means we probably have to handle this manually.
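
One way to handle this manually is to normalize the input before it reaches the parser. A minimal sketch, assuming a preprocessing step is acceptable (the helper name `separate_digits` is hypothetical and this does not touch TexSoup itself): in TeX, a control word consists of letters only, so a digit directly after one can safely be split off with a space.

```python
import re

# A control word (backslash + letters) that is immediately followed by a digit.
CONTROL_WORD_THEN_DIGIT = re.compile(r"(\\[A-Za-z]+)(?=\d)")


def separate_digits(tex):
    """Insert a space between a control word and a digit that directly follows it."""
    return CONTROL_WORD_THEN_DIGIT.sub(r"\1 ", tex)
```

With this, `\sim9k` becomes `\sim 9k`, which TeX reads identically, while input without a letter-then-digit boundary is left unchanged.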

* [ ]  Allow `\text`, `\textrm`, etc.

That can be added quite easily to `TexMath._parse_command`; I can do that (I'd just need to check what CSS styling it should get).
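
As a rough illustration of what such support might look like, here is a sketch of a lookup from text-mode commands to CSS classes. Everything in it is an assumption: `TEXT_COMMAND_CLASSES`, `render_text_command`, and the class names are placeholders, and the actual structure of `TexMath._parse_command` is not shown in this thread.

```python
from html import escape

# Hypothetical mapping; the class names are placeholders, since the
# appropriate CSS styling still has to be decided.
TEXT_COMMAND_CLASSES = {
    "text": "tex-text",
    "textrm": "tex-textrm",
    "textit": "tex-textit",
    "textbf": "tex-textbf",
}


def render_text_command(name, content):
    """Wrap `content` in a styled span, or return None for unsupported commands."""
    css_class = TEXT_COMMAND_CLASSES.get(name)
    if css_class is None:
        return None  # caller can fall back to the existing warning
    return f'<span class="{css_class}">{escape(content)}</span>'
```

Returning `None` for anything outside the table keeps commands such as `\rm` on the existing warning path.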

* [ ]  Allow `\rm`, `\cal`, etc. (but nothing we can do about it if used incorrectly, which is usually)

That's probably out of the scope of what our parser should be expected to handle.

* [ ]  `a_\alpha` doesn't subscript the alpha correctly

Right. The current solution for that is unfortunately quite ugly as TexSoup doesn't handle that (and I don't know of a TeX parsing library that does) :( ... I can look into making it smarter.
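
For reference, TeX takes the next single token or braced group as the argument of `_` or `^`, so `a_\alpha` subscripts the whole `\alpha`. A minimal regex sketch of that rule follows; it is not the current implementation, it does not handle nested braces, and `split_scripts` is a hypothetical helper name.

```python
import re

# One script argument: a braced group (no nesting), a control word,
# a control symbol such as \#, or a single other character.
ARGUMENT = r"(\{[^{}]*\}|\\[A-Za-z]+|\\.|[^\s{}])"
SCRIPT = re.compile(r"([_^])\s*" + ARGUMENT)


def split_scripts(tex):
    """Return (operator, argument) pairs for each sub/superscript in `tex`."""
    return [(m.group(1), m.group(2)) for m in SCRIPT.finditer(tex)]
```

For `a_\alpha` this yields the pair `("_", r"\alpha")`, treating the control word as one unit instead of taking only the backslash or the first letter.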