Open mjpost opened 3 years ago
Do we know which papers are producing the errors? (If not, better error reporting should be a to-do as well.)
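To do:
* [ ] tex-math errors should have line numbers
* [ ] parsing latex should know that numbers aren't part of control sequences, e.g., `\sim9k`.
* [ ] `$super long text$` should not be parsed as math (it's probably just dollar signs)
* [ ] Allow `\text`, `\textrm`, etc.
* [ ] Allow `\rm`, `\cal`, etc. (but nothing we can do about it if used incorrectly, which is usually)
* [ ] `a_\alpha` doesn't subscript the alpha correctly

No, I just grepped in the XML dir a bit.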
I can't find where the actual error is coming from (`grep TeX-math` turns up nothing) -- does anyone know?
Thanks. It's suboptimal that we have two homegrown LaTeX parsers in the codebase (`anthology/texmath.py` and `latex_to_unicode.py`), and also that one of them depends on `anthology/latexcodec.py` and the other depends on the public latexcodec library.
@mbollmann I think I need your help with some of the to-dos above.
* [ ] tex-math errors should have line numbers
Adding line numbers is tricky with the current implementation. I didn't think they were necessary as I didn't expect us to ever want to change the TeX in the actual abstracts (is that really a good idea?), but rather maybe add support to the code for the unsupported expressions when they come up.
* [ ] parsing latex should know that numbers aren't part of control sequences, e.g., `\sim9k`.
That looks like a bug in TexSoup, though it means we probably have to handle this manually.
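For reference, handling this manually could look roughly like the sketch below. It is not TexSoup code and not what `anthology/texmath.py` currently does; `split_control_sequences` is a hypothetical helper. It only relies on the TeX rule that a control word is a backslash followed by letters, so a digit always ends the name.

```python
import re

# A control word is a backslash followed by ASCII letters only; a control
# symbol is a backslash followed by exactly one other character.
CONTROL_SEQUENCE = re.compile(r"\\(?:[A-Za-z]+|.)", re.DOTALL)


def split_control_sequences(tex):
    """Split `tex` into control sequences and the plain text between them.

    Hypothetical helper for illustration; it does not skip the spaces that
    TeX ignores after a control word.
    """
    tokens = []
    pos = 0
    for match in CONTROL_SEQUENCE.finditer(tex):
        if match.start() > pos:
            # Plain text between the previous token and this control sequence.
            tokens.append(tex[pos:match.start()])
        tokens.append(match.group())
        pos = match.end()
    if pos < len(tex):
        tokens.append(tex[pos:])
    return tokens


print(split_control_sequences(r"\sim9k"))
```

With this rule `\sim9k` splits into `\sim` followed by the text `9k`, instead of being read as one unknown command.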
* [ ] Allow `\text`, `\textrm`, etc.
That can be added quite easily to `TexMath._parse_command`; I can do that (I'd just need to check what CSS styling it should get).
* [ ] Allow `\rm`, `\cal`, etc. (but nothing we can do about it if used incorrectly, which is usually)
That's probably out of the scope of what our parser should be expected to handle.
* [ ] `a_\alpha` doesn't subscript the alpha correctly
Right. The current solution for that is unfortunately quite ugly as TexSoup doesn't handle that (and I don't know of a TeX parsing library that does) :( ... I can look into making it smarter.
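One possible direction for making it smarter, as a sketch only: read the argument of `_` or `^` straight from the raw string before handing anything to TexSoup. In TeX that argument is a brace group, a control sequence, or a single character. `read_script_argument` is a hypothetical name, nothing like it exists in the codebase as far as this thread shows, and nested braces are deliberately not handled.

```python
import re

# Alternatives, tried in order: a brace group without nesting, a control
# word (backslash plus letters), a control symbol (backslash plus one
# character), or any single non-space character.
SCRIPT_ARGUMENT = re.compile(r"\{[^{}]*\}|\\[A-Za-z]+|\\.|\S")


def read_script_argument(tex, pos):
    """Return (argument, next_pos) for the script argument starting at `pos`.

    `pos` is the index right after the `_` or `^` character.
    Hypothetical helper for illustration; nested braces are not supported.
    """
    match = SCRIPT_ARGUMENT.match(tex, pos)
    if match is None:
        raise ValueError(f"no script argument at position {pos} in {tex!r}")
    return match.group(), match.end()


tex = r"a_\alpha + b"
print(read_script_argument(tex, tex.index("_") + 1))
```

Here the whole `\alpha` is returned as the subscript, rather than only the backslash or the first letter.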
Builds complain with the following errors. Can we: