TU-Berlin / project-mlp

a machine learning approach for processing mathematical language in scientific documents

Complete unicode2tex map #15

Closed: physikerwelt closed this issue 9 years ago

physikerwelt commented 9 years ago

LaTeXML uses a lot of Unicode in its MathML output. To convert it back to \TeX, we started with a unicode->tex map. However, this map is not complete. For example, \mathbb{r} was missing https://github.com/TU-Berlin/project-mlp/blob/master/mlp/src/main/java/mlp/text/UnicodeMap.java#L2349 and the corresponding Unicode character was probably generated here: https://github.com/brucemiller/LaTeXML/blob/be53d257cbfccad7858fe8261ff39aabf0283c75/lib/LaTeXML/Package/bbold.sty.ltxml

@brucemiller, @dginev: is there a list of all Unicode symbols used by LaTeXML?

dginev commented 9 years ago

I think there is no list; they are sprinkled around the codebase. You'd have to look for DefMath(, DefMathLigature( and DefLigature(, three macros that define UTF-8 mappings.
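A minimal sketch of that search, assuming the LaTeXML binding files have already been read into a string. The regular expression and the function name `find_macro_lines` are illustrative assumptions, not part of LaTeXML; only the three macro names come from the comment above.

```python
import re

# The three macro names mentioned above; the pattern itself is an assumption.
MACRO_CALL = re.compile(r"\b(DefMathLigature|DefMath|DefLigature)\s*\(")


def find_macro_lines(source: str):
    """Return (line_number, macro_name, stripped_line) for each line calling a macro."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = MACRO_CALL.search(line)
        if match:
            hits.append((number, match.group(1), line.strip()))
    return hits


# Hypothetical excerpt in the style of a binding file, for illustration only.
SAMPLE_SOURCE = "\n".join([
    "DefMath('\\foo', 'x');",
    "# a comment line",
    "DefLigature(qr/a/, 'b');",
    "DefMathLigature('c');",
])

for hit in find_macro_lines(SAMPLE_SOURCE):
    print(hit)
```

Running this over every file under the LaTeXML `lib` directory would give the candidate lines to inspect by hand; it does not interpret the Perl, so the Unicode characters still have to be pulled out of each hit.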

brucemiller commented 9 years ago

On 09/30/2015 12:36 PM, Moritz Schubotz wrote:

LaTeXML uses a lot of unicode in the MathML output. To convert it back to \TeX we started with a unicode->tex map to convert unicode to tex.

If you're trying to convert LaTeXML's MathML back to TeX, you know it's already stored on the alttext attribute of m:math, and on the tex attribute of ltx:XMath, right?
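As a rough illustration of the first suggestion, the TeX source can be read straight from the `alttext` attribute of each MathML `math` element. This sketch uses only Python's standard library; the function name `alttexts` and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def alttexts(xml_text: str):
    """Collect the alttext attribute of every MathML math element that has one."""
    root = ET.fromstring(xml_text)
    return [
        math.get("alttext")
        for math in root.iter(f"{{{MATHML_NS}}}math")
        if math.get("alttext") is not None
    ]


# Hypothetical document fragment, for illustration only.
SAMPLE_DOC = (
    '<p xmlns:m="http://www.w3.org/1998/Math/MathML">'
    '<m:math alttext="\\mathbb{R}"><m:mi>\u211d</m:mi></m:math>'
    "<m:math><m:mi>x</m:mi></m:math>"
    "</p>"
)

print(alttexts(SAMPLE_DOC))
```

This only helps for MathML that carries the attribute, which is why it does not cover input from other sources.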

If you're converting MathML from other sources, I'd be surprised if there isn't an XSLT floating around the web somewhere (likely written by David Carlisle, if so).

As to the other part: As Deyan says, there's no list in LaTeXML.

physikerwelt commented 9 years ago

Thanks for your reply. I'm converting back from different sources, which also include LaTeXML-generated MathML. The current UTF-8 map that I found on the web already includes more than 2000 Unicode-to-TeX mappings. The goal of this project is not to convert everything back to TeX, but only to convert back all mathematical identifiers. While I have to admit that it's not clear what a mathematical identifier is, we started with all mi elements as a zeroth-order approximation. To capture all UTF-8 symbols generated by LaTeXML, I'll 1) grep the LaTeXML code, 2) generate a list of all UTF-8 symbols, 3) write tests that each generate only one particular UTF-8 symbol from TeX input, and 4) check whether all UTF-8 symbols were reached from the TeX input, and iterate. Suggestions are welcome.
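Steps 2 and 4 of this plan could be sketched roughly as follows. The helper names and the tiny example map are hypothetical; the real map has more than 2000 entries, and treating every non-ASCII character as a "symbol" is a simplifying assumption.

```python
def non_ascii_symbols(text: str) -> set:
    """Step 2: collect every non-ASCII character that occurs in the given text."""
    return {ch for ch in text if ord(ch) > 127}


def unmapped(symbols, unicode2tex) -> list:
    """Step 4: list the symbols that the unicode->tex map does not cover yet."""
    return sorted(s for s in symbols if s not in unicode2tex)


# Hypothetical one-entry map and input text, for illustration only.
EXAMPLE_MAP = {"\u211d": "\\mathbb{R}"}
found = non_ascii_symbols("x \u2208 \u211d")
print(unmapped(found, EXAMPLE_MAP))
```

Each symbol reported by `unmapped` is a candidate for a new map entry, and the loop repeats until the list is empty.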

physikerwelt commented 9 years ago

@brucemiller ... and of course you were right: David Carlisle wrote an XSLT, http://www.w3.org/Math/characters/unicode.xml, which was the origin of the mappings I already have. They were generated by this snippet: https://gist.github.com/piquadrat/798546

dginev commented 9 years ago

Btw, a simple Google search brought me to this page: http://milde.users.sourceforge.net/LUCR/Math/

Just in case you haven't found it. In this process, if you find UTF-8 symbols that are incorrectly connected to the TeX commands, or you find missing TeX commands, please let us know and we'll add them to LaTeXML. Good luck!

physikerwelt commented 9 years ago

@dginev @brucemiller, I finally arrived at the desired map: https://github.com/physikerwelt/utf8tex/blob/master/unicode2tex.csv Thanks for your help. This map was generated using https://github.com/physikerwelt/utf8tex/blob/master/xml2csv.xslt. For each character in the W3C unicode.xml file, it prefers the unicode-math LaTeX package command over the math-latex field (which uses other packages), and that field over the latex field. If you are interested, I can create a list of which commands render with LaTeXML.
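The preference order can be illustrated with a small Python sketch. This is not the xml2csv.xslt stylesheet itself. It assumes a simplified view of unicode.xml in which each `character` element carries a `dec` attribute and optional `latex` and `mathlatex` children, with `set="unicode-math"` marking the unicode-math command; the sample data and function names are made up for the example.

```python
import xml.etree.ElementTree as ET


def preferred_tex(character):
    """Pick unicode-math first, then any other mathlatex field, then latex."""
    math_fields = character.findall("mathlatex")
    candidates = (
        [m for m in math_fields if m.get("set") == "unicode-math"]
        + [m for m in math_fields if m.get("set") != "unicode-math"]
        + character.findall("latex")
    )
    for element in candidates:
        if element.text and element.text.strip():
            return element.text.strip()
    return None


def build_map(xml_text: str) -> dict:
    """Map each character (possibly several code points) to its preferred TeX command."""
    mapping = {}
    for character in ET.fromstring(xml_text).iter("character"):
        tex = preferred_tex(character)
        dec = character.get("dec")
        if tex is None or dec is None:
            continue
        # Assumed format: code points in decimal, joined by "-" for sequences.
        symbol = "".join(chr(int(part)) for part in dec.split("-"))
        mapping[symbol] = tex
    return mapping


# Hypothetical, heavily simplified excerpt in the style of unicode.xml.
SAMPLE_XML = r"""<charlist>
<character id="U0211D" dec="8477"><latex>\mathbb{R}</latex><mathlatex set="unicode-math">\BbbR</mathlatex></character>
<character id="U003B1" dec="945"><latex>\alpha</latex></character>
<character id="U000A0" dec="160"></character>
</charlist>"""

print(build_map(SAMPLE_XML))
```

Characters with no TeX field at all are skipped rather than mapped to an empty string, so the resulting map only contains usable entries.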

dginev commented 9 years ago

I think having a list of commands that succeed/fail with LaTeXML will be very valuable for us; then we can make sure all of them work. Thanks a lot for offering!

dginev commented 9 years ago

And the obvious next remark would be to just document the list and make it public; tools such as KaTeX and MathJax may also benefit from the map.

physikerwelt commented 9 years ago

Technically, the issue is resolved: https://github.com/TU-Berlin/mathosphere/blob/master/mathosphere-core/src/main/resources/unicode2tex.csv Testing will be performed in a second step... developing at a snail's pace...