allofphysicsgraph / latex-in-arxiv

extract math latex from content in arxiv

getting math expressions from .tex files in arxiv - fails if \begin{equation} is part of newcommand #1

Closed bhpayne closed 1 year ago

bhpayne commented 4 years ago

The following regex fails if \begin{equation} is part of a \newcommand definition:

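import re

# Decode to text so the pattern sees the file's real backslashes and newlines,
# not the repr of a bytes object that str(data) would produce.
with open(this_file, 'rb') as f:
    data = f.read().decode('utf-8', errors='replace')
# Raw string avoids invalid-escape warnings; the amsmath environment is "multline".
resp = re.findall(r'\\begin\s*\{(?:eqnarray|equation|multline)\}.*?\\end\s*\{(?:eqnarray|equation|multline)\}',
                  data,
                  re.DOTALL)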
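
A minimal reproduction of the failure may help here. The sketch below applies the same kind of pattern to a made-up input in which the environment delimiters are wrapped in macros; the macro names \be and \ee and the sample text are assumptions chosen for illustration, not taken from the dataset.

```python
import re

# Same kind of pattern as above, written as a raw string over decoded text.
PATTERN = re.compile(
    r'\\begin\s*\{(?:eqnarray|equation|multline)\}'
    r'.*?end\s*\{(?:eqnarray|equation|multline)\}',
    re.DOTALL,
)

# Hypothetical input: the real equation is written with \be ... \ee.
sample = r"""
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\be
E = mc^2
\ee
"""

matches = PATTERN.findall(sample)
for m in matches:
    # The only match spans the two macro definitions, not the equation body.
    print(repr(m))
```

The pattern pairs the \begin{equation} inside the first definition with the \end{equation} inside the second, so the actual equation body is never captured.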

msgoff commented 3 years ago

I found a Python script, de-macro, that may resolve this issue:

https://cs-web.bu.edu/faculty/gacs/software/de-macro/de-macro

I tested it on one file: the script ran and created a new file with the macros expanded.

msgoff commented 1 year ago

Unfortunately, the re.findall expression will only work for simple cases.

Formally, regular expressions cannot match arbitrarily nested balanced parentheses, because that language is not regular: https://en.wikipedia.org/wiki/Regular_language

Matching them requires a pushdown automaton (https://eecs.wsu.edu/~ananth/CptS317/Lectures/PDA.pdf). On seeing a left paren, or an opening expression such as \begin{equation}, push it onto the stack; on seeing a right paren, or the closing expression \end{equation}, pop one off the stack. When the stack is empty, stop consuming characters.
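
The push/pop procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name extract_environments, the MATH_ENVS list, and the choice to skip an unmatched \end rather than raise are all hypothetical, not part of this repository.

```python
import re

# Environments to extract; this list is an assumption, extend as needed.
MATH_ENVS = ("equation", "eqnarray", "multline")

# One token per \begin{env} or \end{env}.
# Group 1 is "begin" or "end"; group 2 is the name, including an optional star.
TOKEN = re.compile(
    r'\\(begin|end)\s*\{((?:' + '|'.join(MATH_ENVS) + r')\*?)\}'
)


def extract_environments(text):
    """Return the outermost math environments found by stack-based matching."""
    stack = []   # (env_name, start_offset) for each currently open environment
    found = []
    for tok in TOKEN.finditer(text):
        kind, name = tok.group(1), tok.group(2)
        if kind == "begin":
            stack.append((name, tok.start()))
        elif stack and stack[-1][0] == name:
            _, start = stack.pop()
            if not stack:
                # Stack is empty again: one complete outermost environment.
                found.append(text[start:tok.end()])
        # An \end with no matching \begin on top of the stack is skipped.
    return found


example = (r"a \begin{equation} x \end{equation} b "
           r"\begin{eqnarray*} y \end{eqnarray*} \end{equation}")
print(extract_environments(example))
```

Note that this handles nesting and stray \end tokens, but it does not by itself fix the \newcommand case: delimiters that appear inside macro definitions are still pushed and popped like any others, so the macros need to be expanded or hidden first.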

The following tokenizer works well for input in which the parens, brackets, or braces are balanced:

from nltk.tokenize import SExprTokenizer
# strict=True raises ValueError when the braces do not balance
tokenizer = SExprTokenizer(parens='{}', strict=True)
tokenizer.tokenize(data)

For LaTeX, however, approximately 50% of the files in the sample dataset (509 of 1019) do not have the same number of left braces as right braces. The same applies to parentheses.

In [18]: for f_name in tqdm(files):
    ...:     try:
    ...:         data = read_file(path,f_name)
    ...:         if data.count('{') != data.count('}'):
    ...:             count+=1
    ...:     except Exception as e:
    ...:         print(e)
    ...:
100%|█████████████████████████████████████████████████████████████████████████████████████████| 1019/1019 [00:01<00:00, 875.45it/s]

In [19]: count
Out[19]: 509

In [20]: len(files)
Out[20]: 1019
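
A caveat on the raw count, offered as an assumption rather than a finding on this dataset: data.count also counts escaped braces (\{ and \}) and braces inside % comments, which TeX does not treat as grouping delimiters. A minimal sketch of a count that ignores those cases is below; the function name brace_balance is hypothetical, and verbatim-like environments are not handled.

```python
import re

# A backslash followed by any non-newline character (covers \{, \}, \\ and \%).
# Removing these first drops escaped braces and escaped percent signs.
ESCAPED = re.compile(r'\\.')
# Once escapes are gone, a % starts a comment that runs to the end of the line.
COMMENT = re.compile(r'%[^\n]*')


def brace_balance(text):
    """Return (unclosed_left, unmatched_right) for grouping braces only."""
    text = COMMENT.sub('', ESCAPED.sub('', text))
    depth = 0            # number of '{' currently waiting for a '}'
    unmatched_right = 0  # number of '}' seen with nothing open
    for ch in text:
        if ch == '{':
            depth += 1
        elif ch == '}':
            if depth:
                depth -= 1
            else:
                unmatched_right += 1
    return depth, unmatched_right


# Raw counts differ here (two '{', one '}'), yet the grouping is balanced.
snippet = r"\left\{ a \right. {b}"
print(snippet.count('{'), snippet.count('}'), brace_balance(snippet))
```

A file counts as balanced when both numbers are zero. Whether this changes the 509 of 1019 figure would need to be checked against the actual files.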

Because of the general complexity, different modes or island grammars can be used to do error detection and correction.

https://github.com/ondrejivanic/antlr4-island-grammar

Do the left and right delimiters of the control sequences match?

Try to expand the macros. Does the issue still remain?

Compile the LaTeX file. What do the TeX logs show?

msgoff commented 1 year ago

Ideally, the macros should be expanded and removed from the document prior to parsing. An easier workaround for now could be to hide the macro definitions from the parsers that look for mathematical expressions.
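
One way the hiding workaround might look is sketched below: each \newcommand or \renewcommand definition is overwritten with spaces, so character offsets and line numbers stay the same for any later parser. The helper names are hypothetical, \def is deliberately not handled, and optional arguments are parsed naively.

```python
import re

# Start of a macro definition; \def is left out because its syntax differs.
DEFINITION = re.compile(r'\\(?:re)?newcommand\*?')
# A macro name given without braces, as in \newcommand\be{...}.
NAME = re.compile(r'\\[A-Za-z@]+')


def _skip_group(text, i):
    """Given text[i] == '{', return the index just past its matching '}'."""
    depth = 0
    while i < len(text):
        ch = text[i]
        if ch == '\\':
            i += 2               # skip an escaped character such as \{ or \}
            continue
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return len(text)             # unbalanced: consume to the end


def hide_macro_definitions(text):
    """Blank out newcommand definitions, keeping length and newlines."""
    out, pos = [], 0
    for m in DEFINITION.finditer(text):
        if m.start() < pos:
            continue             # already inside a blanked definition
        end, groups = m.end(), 0
        # Consume the macro name, optional [..] arguments, then the body.
        while end < len(text) and groups < 2:
            name = NAME.match(text, end) if groups == 0 else None
            if text[end] == '{':
                end = _skip_group(text, end)
                groups += 1
            elif name:
                end = name.end()
                groups = 1
            elif text[end] == '[':
                close = text.find(']', end)
                end = close + 1 if close != -1 else len(text)
            elif text[end].isspace():
                end += 1
            else:
                break
        out.append(text[pos:m.start()])
        out.append(re.sub(r'[^\n]', ' ', text[m.start():end]))
        pos = end
    out.append(text[pos:])
    return ''.join(out)


sample = ("\\newcommand{\\be}{\\begin{equation}}\n"
          "\\newcommand{\\ee}{\\end{equation}}\n"
          "\\begin{equation}\nx\n\\end{equation}\n")
hidden = hide_macro_definitions(sample)
print(hidden)
```

Hiding the definitions only prevents false matches inside them. Equations that are written through the macros (for example \be ... \ee) would still be missed until the macros are expanded.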