bhpayne closed this 1 year ago
I found a Python script that may resolve this issue:
https://cs-web.bu.edu/faculty/gacs/software/de-macro/de-macro
I tested it on one file; the script ran and created a new file with the macros expanded.
Unfortunately, the re.findall expression will only work for simple cases.
Formally, regular expressions cannot match balanced parentheses, because balanced delimiters do not form a regular language: https://en.wikipedia.org/wiki/Regular_language
Matching them requires a pushdown automaton: https://eecs.wsu.edu/~ananth/CptS317/Lectures/PDA.pdf
On a left paren, or an opening expression such as \begin{equation}, push onto the stack; on a right paren, or a closing expression such as \end{equation}, pop one off the stack. When the stack is empty, stop consuming characters.
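The push/pop procedure described above can be sketched in a few lines of Python. This is only an illustration, not code from the project: the function name `consume_balanced` and the `TOKEN` pattern are made up for this example, and it assumes that escaped braces (`\{`, `\}`) should be treated as literals and that comments and verbatim blocks can be ignored.

```python
import re

# Hypothetical token pattern: \begin{name}, \end{name}, a double backslash,
# an escaped brace, or a bare brace.
TOKEN = re.compile(r"\\begin\{([^{}]*)\}|\\end\{([^{}]*)\}|\\\\|\\[{}]|[{}]")

def consume_balanced(text, start=0):
    """Return the index just past the first balanced group at or after `start`.

    Raises ValueError when a closer does not match the most recent opener,
    or when the input ends while the stack is still non-empty.
    """
    stack = []
    for m in TOKEN.finditer(text, start):
        tok = m.group(0)
        if m.group(1) is not None:        # \begin{name}: push the name
            stack.append(m.group(1))
        elif tok == "{":                  # opening brace: push
            stack.append("{")
        elif m.group(2) is not None or tok == "}":
            expected = "{" if tok == "}" else m.group(2)
            if not stack or stack.pop() != expected:
                raise ValueError("mismatched %r at offset %d" % (tok, m.start()))
        else:
            continue                      # \\ or escaped brace: not a delimiter
        if not stack:                     # stack empty: stop consuming
            return m.end()
    raise ValueError("unbalanced: %d delimiter(s) left open" % len(stack))
```

Calling it repeatedly, each time starting from the previous return value, walks through consecutive groups such as the name and body of a \newcommand. Because the stack stores the environment name, a \begin{equation} closed by a stray } is reported as a mismatch instead of being silently accepted.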
This tokenizer works for expressions in which the left and right parens, brackets, or braces are balanced.

    from nltk.tokenize import SExprTokenizer

    tokenizer = SExprTokenizer(parens='{}', strict=True)
    tokenizer.tokenize(data)
In the case of LaTeX, approximately 50% of the files in the sample dataset (509 of 1019) do not have the same number of left braces as right braces; the same applies to parentheses.
    In [18]: for f_name in tqdm(files):
        ...:     try:
        ...:         data = read_file(path,f_name)
        ...:         if data.count('{') != data.count('}'):
        ...:             count+=1
        ...:     except Exception as e:
        ...:         print(e)
        ...:
    100%|█████████████████████████████████████████████████████████████████████████████████████████| 1019/1019 [00:01<00:00, 875.45it/s]

    In [19]: count
    Out[19]: 509

    In [20]: len(files)
    Out[20]: 1019
Because of the general complexity, different modes or island grammars can be used for error detection and correction.
https://github.com/ondrejivanic/antlr4-island-grammar
Do the left and right delimiters of the control sequences match?
Try to expand the macros. Does the issue still remain?
Compile the LaTeX file. What do the TeX logs show?
Ideally, the macros should be expanded and removed from the document prior to parsing. An easier workaround for now could be to hide the macros from the parsers that look for mathematical expressions.
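One way the "hide the macros" workaround could look is to blank out each \newcommand definition before the math parsers see the text. The sketch below is an assumption about how that might be done, not the project's code: `hide_macros`, `_skip_group`, `MACRO_START`, and `NAME` are hypothetical names, only \newcommand and \renewcommand (with optional `*`) are handled, and whitespace between the parts of a definition is not supported.

```python
import re

# Hypothetical patterns; \def, \DeclareMathOperator, etc. are out of scope here.
MACRO_START = re.compile(r"\\(?:re)?newcommand\*?(?![A-Za-z])")
NAME = re.compile(r"\\[A-Za-z@]+")

def _skip_group(text, pos, open_ch, close_ch):
    """Return the index just past the balanced group that opens at text[pos]."""
    depth = 0
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2                 # skip a backslash and the character after it
            continue
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced group starting at offset %d" % pos)

def hide_macros(text):
    """Replace macro definitions with spaces, keeping offsets and newlines."""
    out = list(text)
    pos = 0
    while True:
        m = MACRO_START.search(text, pos)
        if m is None:
            break
        end = m.end()
        try:
            name = NAME.match(text, end)
            if name:                                # \newcommand\foo...
                end = name.end()
            elif text.startswith("{", end):         # \newcommand{\foo}...
                end = _skip_group(text, end, "{", "}")
            while text.startswith("[", end):        # [nargs][default]
                end = _skip_group(text, end, "[", "]")
            if text.startswith("{", end):           # macro body
                end = _skip_group(text, end, "{", "}")
        except ValueError:
            pos = m.end()                           # unbalanced: leave untouched
            continue
        for i in range(m.start(), end):
            if out[i] != "\n":
                out[i] = " "
        pos = end
    return "".join(out)
```

Since the masked text has the same length and line breaks as the input, any match found afterwards (for example by a regex looking for \begin{equation}) still points at the correct offset in the original file. Definitions with unbalanced braces are left as they are, which seems the safer choice given how many files in the sample are unbalanced.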
Using the following regex fails if \begin{equation} is part of \newcommand