allofphysicsgraph / latex-in-arxiv

extract math latex from content in arxiv

Extract the abstract syntax tree from a latex math expression #18

Open bhpayne opened 7 months ago

bhpayne commented 7 months ago

Latex is for presentation; it does not record what an expression means. Getting an abstract syntax tree for a Latex math expression is a critical step for searchability and semantic enrichment.

  1. Parse the Latex expression into symbols (like a, b, x_1, \vec{z}) and operators (*, \int, >)
  2. Create an abstract syntax tree for the symbols and operators
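
To make the two steps concrete, here is a rough sketch in plain Python. It is not how SymPy or the Physics Derivation Graph does it. The token pattern, the `Node` class, and the precedence table are all assumptions made up for illustration, and the sketch only covers single-letter symbols (optionally subscripted or wrapped in `\vec{}`), numbers, parentheses, and binary operators. Prefix operators like `\int` and unary minus are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical token pattern for a tiny subset of Latex math (not a full grammar).
TOKEN_RE = re.compile(r"""
    (?P<symbol>\\vec\{[A-Za-z]\}|[A-Za-z](?:_(?:\{[^{}]+\}|[A-Za-z0-9]))?)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<op>[-+*/=<>^])
  | (?P<paren>[()])
""", re.VERBOSE)

# Assumed precedence levels; higher binds tighter.
PRECEDENCE = {"=": 1, "<": 1, ">": 1, "+": 2, "-": 2, "*": 3, "/": 3, "^": 4}
RIGHT_ASSOC = {"^"}


@dataclass
class Node:
    value: str
    children: List["Node"] = field(default_factory=list)


def tokenize(latex: str) -> List[Tuple[str, str]]:
    """Step 1: split the string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(latex):
        if latex[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(latex, pos)
        if m is None:
            raise ValueError(f"unsupported Latex at position {pos}: {latex[pos:]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse(tokens: List[Tuple[str, str]]) -> Node:
    """Step 2: build a tree from the tokens using precedence climbing."""
    pos = 0

    def parse_atom() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        kind, text = tokens[pos]
        pos += 1
        if kind in ("symbol", "number"):
            return Node(text)
        if text == "(":
            inner = parse_expr(1)
            if pos >= len(tokens) or tokens[pos][1] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        raise ValueError(f"unexpected token {text!r}")

    def parse_expr(min_prec: int) -> Node:
        nonlocal pos
        left = parse_atom()
        while (pos < len(tokens) and tokens[pos][0] == "op"
               and PRECEDENCE[tokens[pos][1]] >= min_prec):
            op = tokens[pos][1]
            pos += 1
            next_prec = PRECEDENCE[op] if op in RIGHT_ASSOC else PRECEDENCE[op] + 1
            left = Node(op, [left, parse_expr(next_prec)])
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def to_sexpr(node: Node) -> str:
    """Render the tree as a prefix-notation string for inspection."""
    if not node.children:
        return node.value
    return "(" + " ".join([node.value] + [to_sexpr(c) for c in node.children]) + ")"


print(to_sexpr(parse(tokenize(r"a * x_1 + \vec{z} > b"))))
# prints: (> (+ (* a x_1) \vec{z}) b)
```

Anything outside the assumed subset raises a `ValueError` rather than guessing, which seems safer than silently producing a wrong tree.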

Caveat: scientists aren't consistent in their notation, so there might be conflicting ways to interpret a Latex math string.

Caveat: not all symbols to be identified are in the string. For example, "a b" might refer to "a multiplied by b", with the multiplication operator left implicit.
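
One possible way to handle the missing operator is a pass over the token list that inserts an explicit `*` wherever two operands sit next to each other. This is only a heuristic sketch: the function name `insert_implicit_multiplication` and the `OPERAND` pattern are made up for illustration, and the input is assumed to be already split into token strings.

```python
import re
from typing import List

# Hypothetical pattern for what counts as an operand token in this sketch.
OPERAND = re.compile(r"\\vec\{[A-Za-z]\}|[A-Za-z](?:_[A-Za-z0-9])?|\d+")


def insert_implicit_multiplication(tokens: List[str]) -> List[str]:
    """Insert '*' between adjacent operands, e.g. ['a', 'b'] -> ['a', '*', 'b']."""
    out: List[str] = []
    for tok in tokens:
        if out:
            prev = out[-1]
            # An operand can end with a symbol, a number, or a closing parenthesis.
            prev_ends_operand = prev == ")" or OPERAND.fullmatch(prev) is not None
            # An operand can start with a symbol, a number, or an opening parenthesis.
            tok_starts_operand = tok == "(" or OPERAND.fullmatch(tok) is not None
            if prev_ends_operand and tok_starts_operand:
                out.append("*")
        out.append(tok)
    return out


print(insert_implicit_multiplication(["a", "b"]))
# prints: ['a', '*', 'b']
```

The first caveat applies here too: this pass would read `f ( x )` as "f multiplied by x" even when the author meant function application, so the string alone is not enough to settle the interpretation.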

bhpayne commented 7 months ago

While an arbitrary AST would be great, currently the Physics Derivation Graph uses SymPy to check steps. SymPy's Latex-to-AST is impressive.

My struggles with SymPy:

and a reminder to myself on SymPy's process: https://physicsderivationgraph.blogspot.com/2020/08/how-to-edit-sympy-latex-parser-and.html