Goal: extract math LaTeX from the .tex content available from arXiv.
Caveat when cloning this repo: Total download size is 640 MB.
Read latex-in-arxiv/postings_list/README.md
Everything is containerized, so in this repo (latex-in-arxiv/) use either make docker (for Linux) or make docmac (for Mac).
To run the application, within the Docker image run /opt/scanner.out .
To recompile the scanner, within the Docker image run
cd latex-in-arxiv/src/postings_list/query
make scanner
make read_tf_idf
./scanner.out .
./scanner.out . offsets
./read_tf_idf.out tf_idf # the vocabulary for TF-IDF uses the tokens from the parsed LaTeX
# TF-IDF is used to identify the most relevant variable in a paper, i.e. the one whose definition is worth finding
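To make the TF-IDF step concrete, here is a minimal Python sketch of the textbook weighting (term frequency times log inverse document frequency) applied to token lists. It is not the code behind read_tf_idf.out, whose exact weighting is not described here; the function names `tf_idf` and `most_relevant` and the toy token lists are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score every token of every document with textbook TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears at least once
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        scores.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return scores


def most_relevant(doc_scores: dict[str, float]) -> str:
    """Return the token with the highest score in one (non-empty) document."""
    return max(doc_scores, key=doc_scores.get)


# Hypothetical token lists, standing in for tokens parsed from two papers
papers = [["a", "b", "c", "c"], ["a", "b", "x"]]
scores = tf_idf(papers)
```

Tokens that occur in every document (here `a` and `b`) score zero, so the variable that is specific to a paper rises to the top.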
Suppose you have a .tex file that contains math, like
\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
There's an expression, a+b=c, and an in-line variable, c.
How can the expression and the variables be extracted?
There are a few options for parsing LaTeX (see https://github.com/allofphysicsgraph/latex-in-arxiv/issues/14). The options that produce decent-quality results are also slow.
This repo uses Ragel to quickly parse LaTeX and find math.
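For readers who want to see what "finding math" means on the example above, here is a rough Python sketch that uses regular expressions. It is not the Ragel scanner from this repo and is far less robust: the list of display environments is an assumption, and `$$...$$`, `\(...\)`, `\[...\]`, comments, and nested environments are not handled.

```python
import re

# Display-math environments to look for (an assumed list, not the repo's grammar)
DISPLAY_ENVS = ("equation", "equation*", "align", "align*")

_env_alt = "|".join(re.escape(env) for env in DISPLAY_ENVS)
# \begin{env} ... \end{env}, where the closing name must match the opening one
DISPLAY_RE = re.compile(r"\\begin\{(" + _env_alt + r")\}(.*?)\\end\{\1\}", re.DOTALL)
# $...$ where neither dollar sign is escaped as \$
INLINE_RE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$", re.DOTALL)


def extract_math(tex: str) -> dict:
    """Return the display expressions and inline math found in a LaTeX string."""
    display = [m.group(2).strip() for m in DISPLAY_RE.finditer(tex)]
    # Blank out display environments so their bodies are not scanned again
    remainder = DISPLAY_RE.sub(" ", tex)
    inline = [m.group(1).strip() for m in INLINE_RE.finditer(remainder)]
    return {"display": display, "inline": inline}


sample = r"""
\begin{document}
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
"""
print(extract_math(sample))  # {'display': ['a+b = c'], 'inline': ['c']}
```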
https://www.cs.cornell.edu/projects/kddcup/datasets.html
In the directory latex-in-arxiv/get_sample_data, run
make get_sample_data
# curl "http://export.arxiv.org/api/query?search_query=all:rigorous%20derivation"
For details, see https://arxiv.org/help/bulk_data_s3
# s3cmd get s3://arxiv/src/arXiv_src_manifest.xml . --requester-pays
# s3cmd get s3://arxiv/src/arXiv_src_9912_001.tar . --requester-pays
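The search URL in the curl comment above can also be assembled programmatically. The sketch below only builds the URL and makes no network request; the helper name `build_query_url` and its parameters are hypothetical, and the endpoint is the one shown in the curl comment.

```python
from urllib.parse import quote

# Endpoint taken from the curl comment above
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(terms: str, field: str = "all") -> str:
    """Build an arXiv API search URL for the given search terms."""
    # quote() percent-encodes spaces as %20; safe=":" keeps the field separator literal
    query = quote(f"{field}:{terms}", safe=":")
    return f"{ARXIV_API}?search_query={query}"


url = build_query_url("rigorous derivation")
```

Quoting the terms in code avoids the shell-escaping problems that an unquoted `?` in the URL can cause on the command line.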