allenai / scholarphi

An interactive PDF reader.
Apache License 2.0

Speed up entity bounding box detection #116

Open andrewhead opened 4 years ago

andrewhead commented 4 years ago

In the current version of the pipeline, the highest-accuracy entity detections come from detecting one entity at a time. This does not scale well: it leads to some papers taking an hour or more to process.

This issue proposes ways to speed up the detection of entities; ideas are collected in the comments below.

andrewhead commented 4 years ago

Also of note is this alternative technique for colorizing spans of text in LaTeX. I suspect it would run into the very same errors, with whatsits causing subtle changes in the spacing of text, though it might be worth looking into further: https://tex.stackexchange.com/a/116907/198728

andrewhead commented 4 years ago

One additional idea for scaling up the coloring is to copy over the output and auxiliary files from the uncolorized LaTeX before compiling the colorized code, with the hopes that only the last LaTeX compilation needs to be re-run.
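
A minimal sketch of what that copy could look like, assuming the uncolorized compile's auxiliary files are still on disk (the follow-up further down notes that AutoTeX deletes them by default, so this depends on working around that); the function and the extension list are illustrative, not the pipeline's actual code:

```python
import shutil
from pathlib import Path

# Auxiliary files LaTeX writes on one pass and reads back on the next
# (cross-references, bibliography, table of contents, and so on).
AUX_EXTENSIONS = {".aux", ".bbl", ".blg", ".toc", ".lof", ".lot", ".out"}

def copy_aux_files(uncolorized_dir: Path, colorized_dir: Path) -> None:
    """Seed the colorized sources with aux files from the uncolorized
    compile, in the hope that a single LaTeX pass then suffices."""
    for path in uncolorized_dir.iterdir():
        if path.suffix in AUX_EXTENSIONS:
            shutil.copy(path, colorized_dir / path.name)
```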

andrewhead commented 4 years ago

Another idea for speeding up entity bounding box detection is:

  1. Find the bounding boxes for each token (word or character) within a paper.
  2. Align those tokens to the LaTeX representation (e.g., by extracting sentences, splitting sentences in both the LaTeX and the PDF output, and finding the ones that align; see the sketch below).
  3. Skip the visual diff'ing in the processing pipeline. Instead, detect which characters in the LaTeX each entity spans, then look up the bounding boxes for the aligned tokens extracted from Grobid.
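
A minimal sketch of the sentence-alignment step (step 2), assuming plain-text sentences have already been extracted from both the LaTeX and the PDF; the function name and the use of difflib are my illustrative choices, not the pipeline's:

```python
from difflib import SequenceMatcher

def best_match(latex_sentence: str, pdf_sentences: list[str]) -> int:
    """Return the index of the PDF-extracted sentence most similar to a
    sentence found in the LaTeX. Math will not match its rendered form
    exactly, so take the highest fuzzy-similarity score rather than
    requiring equality."""
    return max(
        range(len(pdf_sentences)),
        key=lambda i: SequenceMatcher(None, latex_sentence, pdf_sentences[i]).ratio(),
    )
```
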
andrewhead commented 4 years ago

Comments from @kyleclo on the above comment, copied from Zoom chat:

let's suppose we can get this for every token in grobid, and only math symbols in latex (e.g. between $$)

[{'start': 0, 'end': 5, 'token': 'hello', 'bbox_id': 12345}, …]

in latex

"This is a parameter $\gamma = \frac{a}{b}$ where…"

bboxes from grobid:

"This", "is", "a", "parameter", "where"

bboxes from latex:

"\gamma", "a", "b"

An NLP model (e.g., Dongyeop's) takes the LaTeX as input and returns that it wants the LaTeX sentence above. But we only have bounding boxes for the symbols, not the entire sentence. How do we surface the bounding box for the entire sentence?

We can identify Grobid bbox candidates from the same page number as "\gamma", "a", and "b", and in the same bounding box vicinity as "\gamma", "a", and "b". We can then fuzzy match:

in grobid

"This is a parameter u\1235 = a b where…"

to the LaTeX sentence... somehow.
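
A rough sketch of that fuzzy match, assuming each Grobid token carries its bounding box inline as (left, top, right, bottom); the token structure and the sliding-window search are illustrative assumptions, not what Grobid actually returns:

```python
from difflib import SequenceMatcher

def sentence_bbox(latex_sentence: str, page_tokens: list[dict]) -> dict:
    """Slide a window over the candidate page's Grobid tokens, fuzzy-match
    each window's text against the LaTeX sentence, and return the union of
    the best-matching window's bounding boxes."""
    n = len(latex_sentence.split())
    best_score, best_window = -1.0, []
    for start in range(len(page_tokens)):
        # Allow slack in window length: math expands to a different number
        # of tokens in the PDF than in the LaTeX source.
        for size in range(max(1, n - 3), n + 4):
            window = page_tokens[start : start + size]
            text = " ".join(t["token"] for t in window)
            score = SequenceMatcher(None, latex_sentence, text).ratio()
            if score > best_score:
                best_score, best_window = score, window
    boxes = [t["bbox"] for t in best_window]  # each: (left, top, right, bottom)
    return {
        "left": min(b[0] for b in boxes),
        "top": min(b[1] for b in boxes),
        "right": max(b[2] for b in boxes),
        "bottom": max(b[3] for b in boxes),
    }
```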

andrewhead commented 4 years ago

Following up on the above idea...

> One additional idea for scaling up the coloring is to copy over the output and auxiliary files from the uncolorized LaTeX before compiling the colorized code, with the hopes that only the last LaTeX compilation needs to be re-run.

I spent a bit of time looking into the AutoTeX source code. My conclusion is that we will need to make some light modifications to the AutoTeX source code, or monkey patch it, if we want to reduce the number of compilations needed per paper.

The issue seems to be a method called trash_tex_aux_files, which removes all of the aux files generated by LaTeX once the paper has finished compiling. AutoTeX would need to be modified so that it did not remove those aux files.

I think this effort is probably worth it. It just stinks that we won't be able to simply install AutoTeX through cpanm to get the most recent version.

andrewhead commented 4 years ago

Some of these ideas were implemented as of a recent commit, 3e0abb4.

andrewhead commented 4 years ago

One important source of speedup is to make it faster to compile TeX projects and raster their pages. For one of the papers we're processing for the S2 reading group---https://arxiv.org/abs/1909.13433---there's a case where it takes 40 seconds to raster the pages of the paper. It turns out the PDF embeds about a dozen other PDFs as figures, and I think those embedded PDFs contain thousands of objects. That seems to be why processing takes so long.

My fix for processing the paper was to open the directory, convert all of the figures from PDFs to PNGs (updating the references to those figures to point to the PNGs), and then package up the directory again, using this as the archive for that TeX project. This decreased the rastering time to no more than 1 or 2 seconds. In the future, this is something that could be automated in the pipeline.
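
A hedged sketch of how that workaround might be automated, assuming figures are referenced in the .tex sources with explicit .pdf extensions (pdftoppm is part of poppler-utils; the resolution and the blanket directory walk are arbitrary choices, not the pipeline's):

```python
import subprocess
from pathlib import Path

def flatten_pdf_figures(project_dir: Path) -> None:
    """Convert each PDF figure to a PNG and point .tex references at it,
    so the rasterer no longer renders object-heavy embedded PDFs."""
    for pdf in project_dir.rglob("*.pdf"):
        png = pdf.with_suffix(".png")
        # -singlefile writes exactly one PNG (named <root>.png) for a
        # one-page figure instead of <root>-1.png, <root>-2.png, ...
        subprocess.run(
            ["pdftoppm", "-png", "-r", "300", "-singlefile",
             str(pdf), str(png.with_suffix(""))],
            check=True,
        )
        for tex in project_dir.rglob("*.tex"):
            source = tex.read_text()
            if pdf.name in source:
                tex.write_text(source.replace(pdf.name, png.name))
```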

andrewhead commented 3 years ago

This issue is being closed as it is overly broad. See #132 for a concrete discussion of one idea for further improving the speed of the pipeline.

andrewhead commented 3 years ago

I'm reopening this issue as I'm hoping to speed up entity processing for some of the papers.

In my recent analyses of pipeline performance on a small number of long-running papers, I have found that the longest-running stage is always compiling the LaTeX. The second-longest stage is either locating hues in the image diffs or, in just a few cases, rastering the images. It does not appear to take very long to diff the images, or to scan and colorize the TeX.

[figure: per-stage timing measurements for the pipeline]

The implication is that we can likely get the most payoff if we decrease the time spent compiling LaTeX. To repeat some of the ideas we've thought through, they include:

  1. preserving auxiliary files across compilations
  2. processing larger batches of entities at once, reducing the number of compilations needed (see the sketch after this list)
  3. seeing if compilation time can be reduced by editing the LaTeX. For instance, it appears that paper 1612.00188 takes longer to compile than most other papers. Why is that? Can we make structure-preserving changes to the LaTeX to make it faster to process?
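
For idea 2, a sketch of the batching itself, which is simple: group entities so each compilation colorizes many of them at once, each with a hue that can be mapped back to its entity (the batch size and hue scheme here are illustrative assumptions, not the pipeline's actual values):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

BATCH_SIZE = 30  # distinct hues reliably detectable per compilation; a tuning knob

def batches(entities: Iterable[T], size: int = BATCH_SIZE) -> Iterator[list[T]]:
    """Split entities into fixed-size batches so each batch is colorized
    and compiled together, instead of one compilation per entity."""
    it = iter(entities)
    while batch := list(islice(it, size)):
        yield batch

def hue_for(index_in_batch: int, size: int = BATCH_SIZE) -> float:
    """Spread hues evenly around the color wheel so a detected hue can be
    mapped back to the entity that produced it."""
    return index_in_batch / size
```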