allenai / scholarphi

An interactive PDF reader.
Apache License 2.0

Speed up entity bounding box detection #116

Open andrewhead opened 4 years ago

andrewhead commented 4 years ago

In the current version of the pipeline, the highest-accuracy entity detections come from detecting one entity at a time. This does not scale well: it leads to some papers taking an hour or more to process.

This issue proposes ways to speed up the detection of entities; ideas are collected in the comments below.

andrewhead commented 4 years ago

Also of note is this alternative technique for colorizing spans of text in LaTeX. I suspect it would run into the very same errors, with whatsits causing subtle changes in the spacing of text, though it might be worth looking into further: https://tex.stackexchange.com/a/116907/198728

andrewhead commented 4 years ago

One additional idea for scaling up the coloring is to copy over the output and auxiliary files from the uncolorized LaTeX before compiling the colorized code, with the hopes that only the last LaTeX compilation needs to be re-run.
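
A minimal sketch of what that copy could look like, assuming the uncolorized compile's auxiliary files are still on disk (the follow-up further down notes that AutoTeX deletes them by default, so this depends on working around that); the function and the extension list are illustrative, not the pipeline's actual code:

```python
import shutil
from pathlib import Path

# Auxiliary files LaTeX writes on one pass and reads back on the next
# (cross-references, bibliography, table of contents, and so on).
AUX_EXTENSIONS = {".aux", ".bbl", ".blg", ".toc", ".lof", ".lot", ".out"}

def copy_aux_files(uncolorized_dir: Path, colorized_dir: Path) -> None:
    """Seed the colorized sources with aux files from the uncolorized
    compile, in the hope that a single LaTeX pass then suffices."""
    for path in uncolorized_dir.iterdir():
        if path.suffix in AUX_EXTENSIONS:
            shutil.copy(path, colorized_dir / path.name)
```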

andrewhead commented 4 years ago

Another idea for speeding up entity bounding box detection is:

  1. Find the bounding boxes for each token (word or character) within a paper.
  2. Align those tokens to the LaTeX representation (e.g., by extracting sentences, splitting sentences in both the LaTeX and the PDF output, and finding the ones that align; see the sketch below).
  3. Skip the visual diff'ing in the processing pipeline. Instead, detect which characters in the LaTeX each entity spans, then look up the bounding boxes for the aligned tokens extracted from Grobid.
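
A minimal sketch of the sentence-alignment step (step 2), assuming plain-text sentences have already been extracted from both the LaTeX and the PDF; the function name and the use of difflib are my illustrative choices, not the pipeline's:

```python
from difflib import SequenceMatcher

def best_match(latex_sentence: str, pdf_sentences: list[str]) -> int:
    """Return the index of the PDF-extracted sentence most similar to a
    sentence found in the LaTeX. Math will not match its rendered form
    exactly, so take the highest fuzzy-similarity score rather than
    requiring equality."""
    return max(
        range(len(pdf_sentences)),
        key=lambda i: SequenceMatcher(None, latex_sentence, pdf_sentences[i]).ratio(),
    )
```
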
andrewhead commented 4 years ago

Comments from @kyleclo on the above comment, copied from Zoom chat:

let's suppose we can get this for every token in grobid, and only math symbols in latex (e.g. between $$)

[{'start': 0, 'end': 5, 'token': 'hello', 'bbox_id': 12345}, …]

in latex

"This is a parameter $\gamma = \frac{a}{b}$ where…"

bboxes from grobid:

"This", "is", "a", "parameter", "where"

bboxes from latex:

"\gamma", "a", "b"

An NLP model (e.g., Dongyeop's) takes the LaTeX as input and returns that it wants the LaTeX sentence above. But we only have bounding boxes for the symbols, not the entire sentence. How do we surface the bounding box for the entire sentence?

We can identify Grobid bbox candidates from the same page number as "\gamma", "a", and "b", and in the same bounding box vicinity as "\gamma", "a", and "b". We can then fuzzy match:

in grobid

"This is a parameter u\1235 = a b where…"

to the LaTeX sentence... somehow.
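
A rough sketch of that fuzzy match, assuming each Grobid token carries its bounding box inline as (left, top, right, bottom); the token structure and the sliding-window search are illustrative assumptions, not what Grobid actually returns:

```python
from difflib import SequenceMatcher

def sentence_bbox(latex_sentence: str, page_tokens: list[dict]) -> dict:
    """Slide a window over the candidate page's Grobid tokens, fuzzy-match
    each window's text against the LaTeX sentence, and return the union of
    the best-matching window's bounding boxes."""
    n = len(latex_sentence.split())
    best_score, best_window = -1.0, []
    for start in range(len(page_tokens)):
        # Allow slack in window length: math expands to a different number
        # of tokens in the PDF than in the LaTeX source.
        for size in range(max(1, n - 3), n + 4):
            window = page_tokens[start : start + size]
            text = " ".join(t["token"] for t in window)
            score = SequenceMatcher(None, latex_sentence, text).ratio()
            if score > best_score:
                best_score, best_window = score, window
    boxes = [t["bbox"] for t in best_window]  # each: (left, top, right, bottom)
    return {
        "left": min(b[0] for b in boxes),
        "top": min(b[1] for b in boxes),
        "right": max(b[2] for b in boxes),
        "bottom": max(b[3] for b in boxes),
    }
```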

andrewhead commented 4 years ago

Following up on the above idea...

> One additional idea for scaling up the coloring is to copy over the output and auxiliary files from the uncolorized LaTeX before compiling the colorized code, with the hopes that only the last LaTeX compilation needs to be re-run.

I spent a bit of time looking into the AutoTeX source code. My conclusion is that we will need to make some light modifications to the AutoTeX source code, or monkey patch it, if we want to reduce the number of compilations needed per paper.

The issue seems to be a method called trash_tex_aux_files, which removes all of the aux files generated by LaTeX once the paper has finished compiling. AutoTeX would need to be modified so that it did not remove those aux files.

I think this effort is probably worth it. It just stinks that we won't be able to simply install AutoTeX through cpanm to get the most recent version.

andrewhead commented 4 years ago

Some of these ideas were implemented as of a recent commit, 3e0abb4.

andrewhead commented 4 years ago

One important source of speedup is to make it faster to compile TeX projects and raster their pages. For one of the papers we're processing for the S2 reading group---https://arxiv.org/abs/1909.13433---there's a case where it takes 40 seconds to raster the pages of the paper. It turns out the PDF embeds about a dozen other PDFs as figures, and I think those embedded PDFs contain thousands of objects. That seems to be why processing takes so long.

My fix for processing the paper was to open the directory, convert all of the figures from PDFs to PNGs (updating the references to those figures to point to the PNGs), and then package up the directory again, using this as the archive for that TeX project. This decreased the rastering time to no more than 1 or 2 seconds. In the future, this is something that could be automated in the pipeline.
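
A hedged sketch of how that workaround might be automated, assuming figures are referenced in the .tex sources with explicit .pdf extensions (pdftoppm is part of poppler-utils; the resolution and the blanket directory walk are arbitrary choices, not the pipeline's):

```python
import subprocess
from pathlib import Path

def flatten_pdf_figures(project_dir: Path) -> None:
    """Convert each PDF figure to a PNG and point .tex references at it,
    so the rasterer no longer renders object-heavy embedded PDFs."""
    for pdf in project_dir.rglob("*.pdf"):
        png = pdf.with_suffix(".png")
        # -singlefile writes exactly one PNG (named <root>.png) for a
        # one-page figure instead of <root>-1.png, <root>-2.png, ...
        subprocess.run(
            ["pdftoppm", "-png", "-r", "300", "-singlefile",
             str(pdf), str(png.with_suffix(""))],
            check=True,
        )
        for tex in project_dir.rglob("*.tex"):
            source = tex.read_text()
            if pdf.name in source:
                tex.write_text(source.replace(pdf.name, png.name))
```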

andrewhead commented 3 years ago

This issue is being closed as it is overly broad. See #132 for a concrete discussion of one idea for further improving the speed of the pipeline.

andrewhead commented 3 years ago

I'm reopening this issue as I'm hoping to speed up entity processing for some of the papers.

In my recent analyses of pipeline performance on a small number of long-running papers, I have found that the longest-running stage is always compiling the LaTeX. The second-longest stage is either locating hues in the image diffs or, in just a few cases, rastering the images. It does not appear to take very long to diff the images, or to scan and colorize the TeX.

[figure: per-stage timing measurements for the pipeline]

The implication is that we can likely get the most payoff if we decrease the time spent compiling LaTeX. To repeat some of the ideas we've thought through, they include:

  1. preserving auxiliary files across compilations
  2. processing larger batches of entities at once, reducing the number of compilations needed (see the sketch after this list)
  3. seeing if compilation time can be reduced by editing the LaTeX. For instance, it appears that paper 1612.00188 takes longer to compile than most other papers. Why is that? Can we make structure-preserving changes to the LaTeX to make it faster to process?
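
For idea 2, a sketch of the batching itself, which is simple: group entities so each compilation colorizes many of them at once, each with a hue that can be mapped back to its entity (the batch size and hue scheme here are illustrative assumptions, not the pipeline's actual values):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

BATCH_SIZE = 30  # distinct hues reliably detectable per compilation; a tuning knob

def batches(entities: Iterable[T], size: int = BATCH_SIZE) -> Iterator[list[T]]:
    """Split entities into fixed-size batches so each batch is colorized
    and compiled together, instead of one compilation per entity."""
    it = iter(entities)
    while batch := list(islice(it, size)):
        yield batch

def hue_for(index_in_batch: int, size: int = BATCH_SIZE) -> float:
    """Spread hues evenly around the color wheel so a detected hue can be
    mapped back to the entity that produced it."""
    return index_in_batch / size
```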