allenai / scholarphi

An interactive PDF reader.
Apache License 2.0
420 stars 54 forks source link

Refactor: Remove code for detecting equations, definitions, terms, sentences, and symbols #322

Open andrewhead opened 3 years ago

andrewhead commented 3 years ago

This refactor is part of a larger effort to simplify the paper processing code. An alternative (faster, higher-accuracy) pipeline has been developed for detecting sentences, symbols, and definitions. This PR removes the old code for detecting these entities, to make it clear that this original code for detecting those entities is no longer maintained, and to make the remaining code much smaller, and hopefully therefore easier read and maintain.

A summary of changes follows:

  1. Remove all entities directories for terms, sentences, sentences-pdf, definitions, and glossary_terms (this accounts for the majority of deleted files)
  2. Remove types from common.types that solely existed as data structures for the above entities
  3. Remove data loading commands for these entities from common.file_utils
  4. Remove TeX processing code that only served the purpose of assisting in the detection of these entities (e.g., remove the EquationExtractor, etc.)
  5. Remove unit tests for these entity types, and supporting test data files (e.g., tests/mathml-fragments/*)

Following these changes, the full complement of unit tests still pass:

After running pytest ., the following appears:

======================================================= test session starts =======================================================
platform darwin -- Python 3.7.0, pytest-5.3.1, py-1.10.0, pluggy-0.13.1
rootdir: /Users/andrew/github/scholarphi-refactor/data-processing, inifile: pytest.ini
plugins: cov-2.5.1
collected 105 items

tests/test_bounding_box.py ..................                                                                               [ 17%]
tests/test_colorize_tex.py .......                                                                                          [ 23%]
tests/test_compile.py ........                                                                                              [ 31%]
tests/test_normalize_tex.py ..............                                                                                  [ 44%]
tests/test_parse_tex.py ..............                                                                                      [ 58%]
tests/test_scan_tex.py .....                                                                                                [ 62%]
tests/test_string.py .............                                                                                          [ 75%]
tests/test_unpack.py ....                                                                                                   [ 79%]
tests/test_visual_validate.py ......                                                                                        [ 84%]
tests/common/test_fetch_arxiv.py ......                                                                                     [ 90%]
tests/common/commands/test_fetch_arxiv_sources.py ..                                                                        [ 92%]
tests/common/commands/test_fetch_s2_data.py ........                                                                        [100%]

======================================================= 105 passed in 1.18s =======================================================

I also re-ran extraction of citations to ensure that it still runs:

mkdir output
python scripts/run_pipeline.py -v --entities citations --output-forms file --output-dir output --arxiv-ids 2009.14237v2`

This is the contents of the generated output/citations.jsonl file:

{"format_version": "v0", "data": [{"id_": "ref:abekawa2016sidenoter-0", "type_": "citation", "bounding_boxes": [{"left": 0.2549019607843137, "top": 0.7929292929292929, "width": 0.0016339869281045752, "height": 0.006313131313131313, "page": 3, "tex_path": "N/A", "entity_id": "ref:abekawa2016sidenoter"... }]}

Before this branch is merged into 'main', PRs will be submitted to incorporate the new code for detecting sentences, symbols, and definitions. An initial version of this code is implemented in the repository https://github.com/allenai/pdftexalign.

This PR should be discussed with @ca16 to discuss places where changes in this code base will require analogous changes in the scholarphi pipeline container code.

ca16 commented 3 years ago

@andrewhead just want to make sure I follow the plan...

  1. Merge this code into chi-2021-demo
  2. Merge code from https://github.com/allenai/pdftexalign into chi-2021-demo.
  3. Merge changes from 1 and 2 into main.

Is that right?

If so, and related your question about analogous code changes in scholarphi-pipeline, what code is it that you would like scholarphi-pipeline to be calling? Right now, it works off the chi-2021-demo branch. We could have it continue to do so after 1. (I think the one change needed on the scholarphi-pipeline side for this, apart from bumping the image, would be to stop running 2% of papers through expecting them to get more than citations). We could also have it run off the chi-2021-demo branch after 2. At that point perhaps we could reintroduce the part where 2% of papers try to get more than just citations. (Both these change descriptions assume that the way we call scholarphi to get just citations output doesn't change.) If we wanted to run scholarphi-pipeline off main, we should probably figure out what other changes that exist in chi-2021-demo we want to bring in too.

andrewhead commented 3 years ago

@andrewhead just want to make sure I follow the plan... Is that right?

Yes, that's right! Though I was thinking of first merging both steps 1 and 2 into an experimental branch (e.g., "chi-2021-demo-refactor") one at a time before merging that branch into "chi-2021-demo".

That way, we would not have to turn off the proccessing of the 2% of papers with more than citations, but rather modify the way that symbol and definition commands get called from 'scholarphi-pipeline,' once the merge of the experimental branch into "chi-2021-demo" was ready.

what code is it that you would like scholarphi-pipeline to be calling?

I am thinking we could initially have it call the "chi-2021-demo" code, then once we successfully merge into "main," we could have scholarphi-pipeline run off of "main." What do you think of that idea?

ca16 commented 3 years ago

Yes, that's right! Though I was thinking of first merging both steps 1 and 2 into an experimental branch (e.g., "chi-2021-demo-refactor") one at a time before merging that branch into "chi-2021-demo". That way, we would not have to turn off the proccessing of the 2% of papers with more than citations, but rather modify the way that symbol and definition commands get called from 'scholarphi-pipeline,' once the merge of the experimental branch into "chi-2021-demo" was ready.

That sounds good.

I am thinking we could initially have it call the "chi-2021-demo" code, then once we successfully merge into "main," we could have scholarphi-pipeline run off of "main." What do you think of that idea?

I think it would be great to start running off main again! Do you think we would bring over all of the changes that exist in chi-2021-demo at that point, or do you think we would have to be at all selective? Asking because I'm remembering a talk I had with @kyleclo after the CHI demo where we weren't sure what exactly should get pulled into main from that branch... Though perhaps the changes you're making with this and the next PR would address that? Adding a note here to reflect what I think was a helpful piece of that conversation: we don't have to necessarily deploy anything that gets merged to scholarphi's main and affects data-processing to scholarphi-pipeline - we can limit it to certain commits (possibly tagged as releases).

andrewhead commented 3 years ago

Do you think we would bring over all of the changes that exist in chi-2021-demo at that point, or do you think we would have to be at all selective?

Hm... I am not sure yet. Honestly, I cannot remember all of the changes between the 'main' branch and 'chi-2021-demo' 😬 I think we will want to roll in all of the changes from the data-processing directory, though I think the ui and api directories should also be checked to see if any code is present in those branches that we do not want to be deployed.

andrewhead commented 3 years ago

Adding a note here to reflect what I think was a helpful piece of that conversation: we don't have to necessarily deploy anything that gets merged to scholarphi's main and affects data-processing to scholarphi-pipeline - we can limit it to certain commits (possibly tagged as releases).

This sounds like a good idea to me! I can imagine this making the deployed pipeline less prone to error because it would require one of us to explicitly indicate in the code which versions of the code had been evaluated as capable of running correctly. Is that the idea?

andrewhead commented 3 years ago

(Andrew's personal notes for upcoming tasks as part of this refactor:)

andrewhead commented 3 years ago

I guess I'm wondering if it's the case that the format of the same entity with the new code will be different enough that the existing types wouldn't be particularly useful anymore?

This is a great point! When I could, I left in the instances of types that I think will be useful to the new code. There were a bunch of types that did not make the cut, as they no longer have an analog in the new code (for instance, entities that have LaTeX 'start' and 'end' character offsets, where the new code will be mostly agnostic to LateX entity positions).

I still think that bumping the image that scholarphi-pipeline uses will probably be a more or less manual thing for the foreseeable future...

This sounds fine to me!

And just to be clear, I'm talking about changes in data-processing making it into the scholarphi-pipeline environment, not api or. UI changes making it into the relevant Skiff environments

Thank you for the clarification! I think we are on the same page there.