allenai / scholarphi

An interactive PDF reader.
Apache License 2.0

Write citations output to file #318

Closed ca16 closed 3 years ago

ca16 commented 3 years ago

Related to https://github.com/allenai/scholar/issues/28994.

As part of piping citation information from scholarphi through the s2airs system, we want to get that information into the annotation store. We can adapt the wrapper system (scholarphi-pipeline) to do this, provided it has access to the output that the inner scholarphi code produces.

The simplest way I found to do this, while working on https://github.com/allenai/scholar/issues/28297, was to dump the output as JSON into a file at a location provided by the caller.

This PR adapts DatabaseUploadCommand slightly, making it more of a 'save output' command. (I haven't renamed it, to keep this PR small and avoid changing more than is necessary.) The idea is that the user can specify whether they want output saved to a database (what we've been doing up until now), saved to a file (what we want for s2airs), or both. To begin with, though, only the citations extension of DatabaseUploadCommand will know how to save to a file. Other commands extending DatabaseUploadCommand should error out if the user asks for file output from them. To get an idea of how we could extend this to all entities, check out this branch: https://github.com/allenai/scholarphi/compare/chi-2021-demo...chloea-try-producing-json-take-1 (note: it has some extra code for pulling arXiv PDFs that isn't relevant here, the file-saving code is less neat, and the 'both' option doesn't exist, but hopefully it conveys how this could be extended to other entities).
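
To make the intended behavior concrete, here is a minimal sketch of the dispatch described above. It is not the code in this PR: `OutputForm`, `SaveOutputCommand`, `CitationsCommand`, `EquationsCommand`, the `"db"` and `"both"` value strings, and the in-memory `db_rows` list are all hypothetical names chosen for illustration. Only the file name `citations.jsonl` and the `{"version": ..., "data": ...}` envelope are taken from the test run shown later in this description.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional


class OutputForm(Enum):
    """Where a command should send its results (value strings are assumed)."""

    DB = "db"
    FILE = "file"
    BOTH = "both"


class SaveOutputCommand:
    """Hypothetical stand-in for DatabaseUploadCommand."""

    def __init__(self, output_form: OutputForm, output_dir: Optional[Path] = None) -> None:
        self.output_form = output_form
        self.output_dir = output_dir
        # In-memory stand-in for the real database.
        self.db_rows: List[Dict[str, Any]] = []

    def save_to_db(self, entities: List[Dict[str, Any]]) -> None:
        self.db_rows.extend(entities)

    def save_to_file(self, entities: List[Dict[str, Any]]) -> None:
        # Default: commands that have not opted in refuse file output.
        raise NotImplementedError(
            f"{type(self).__name__} does not know how to write to a file."
        )

    def save(self, entities: List[Dict[str, Any]]) -> None:
        # File output goes first so an unsupported command fails
        # before anything is written to the database.
        if self.output_form in (OutputForm.FILE, OutputForm.BOTH):
            self.save_to_file(entities)
        if self.output_form in (OutputForm.DB, OutputForm.BOTH):
            self.save_to_db(entities)


class CitationsCommand(SaveOutputCommand):
    """The only command in this sketch that supports file output."""

    def save_to_file(self, entities: List[Dict[str, Any]]) -> None:
        if self.output_dir is None:
            raise ValueError("An output directory is required for file output.")
        self.output_dir.mkdir(parents=True, exist_ok=True)
        with (self.output_dir / "citations.jsonl").open("w") as f:
            json.dump({"version": "v0", "data": entities}, f)


class EquationsCommand(SaveOutputCommand):
    """Inherits the default, so requesting file output raises."""
```

Putting the refusal in the base class means each new entity type opts in to file output by overriding one method, which is one plausible way to extend this to other entities later.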

The plan is that the existing scholarphi-pipeline system would use the 'both' option when processing in citations only mode, so that we can continue to support the current reader, and so that we can put stuff into the annotation store as we build out s2airs.

Testing done: I added some tests and ran pytest:

# pytest --all
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.10.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items                                                                                                                                                                                                                          

tests/test_bounding_box.py ..................                                                                                                                                                                                          [  8%]
tests/test_colorize_sentences.py .....                                                                                                                                                                                                 [ 10%]
tests/test_colorize_tex.py ........                                                                                                                                                                                                    [ 14%]
tests/test_compile.py ........                                                                                                                                                                                                         [ 18%]
tests/test_extract_definitions.py .................ssss                                                                                                                                                                                [ 28%]
tests/test_locate_symbols.py ...                                                                                                                                                                                                       [ 29%]
tests/test_match_symbols.py ......                                                                                                                                                                                                     [ 32%]
tests/test_normalize_tex.py ..............                                                                                                                                                                                             [ 38%]
tests/test_parse_equation.py .........................                                                                                                                                                                                 [ 50%]
tests/test_parse_tex.py ..............................................                                                                                                                                                                 [ 71%]
tests/test_sanitize_equation.py ....                                                                                                                                                                                                   [ 73%]
tests/test_scan_tex.py .....                                                                                                                                                                                                           [ 76%]
tests/test_string.py .............                                                                                                                                                                                                     [ 82%]
tests/test_unpack.py ....                                                                                                                                                                                                              [ 84%]
tests/test_visual_validate.py ......                                                                                                                                                                                                   [ 86%]
tests/common/test_fetch_arxiv.py ......                                                                                                                                                                                                [ 89%]
tests/common/test_upload_entities.py ...                                                                                                                                                                                               [ 91%]
tests/common/commands/test_database.py .........                                                                                                                                                                                       [ 95%]
tests/common/commands/test_fetch_arxiv_sources.py ..                                                                                                                                                                                   [ 96%]
tests/common/commands/test_fetch_s2_data.py ........                                                                                                                                                                                   [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

entities/definitions/commands/tokenize_sentences.py:111
  /data-processing/entities/definitions/commands/tokenize_sentences.py:111: DeprecationWarning: invalid escape sequence \s
    "EQUATION_DEPTH_0_START\s*(.*?)\s*EQUATION_DEPTH_0_END",

entities/glossary_terms/colorize.py:24
  /data-processing/entities/glossary_terms/colorize.py:24: DeprecationWarning: invalid escape sequence \S
    first_nonspace = re.search("\S", term.tex)

entities/glossary_terms/colorize.py:32
  /data-processing/entities/glossary_terms/colorize.py:32: DeprecationWarning: invalid escape sequence \S
    last_nonspace = re.search("\S(?=\s*$)", term.tex)

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
  /usr/local/lib/python3.7/dist-packages/catalogue.py:138: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
  /usr/local/lib/python3.7/dist-packages/catalogue.py:126: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================================================== 210 passed, 4 skipped, 24 warnings in 187.01s (0:03:07) ===========================================================================================

I ran one paper through the pipeline using the file-only output form and only the citations entity:

root@635a2b2ca566:/data-processing# mkdir hi
root@635a2b2ca566:/data-processing# python scripts/run_pipeline.py --one-paper-at-a-time --arxiv-ids "1601.00978v1" --output-form file --output-dir hi --entities citations
...
root@635a2b2ca566:/data-processing# ls hi/
citations.jsonl
root@635a2b2ca566:/data-processing# cat hi/citations.jsonl 
{"version": "v0", "data": [{"id_": "bandeira_automatic_2010-0", "type_": "citation", "bounding_boxes": [{"left": 0.47549019607843135, "top": 0.30808080808080807, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-0"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null}, {"id_": "bandeira_automatic_2010-1", "type_": "citation", "bounding_boxes": [{"left": 0.4035947712418301, "top": 0.3686868686868687, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-1"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null}, {"id_": "bandeira_automatic_2010-2", "type_": "citation", "bounding_boxes": [{"left": 0.8137254901960784, "top": 0.5404040404040404, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 1, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-2"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null}, {"id_": "bandeira_automatic_2010-3", "type_": "citation", "bounding_boxes": [{"left": 0.25326797385620914, "top": 0.5896464646464646, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 1, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-3"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null}, {"id_": "urbach_automatic_2009-0", "type_": "citation", "bounding_boxes": [{"left": 0.2957516339869281, "top": 0.3686868686868687, "width": 0.004901960784313725, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "urbach_automatic_2009-0"}], "data": {"key": "urbach_automatic_2009", "paper_id": "b4caa57f2321298073d651e95507052ccf2ed66d"}, "relationships": null}, 
{"id_": "urbach_automatic_2009-1", "type_": "citation", "bounding_boxes": [{"left": 0.23202614379084968, "top": 0.6944444444444444, "width": 0.004901960784313725, "height": 0.008838383838383838, "page": 1, "tex_path": "N/A", "entity_id": "urbach_automatic_2009-1"}], "data": {"key": "urbach_automatic_2009", "paper_id": "b4caa57f2321298073d651e95507052ccf2ed66d"}, "relationships": null}, {"id_": "ding_sub-kilometer_2011-0", "type_": "citation", "bounding_boxes": [{"left": 0.4166666666666667, "top": 0.3686868686868687, "width": 0.006535947712418301, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "ding_sub-kilometer_2011-0"}], "data": {"key": "ding_sub-kilometer_2011", "paper_id": "447369e1c509e7e552220fa00eec93e474e1167b"}, "relationships": null}, {"id_": "krizhevsky_imagenet_2012-0", "type_": "citation", "bounding_boxes": [{"left": 0.16830065359477125, "top": 0.5202020202020202, "width": 0.006535947712418301, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "krizhevsky_imagenet_2012-0"}], "data": {"key": "krizhevsky_imagenet_2012", "paper_id": "abd1c342495432171beb7ca8fd9551ef13cbd0ff"}, "relationships": null}, {"id_": "krizhevsky_imagenet_2012-1", "type_": "citation", "bounding_boxes": [{"left": 0.1568627450980392, "top": 0.38257575757575757, "width": 0.006535947712418301, "height": 0.008838383838383838, "page": 1, "tex_path": "N/A", "entity_id": "krizhevsky_imagenet_2012-1"}], "data": {"key": "krizhevsky_imagenet_2012", "paper_id": "abd1c342495432171beb7ca8fd9551ef13cbd0ff"}, "relationships": null}]}
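
For anyone consuming this file downstream, here is a small sketch of how the output could be read back. The field names (`version`, `data`, `type_`, `bounding_boxes`, `page`, `key`) and the three sample records are copied from the output above; the helper names `load_citations` and `pages_by_citation_key` are made up for this example and are not part of scholarphi.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List


def load_citations(path: Path) -> Dict[str, Any]:
    """Read the whole file as one JSON document."""
    with path.open() as f:
        return json.load(f)


def pages_by_citation_key(document: Dict[str, Any]) -> Dict[str, List[int]]:
    """Map each citation key to the sorted pages it appears on."""
    pages = defaultdict(set)
    for entity in document["data"]:
        if entity["type_"] != "citation":
            continue
        for box in entity["bounding_boxes"]:
            pages[entity["data"]["key"]].add(box["page"])
    return {key: sorted(found) for key, found in pages.items()}


# Three records copied verbatim from the output above.
SAMPLE = json.loads("""
{"version": "v0", "data": [
{"id_": "bandeira_automatic_2010-0", "type_": "citation", "bounding_boxes": [{"left": 0.47549019607843135, "top": 0.30808080808080807, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-0"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null},
{"id_": "bandeira_automatic_2010-2", "type_": "citation", "bounding_boxes": [{"left": 0.8137254901960784, "top": 0.5404040404040404, "width": 0.0016339869281045752, "height": 0.008838383838383838, "page": 1, "tex_path": "N/A", "entity_id": "bandeira_automatic_2010-2"}], "data": {"key": "bandeira_automatic_2010", "paper_id": "9769ac9d79a75cfefd3731978b9f319d68345ab5"}, "relationships": null},
{"id_": "urbach_automatic_2009-0", "type_": "citation", "bounding_boxes": [{"left": 0.2957516339869281, "top": 0.3686868686868687, "width": 0.004901960784313725, "height": 0.008838383838383838, "page": 0, "tex_path": "N/A", "entity_id": "urbach_automatic_2009-0"}], "data": {"key": "urbach_automatic_2009", "paper_id": "b4caa57f2321298073d651e95507052ccf2ed66d"}, "relationships": null}
]}
""")

print(pages_by_citation_key(SAMPLE))
```

With the real file, `pages_by_citation_key(load_citations(Path("hi/citations.jsonl")))` would give the same kind of summary for the whole paper.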

I also ran it with the file-only option and a non-citations entity (equations). It failed, as expected:

python scripts/run_pipeline.py --one-paper-at-a-time --arxiv-ids "1601.00978v1" --output-form file --output-dir hi --entities equations
...
AssertionError: C does not know how to write to a file.

Note: this PR is for merging into the chi-2021-demo branch, as that is what the scholarphi-pipeline system is still running off of.

ca16 commented 3 years ago

@andrewhead thanks for the comments! They were helpful.

I've made a few updates based on your suggestions, and attempted to answer your questions.

I think the one still open question is around using Literal instead of Enum, from this comment.

andrewhead commented 3 years ago

hi @ca16 ! Awesome PR, I am already using these flags!!

Would it make sense to rename the extension of the output from .jsonl to .json, given that the file contains just one line and can be parsed as a single JSON object? Sorry, I should have thought to ask this before, though I only realized it when I started working with the file.
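
A quick sketch of why the extension matters to readers of the file, using only the standard `json` module. The helper `parse_json_lines` is a hypothetical line-by-line reader written for this example, and the empty `data` list is a placeholder rather than real output.

```python
import json
from typing import Any, List


def parse_json_lines(text: str) -> List[Any]:
    """Parse JSON Lines: one independent JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


document = {"version": "v0", "data": []}

# Written on a single line, the text works with either reader.
single_line = json.dumps(document)
assert json.loads(single_line) == document
assert parse_json_lines(single_line) == [document]

# The same document pretty-printed is still valid JSON,
# but a line-by-line reader fails on its first line.
pretty = json.dumps(document, indent=2)
assert json.loads(pretty) == document
try:
    parse_json_lines(pretty)
    jsonl_reader_failed = False
except json.JSONDecodeError:
    jsonl_reader_failed = True
assert jsonl_reader_failed
```

Since the file is always one object, a `.json` extension tells consumers to use a whole-document parser, which keeps working even if the writer later changes its whitespace.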

ca16 commented 3 years ago

Yeah I think that's a good idea! I can make the change.