allenai / scholarphi

An interactive PDF reader.
Apache License 2.0

Add bib entry text to citations output #351

Closed: ca16 closed this 2 years ago

ca16 commented 2 years ago

Related to https://github.com/allenai/scholar/issues/31749.

The idea is to include the raw text we have for each bib entry in the output produced by the citations pipeline, since we may want to surface it in cases where we fail to match a bib entry to an S2 paper.

This PR adjusts the 'upload citations' command to include bib entry texts (when we have them) in the citations pipeline output, both when we write output to a db and when we write output to a file. If two bib entries have the same key, we keep the bib entry text of the last one we see and log a warning.
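
As a rough illustration of the duplicate-key behavior described above, here is a minimal sketch. It is not the code in this PR: `BibEntry` and `collect_bib_entry_texts` are hypothetical names, and the sketch assumes entries without text are simply skipped.

```python
import logging
from typing import Dict, Iterable, NamedTuple, Optional

logger = logging.getLogger(__name__)


class BibEntry(NamedTuple):
    # Hypothetical record type; the pipeline's real types may differ.
    key: str
    text: Optional[str]


def collect_bib_entry_texts(entries: Iterable[BibEntry]) -> Dict[str, str]:
    """Map each bib entry key to its raw text; the last text seen for a key wins."""
    texts: Dict[str, str] = {}
    for entry in entries:
        if entry.text is None:
            # Assumption: entries with no raw text are left out of the map.
            continue
        if entry.key in texts:
            logger.warning(
                "Duplicate bib entry key %r; keeping the text of the last entry seen.",
                entry.key,
            )
        texts[entry.key] = entry.text
    return texts


# Made-up example entries, for illustration only.
entries = [
    BibEntry("smith2020", "J. Smith. A first title. 2020."),
    BibEntry("lee2019", None),
    BibEntry("smith2020", "J. Smith. A second title. 2020."),
]
texts = collect_bib_entry_texts(entries)
```

Here `texts` ends up with a single entry for `smith2020` holding the second text, and one warning is logged for the repeated key.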

NOTE: we've also talked about including, in the citations pipeline's output, mentions linked to bib entries that have no match to an S2 paper. That's coming up in a separate PR. This PR just adds bib entry text to the output.

Testing

I tested this by running a couple of papers through the pipeline. Example output files: 1611.07004v3-current.txt, 1611.07004v3-with-texts-2.txt, 2009.12303v4-current.txt, 2009.12303v4-with-texts-2.txt

You can see snippets in this comment.

I also ran the data-processing tests described here: https://github.com/allenai/scholarphi/tree/chi-2021-demo/data-processing#running-tests

# pytest -m slow
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items / 199 deselected / 15 selected                                                                                                                                                                                           

tests/test_extract_definitions.py ...........ssss                                                                                                                                                                                      [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

/usr/lib/python3/dist-packages/colorama/ansitowin32.py:49
  /usr/lib/python3/dist-packages/colorama/ansitowin32.py:49: DeprecationWarning: invalid escape sequence \[
    ANSI_CSI_RE = re.compile('\001?\033\[((?:\d|;)*)([a-zA-Z])\002?')     # Control Sequence Introducer

/usr/lib/python3/dist-packages/colorama/ansitowin32.py:50
  /usr/lib/python3/dist-packages/colorama/ansitowin32.py:50: DeprecationWarning: invalid escape sequence \]
    ANSI_OSC_RE = re.compile('\001?\033\]((?:.|;)*?)(\x07)\002?')         # Operating System Command

entities/definitions/commands/tokenize_sentences.py:111
  /data-processing/entities/definitions/commands/tokenize_sentences.py:111: DeprecationWarning: invalid escape sequence \s
    "EQUATION_DEPTH_0_START\s*(.*?)\s*EQUATION_DEPTH_0_END",

entities/glossary_terms/colorize.py:24
  /data-processing/entities/glossary_terms/colorize.py:24: DeprecationWarning: invalid escape sequence \S
    first_nonspace = re.search("\S", term.tex)

entities/glossary_terms/colorize.py:32
  /data-processing/entities/glossary_terms/colorize.py:32: DeprecationWarning: invalid escape sequence \S
    last_nonspace = re.search("\S(?=\s*$)", term.tex)

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
  /usr/local/lib/python3.7/dist-packages/catalogue.py:138: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
  /usr/local/lib/python3.7/dist-packages/catalogue.py:126: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=================================================================================== 11 passed, 4 skipped, 199 deselected, 27 warnings in 164.00s (0:02:43) ===================================================================================
...
# pytest
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items                                                                                                                                                                                                                          

tests/test_bounding_box.py ..................                                                                                                                                                                                          [  8%]
tests/test_colorize_sentences.py .....                                                                                                                                                                                                 [ 10%]
tests/test_colorize_tex.py ........                                                                                                                                                                                                    [ 14%]
tests/test_compile.py ........                                                                                                                                                                                                         [ 18%]
tests/test_extract_definitions.py ssss...ss...sssssssss                                                                                                                                                                                [ 28%]
tests/test_locate_symbols.py ...                                                                                                                                                                                                       [ 29%]
tests/test_match_symbols.py ......                                                                                                                                                                                                     [ 32%]
tests/test_normalize_tex.py ..............                                                                                                                                                                                             [ 38%]
tests/test_parse_equation.py .........................                                                                                                                                                                                 [ 50%]
tests/test_parse_tex.py ..............................................                                                                                                                                                                 [ 71%]
tests/test_sanitize_equation.py ....                                                                                                                                                                                                   [ 73%]
tests/test_scan_tex.py .....                                                                                                                                                                                                           [ 76%]
tests/test_string.py .............                                                                                                                                                                                                     [ 82%]
tests/test_unpack.py ....                                                                                                                                                                                                              [ 84%]
tests/test_visual_validate.py ......                                                                                                                                                                                                   [ 86%]
tests/common/test_fetch_arxiv.py ......                                                                                                                                                                                                [ 89%]
tests/common/test_upload_entities.py ............                                                                                                                                                                                      [ 95%]
tests/common/commands/test_fetch_arxiv_sources.py ..                                                                                                                                                                                   [ 96%]
tests/common/commands/test_fetch_s2_data.py ........                                                                                                                                                                                   [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================================================ 199 passed, 15 skipped, 4 warnings in 2.09s =================================================================================================

Note that the same warnings also occur off the current branch, so they are not introduced by this PR:

# pytest -m slow
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items / 199 deselected / 15 selected                                                                                                                                                                                           

tests/test_extract_definitions.py ...........ssss                                                                                                                                                                                      [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

/usr/lib/python3/dist-packages/colorama/ansitowin32.py:49
  /usr/lib/python3/dist-packages/colorama/ansitowin32.py:49: DeprecationWarning: invalid escape sequence \[
    ANSI_CSI_RE = re.compile('\001?\033\[((?:\d|;)*)([a-zA-Z])\002?')     # Control Sequence Introducer

/usr/lib/python3/dist-packages/colorama/ansitowin32.py:50
  /usr/lib/python3/dist-packages/colorama/ansitowin32.py:50: DeprecationWarning: invalid escape sequence \]
    ANSI_OSC_RE = re.compile('\001?\033\]((?:.|;)*?)(\x07)\002?')         # Operating System Command

entities/definitions/commands/tokenize_sentences.py:111
  /data-processing/entities/definitions/commands/tokenize_sentences.py:111: DeprecationWarning: invalid escape sequence \s
    "EQUATION_DEPTH_0_START\s*(.*?)\s*EQUATION_DEPTH_0_END",

entities/glossary_terms/colorize.py:24
  /data-processing/entities/glossary_terms/colorize.py:24: DeprecationWarning: invalid escape sequence \S
    first_nonspace = re.search("\S", term.tex)

entities/glossary_terms/colorize.py:32
  /data-processing/entities/glossary_terms/colorize.py:32: DeprecationWarning: invalid escape sequence \S
    last_nonspace = re.search("\S(?=\s*$)", term.tex)

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
  /usr/local/lib/python3.7/dist-packages/catalogue.py:138: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
  /usr/local/lib/python3.7/dist-packages/catalogue.py:126: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=================================================================================== 11 passed, 4 skipped, 199 deselected, 27 warnings in 174.75s (0:02:54) ===================================================================================
...
# pytest
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items                                                                                                                                                                                                                          

tests/test_bounding_box.py ..................                                                                                                                                                                                          [  8%]
tests/test_colorize_sentences.py .....                                                                                                                                                                                                 [ 10%]
tests/test_colorize_tex.py ........                                                                                                                                                                                                    [ 14%]
tests/test_compile.py ........                                                                                                                                                                                                         [ 18%]
tests/test_extract_definitions.py ssss...ss...sssssssss                                                                                                                                                                                [ 28%]
tests/test_locate_symbols.py ...                                                                                                                                                                                                       [ 29%]
tests/test_match_symbols.py ......                                                                                                                                                                                                     [ 32%]
tests/test_normalize_tex.py ..............                                                                                                                                                                                             [ 38%]
tests/test_parse_equation.py .........................                                                                                                                                                                                 [ 50%]
tests/test_parse_tex.py ..............................................                                                                                                                                                                 [ 71%]
tests/test_sanitize_equation.py ....                                                                                                                                                                                                   [ 73%]
tests/test_scan_tex.py .....                                                                                                                                                                                                           [ 76%]
tests/test_string.py .............                                                                                                                                                                                                     [ 82%]
tests/test_unpack.py ....                                                                                                                                                                                                              [ 84%]
tests/test_visual_validate.py ......                                                                                                                                                                                                   [ 86%]
tests/common/test_fetch_arxiv.py ......                                                                                                                                                                                                [ 89%]
tests/common/test_upload_entities.py ............                                                                                                                                                                                      [ 95%]
tests/common/commands/test_fetch_arxiv_sources.py ..                                                                                                                                                                                   [ 96%]
tests/common/commands/test_fetch_s2_data.py ........                                                                                                                                                                                   [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================================================ 199 passed, 15 skipped, 4 warnings in 2.11s =================================================================================================

kyleclo commented 2 years ago

@andrewhead FYI