allenai / scholarphi

An interactive PDF reader.
Apache License 2.0

Include bib entries with no s2 paper matches in citations output #353

Closed ca16 closed 2 years ago

ca16 commented 2 years ago

Related to https://github.com/allenai/scholar/issues/31778, builds on https://github.com/allenai/scholarphi/pull/351.

The idea is to include bib entries that we failed to match to S2 papers in the output of the citations pipeline, so that we can still identify the corresponding mention bounding boxes (and possibly show the bib entry text in them instead).
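As a rough illustration of the idea (not the actual pipeline code), the sketch below builds one output entity per citation mention and simply omits `paper_id` when a bibitem has no S2 match, instead of dropping the mention. The function name, the mention IDs, and the entity layout (`id_`, `type_`, `data`) are assumptions made for this example; only the warning text mirrors the logs further down.

```python
import logging
from typing import Dict, List, Set, Tuple

logger = logging.getLogger(__name__)


def build_citation_entities(
    mentions: List[Tuple[str, str]],
    s2_ids_by_key: Dict[str, str],
    arxiv_id: str,
) -> List[Dict[str, object]]:
    """Build one output entity per citation mention (hypothetical layout).

    `mentions` pairs a mention ID with the bibitem key it cites.
    `s2_ids_by_key` contains only the bibitems that were matched to S2 papers.
    """
    entities: List[Dict[str, object]] = []
    warned: Set[str] = set()
    for mention_id, key in mentions:
        data: Dict[str, str] = {"key": key}
        paper_id = s2_ids_by_key.get(key)
        if paper_id is not None:
            data["paper_id"] = paper_id
        elif key not in warned:
            # Keep the mention, but record once per key that it has no match.
            logger.warning(
                "Missing S2 match for bibitem with key %s for paper %s",
                key,
                arxiv_id,
            )
            warned.add(key)
        entities.append({"id_": mention_id, "type_": "citation", "data": data})
    return entities
```

With this shape, unmatched mentions still appear in the output, and consumers can tell them apart by the absence of `paper_id`.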

Testing

Running examples

I tested this out by running a couple of papers through (same as the ones for #351).

A paper missing matches, without these changes

Output file: 1611.07004v3-current-2.txt

Logs:

# python scripts/run_pipeline.py    --one-paper-at-a-time    --arxiv-ids 1611.07004v3    --output-forms file    --output-dir outputs-1611.07004v3-current-2    --entities citations
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2022-04-27 19:19:06,158 [INFO]: The following commands will be run, in this order: ['fetch-arxiv-sources', 'fetch-s2-metadata', 'unpack-sources', 'compile-tex', 'normalize-tex', 'compile-normalized-tex', 'raster-pages', 'extract-bibitems', 'resolve-bibitems', 'locate-bounding-boxes-for-citation-fragments', 'locate-citations', 'upload-citations']
2022-04-27 19:19:06,158 [INFO]: Running pipeline for paper 1611.07004v3
2022-04-27 19:19:06,159 [INFO]: Launching command fetch-arxiv-sources
2022-04-27 19:19:27,732 [INFO]: Finished running command fetch-arxiv-sources
2022-04-27 19:19:27,733 [INFO]: Launching command fetch-s2-metadata
2022-04-27 19:19:27,734 [INFO]: Issuing request to S2 @ api.semanticscholar.org
2022-04-27 19:19:36,160 [INFO]: Finished running command fetch-s2-metadata
2022-04-27 19:19:36,161 [INFO]: Launching command unpack-sources
2022-04-27 19:19:36,408 [INFO]: Finished running command unpack-sources
2022-04-27 19:19:36,409 [INFO]: Launching command compile-tex
2022-04-27 19:19:39,639 [INFO]: Finished running command compile-tex
2022-04-27 19:19:39,641 [INFO]: Launching command normalize-tex
2022-04-27 19:19:39,735 [INFO]: Finished running command normalize-tex
2022-04-27 19:19:39,736 [INFO]: Launching command compile-normalized-tex
2022-04-27 19:19:42,860 [INFO]: Finished running command compile-normalized-tex
2022-04-27 19:19:42,861 [INFO]: Launching command raster-pages
2022-04-27 19:19:44,506 [INFO]: Finished running command raster-pages
2022-04-27 19:19:44,507 [INFO]: Launching command extract-bibitems
2022-04-27 19:19:45,310 [INFO]: Finished running command extract-bibitems
2022-04-27 19:19:45,311 [INFO]: Launching command resolve-bibitems
2022-04-27 19:19:45,316 [WARNING]: Could not find a sufficiently similar reference for bibitem sketch2pokemon of paper 1611.07004v3
2022-04-27 19:19:45,318 [WARNING]: Could not find a sufficiently similar reference for bibitem pose of paper 1611.07004v3
2022-04-27 19:19:45,320 [WARNING]: Could not find a sufficiently similar reference for bibitem edges2cats of paper 1611.07004v3
2022-04-27 19:19:45,323 [WARNING]: Could not find a sufficiently similar reference for bibitem fotogenerator of paper 1611.07004v3
2022-04-27 19:19:45,325 [WARNING]: Could not find a sufficiently similar reference for bibitem palette of paper 1611.07004v3
2022-04-27 19:19:45,327 [WARNING]: Could not find a sufficiently similar reference for bibitem background of paper 1611.07004v3
2022-04-27 19:19:45,329 [WARNING]: Could not find a sufficiently similar reference for bibitem sketch2portrait of paper 1611.07004v3
2022-04-27 19:19:45,330 [WARNING]: Could not find a sufficiently similar reference for bibitem sunday of paper 1611.07004v3
2022-04-27 19:19:45,473 [INFO]: Paper has 65 references, able to match 57 reference(s) to S2 data.
2022-04-27 19:19:45,473 [INFO]: Finished running command resolve-bibitems
2022-04-27 19:19:45,473 [INFO]: Launching command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:20:13,767 [INFO]: Finished running command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:20:13,767 [INFO]: Launching command locate-citations
2022-04-27 19:20:13,788 [INFO]: Finished running command locate-citations
2022-04-27 19:20:13,789 [INFO]: We will be writing output to files.
2022-04-27 19:20:13,789 [INFO]: Launching command upload-citations
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key sketch2pokemon because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key pose because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key edges2cats because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key fotogenerator because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key palette because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key background because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key sketch2portrait because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,798 [WARNING]: Not uploading bounding box information for citation with key sunday because it was not resolved to a paper S2 ID.
2022-04-27 19:20:13,799 [INFO]: Saving to file...
2022-04-27 19:20:13,799 [INFO]: About to write 119 entity infos to outputs-1611.07004v3-current-2/citations.json (version: v0).
2022-04-27 19:20:13,805 [INFO]: Finished running command upload-citations
2022-04-27 19:20:13,841 [INFO]: Internal process exited

Mentions in output:

$ cat 1611.07004v3-current-2.json | jq ' .data | map(.id_) | .[]'  | wc -l
     119

A paper missing matches, with these changes

Output file: 1611.07004v3-with-missing-matches-2.txt

Logs:

# python scripts/run_pipeline.py    --one-paper-at-a-time    --arxiv-ids 1611.07004v3    --output-forms file    --output-dir outputs-1611.07004v3-with-missing-matches-2    --entities citations
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2022-04-27 19:09:20,719 [INFO]: The following commands will be run, in this order: ['fetch-arxiv-sources', 'fetch-s2-metadata', 'unpack-sources', 'compile-tex', 'normalize-tex', 'compile-normalized-tex', 'raster-pages', 'extract-bibitems', 'resolve-bibitems', 'locate-bounding-boxes-for-citation-fragments', 'locate-citations', 'upload-citations']
2022-04-27 19:09:20,719 [INFO]: Running pipeline for paper 1611.07004v3
2022-04-27 19:09:20,720 [INFO]: Launching command fetch-arxiv-sources
2022-04-27 19:09:42,544 [INFO]: Finished running command fetch-arxiv-sources
2022-04-27 19:09:42,545 [INFO]: Launching command fetch-s2-metadata
2022-04-27 19:09:42,545 [INFO]: Issuing request to S2 @ api.semanticscholar.org
2022-04-27 19:09:50,283 [INFO]: Finished running command fetch-s2-metadata
2022-04-27 19:09:50,284 [INFO]: Launching command unpack-sources
2022-04-27 19:09:50,537 [INFO]: Finished running command unpack-sources
2022-04-27 19:09:50,538 [INFO]: Launching command compile-tex
2022-04-27 19:09:53,928 [INFO]: Finished running command compile-tex
2022-04-27 19:09:53,930 [INFO]: Launching command normalize-tex
2022-04-27 19:09:54,022 [INFO]: Finished running command normalize-tex
2022-04-27 19:09:54,022 [INFO]: Launching command compile-normalized-tex
2022-04-27 19:09:57,314 [INFO]: Finished running command compile-normalized-tex
2022-04-27 19:09:57,315 [INFO]: Launching command raster-pages
2022-04-27 19:09:59,130 [INFO]: Finished running command raster-pages
2022-04-27 19:09:59,131 [INFO]: Launching command extract-bibitems
2022-04-27 19:09:59,858 [INFO]: Finished running command extract-bibitems
2022-04-27 19:09:59,858 [INFO]: Launching command resolve-bibitems
2022-04-27 19:09:59,864 [WARNING]: Could not find a sufficiently similar reference for bibitem sketch2pokemon of paper 1611.07004v3
2022-04-27 19:09:59,866 [WARNING]: Could not find a sufficiently similar reference for bibitem pose of paper 1611.07004v3
2022-04-27 19:09:59,868 [WARNING]: Could not find a sufficiently similar reference for bibitem edges2cats of paper 1611.07004v3
2022-04-27 19:09:59,871 [WARNING]: Could not find a sufficiently similar reference for bibitem fotogenerator of paper 1611.07004v3
2022-04-27 19:09:59,872 [WARNING]: Could not find a sufficiently similar reference for bibitem palette of paper 1611.07004v3
2022-04-27 19:09:59,874 [WARNING]: Could not find a sufficiently similar reference for bibitem background of paper 1611.07004v3
2022-04-27 19:09:59,876 [WARNING]: Could not find a sufficiently similar reference for bibitem sketch2portrait of paper 1611.07004v3
2022-04-27 19:09:59,878 [WARNING]: Could not find a sufficiently similar reference for bibitem sunday of paper 1611.07004v3
2022-04-27 19:10:00,012 [INFO]: Paper has 65 references, able to match 57 reference(s) to S2 data.
2022-04-27 19:10:00,013 [INFO]: Finished running command resolve-bibitems
2022-04-27 19:10:00,013 [INFO]: Launching command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:10:27,876 [INFO]: Finished running command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:10:27,876 [INFO]: Launching command locate-citations
2022-04-27 19:10:27,898 [INFO]: Finished running command locate-citations
2022-04-27 19:10:27,899 [INFO]: We will be writing output to files.
2022-04-27 19:10:27,899 [INFO]: Launching command upload-citations
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key sketch2pokemon for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key pose for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key edges2cats for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key fotogenerator for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key palette for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key background for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key sketch2portrait for paper 1611.07004v3
2022-04-27 19:10:27,910 [WARNING]: Missing S2 match for bibitem with key sunday for paper 1611.07004v3
2022-04-27 19:10:27,911 [INFO]: Saving to file...
2022-04-27 19:10:27,911 [INFO]: About to write 127 entity infos to outputs-1611.07004v3-with-missing-matches-2/citations.json (version: v0).
2022-04-27 19:10:27,918 [INFO]: Finished running command upload-citations
2022-04-27 19:10:27,953 [INFO]: Internal process exited

Mentions in output:

$ cat 1611.07004v3-with-missing-matches-2.json | jq ' .data | map(.id_) | .[]'  | wc -l
     127
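For anyone who would rather compare the two runs in Python than with `jq`, here is a minimal sketch. It assumes only what the `jq` filter above implies: a top-level `data` list whose entries each carry an `id_`. The helper names are made up for this example, and the set difference is only meaningful if entity IDs are stable between runs, which I have not verified.

```python
import json
from typing import List, Set


def entity_ids(path: str) -> List[str]:
    """Return the `id_` of every entry under the top-level `data` key."""
    with open(path, encoding="utf-8") as f:
        document = json.load(f)
    return [entry["id_"] for entry in document["data"]]


def added_ids(before_path: str, after_path: str) -> Set[str]:
    """Return IDs present in the second file but not in the first."""
    return set(entity_ids(after_path)) - set(entity_ids(before_path))
```

`len(entity_ids(path))` should agree with the `jq ... | wc -l` count for the same file, and `added_ids` lists which entries the change introduced.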

A paper not missing matches, without these changes

Output file: 2009.12303v4-current-2.txt

Logs:

# python scripts/run_pipeline.py    --one-paper-at-a-time    --arxiv-ids 2009.12303v4    --output-forms file    --output-dir outputs-2009.12303v4-current-2    --entities citations
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2022-04-27 19:21:12,842 [INFO]: The following commands will be run, in this order: ['fetch-arxiv-sources', 'fetch-s2-metadata', 'unpack-sources', 'compile-tex', 'normalize-tex', 'compile-normalized-tex', 'raster-pages', 'extract-bibitems', 'resolve-bibitems', 'locate-bounding-boxes-for-citation-fragments', 'locate-citations', 'upload-citations']
2022-04-27 19:21:12,843 [INFO]: Running pipeline for paper 2009.12303v4
2022-04-27 19:21:12,843 [INFO]: Launching command fetch-arxiv-sources
2022-04-27 19:21:23,888 [INFO]: Finished running command fetch-arxiv-sources
2022-04-27 19:21:23,890 [INFO]: Launching command fetch-s2-metadata
2022-04-27 19:21:23,890 [INFO]: Issuing request to S2 @ api.semanticscholar.org
2022-04-27 19:21:27,177 [INFO]: Finished running command fetch-s2-metadata
2022-04-27 19:21:27,178 [INFO]: Launching command unpack-sources
2022-04-27 19:21:27,218 [INFO]: Finished running command unpack-sources
2022-04-27 19:21:27,219 [INFO]: Launching command compile-tex
2022-04-27 19:21:32,876 [INFO]: Finished running command compile-tex
2022-04-27 19:21:32,877 [INFO]: Launching command normalize-tex
2022-04-27 19:21:32,889 [INFO]: Finished running command normalize-tex
2022-04-27 19:21:32,889 [INFO]: Launching command compile-normalized-tex
2022-04-27 19:21:38,265 [INFO]: Finished running command compile-normalized-tex
2022-04-27 19:21:38,266 [INFO]: Launching command raster-pages
2022-04-27 19:21:39,182 [INFO]: Finished running command raster-pages
2022-04-27 19:21:39,183 [INFO]: Launching command extract-bibitems
2022-04-27 19:21:40,358 [INFO]: Finished running command extract-bibitems
2022-04-27 19:21:40,359 [INFO]: Launching command resolve-bibitems
2022-04-27 19:21:40,578 [INFO]: Paper has 55 references, able to match 55 reference(s) to S2 data.
2022-04-27 19:21:40,578 [INFO]: Finished running command resolve-bibitems
2022-04-27 19:21:40,578 [INFO]: Launching command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:22:06,946 [INFO]: Finished running command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:22:06,947 [INFO]: Launching command locate-citations
2022-04-27 19:22:06,967 [INFO]: Finished running command locate-citations
2022-04-27 19:22:06,967 [INFO]: We will be writing output to files.
2022-04-27 19:22:06,967 [INFO]: Launching command upload-citations
2022-04-27 19:22:06,977 [INFO]: Saving to file...
2022-04-27 19:22:06,977 [INFO]: About to write 102 entity infos to outputs-2009.12303v4-current-2/citations.json (version: v0).
2022-04-27 19:22:06,983 [INFO]: Finished running command upload-citations
2022-04-27 19:22:06,997 [INFO]: Internal process exited

Mentions in output:

$ cat 2009.12303v4-current-2.json | jq ' .data | map(.id_) | .[]'  | wc -l
     102

A paper not missing matches, with these changes

Output file: 2009.12303v4-with-missing-matches-2.txt

Logs:

# python scripts/run_pipeline.py    --one-paper-at-a-time    --arxiv-ids 2009.12303v4    --output-forms file    --output-dir outputs-2009.12303v4-with-missing-matches-2    --entities citations
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2022-04-27 19:11:17,538 [INFO]: The following commands will be run, in this order: ['fetch-arxiv-sources', 'fetch-s2-metadata', 'unpack-sources', 'compile-tex', 'normalize-tex', 'compile-normalized-tex', 'raster-pages', 'extract-bibitems', 'resolve-bibitems', 'locate-bounding-boxes-for-citation-fragments', 'locate-citations', 'upload-citations']
2022-04-27 19:11:17,538 [INFO]: Running pipeline for paper 2009.12303v4
2022-04-27 19:11:17,539 [INFO]: Launching command fetch-arxiv-sources
2022-04-27 19:11:28,838 [INFO]: Finished running command fetch-arxiv-sources
2022-04-27 19:11:28,840 [INFO]: Launching command fetch-s2-metadata
2022-04-27 19:11:28,840 [INFO]: Issuing request to S2 @ api.semanticscholar.org
2022-04-27 19:11:32,156 [INFO]: Finished running command fetch-s2-metadata
2022-04-27 19:11:32,157 [INFO]: Launching command unpack-sources
2022-04-27 19:11:32,203 [INFO]: Finished running command unpack-sources
2022-04-27 19:11:32,204 [INFO]: Launching command compile-tex
2022-04-27 19:11:37,973 [INFO]: Finished running command compile-tex
2022-04-27 19:11:37,974 [INFO]: Launching command normalize-tex
2022-04-27 19:11:37,993 [INFO]: Finished running command normalize-tex
2022-04-27 19:11:37,994 [INFO]: Launching command compile-normalized-tex
2022-04-27 19:11:43,336 [INFO]: Finished running command compile-normalized-tex
2022-04-27 19:11:43,337 [INFO]: Launching command raster-pages
2022-04-27 19:11:44,142 [INFO]: Finished running command raster-pages
2022-04-27 19:11:44,144 [INFO]: Launching command extract-bibitems
2022-04-27 19:11:45,245 [INFO]: Finished running command extract-bibitems
2022-04-27 19:11:45,246 [INFO]: Launching command resolve-bibitems
2022-04-27 19:11:45,452 [INFO]: Paper has 55 references, able to match 55 reference(s) to S2 data.
2022-04-27 19:11:45,452 [INFO]: Finished running command resolve-bibitems
2022-04-27 19:11:45,453 [INFO]: Launching command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:12:12,297 [INFO]: Finished running command locate-bounding-boxes-for-citation-fragments
2022-04-27 19:12:12,298 [INFO]: Launching command locate-citations
2022-04-27 19:12:12,317 [INFO]: Finished running command locate-citations
2022-04-27 19:12:12,317 [INFO]: We will be writing output to files.
2022-04-27 19:12:12,317 [INFO]: Launching command upload-citations
2022-04-27 19:12:12,328 [INFO]: Saving to file...
2022-04-27 19:12:12,328 [INFO]: About to write 102 entity infos to outputs-2009.12303v4-with-missing-matches-2/citations.json (version: v0).
2022-04-27 19:12:12,335 [INFO]: Finished running command upload-citations
2022-04-27 19:12:12,349 [INFO]: Internal process exited

Mentions in output:

$ cat 2009.12303v4-with-missing-matches-2.json | jq ' .data | map(.id_) | .[]'  | wc -l
     102

Automated tests

I also ran the tests described here:

# pytest
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items                                                                                                                                                                                                                          

tests/test_bounding_box.py ..................                                                                                                                                                                                          [  8%]
tests/test_colorize_sentences.py .....                                                                                                                                                                                                 [ 10%]
tests/test_colorize_tex.py ........                                                                                                                                                                                                    [ 14%]
tests/test_compile.py ........                                                                                                                                                                                                         [ 18%]
tests/test_extract_definitions.py ssss...ss...sssssssss                                                                                                                                                                                [ 28%]
tests/test_locate_symbols.py ...                                                                                                                                                                                                       [ 29%]
tests/test_match_symbols.py ......                                                                                                                                                                                                     [ 32%]
tests/test_normalize_tex.py ..............                                                                                                                                                                                             [ 38%]
tests/test_parse_equation.py .........................                                                                                                                                                                                 [ 50%]
tests/test_parse_tex.py ..............................................                                                                                                                                                                 [ 71%]
tests/test_sanitize_equation.py ....                                                                                                                                                                                                   [ 73%]
tests/test_scan_tex.py .....                                                                                                                                                                                                           [ 76%]
tests/test_string.py .............                                                                                                                                                                                                     [ 82%]
tests/test_unpack.py ....                                                                                                                                                                                                              [ 84%]
tests/test_visual_validate.py ......                                                                                                                                                                                                   [ 86%]
tests/common/test_fetch_arxiv.py ......                                                                                                                                                                                                [ 89%]
tests/common/test_upload_entities.py ............                                                                                                                                                                                      [ 95%]
tests/common/commands/test_fetch_arxiv_sources.py ..                                                                                                                                                                                   [ 96%]
tests/common/commands/test_fetch_s2_data.py ........                                                                                                                                                                                   [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================================================ 199 passed, 15 skipped, 4 warnings in 2.10s =================================================================================================
root@ac4354aaa2ba:/data-processing# pytest -m slow
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.5, pytest-5.3.1, py-1.11.0, pluggy-0.13.1
rootdir: /data-processing, inifile: pytest.ini, testpaths: tests
plugins: cov-2.5.1
collected 214 items / 199 deselected / 15 selected                                                                                                                                                                                           

tests/test_extract_definitions.py ...........ssss                                                                                                                                                                                      [100%]

============================================================================================================== warnings summary ==============================================================================================================
/usr/lib/python3/dist-packages/urllib3/util/selectors.py:14
  /usr/lib/python3/dist-packages/urllib3/util/selectors.py:14: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping

/usr/lib/python3/dist-packages/urllib3/_collections.py:2
  /usr/lib/python3/dist-packages/urllib3/_collections.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import Mapping, MutableMapping

/usr/local/lib/python3.7/dist-packages/wandb/util.py:37
  /usr/local/lib/python3.7/dist-packages/wandb/util.py:37: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

/usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /usr/local/lib/python3.7/dist-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_simple_definitions
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_before_symbol
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extracts_nickname_symbol_filter
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_extract_abbreviation_acronym
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_model_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
tests/test_extract_definitions.py::test_extract_abbreviation_shortened_word
  /usr/local/lib/python3.7/dist-packages/catalogue.py:138: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

tests/test_extract_definitions.py::test_model_extracts_simple_definitions
  /usr/local/lib/python3.7/dist-packages/catalogue.py:126: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for entry_point in AVAILABLE_ENTRY_POINTS.get(self.entry_point_namespace, []):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=================================================================================== 11 passed, 4 skipped, 199 deselected, 22 warnings in 168.38s (0:02:48) ===================================================================================

Mosqidiot commented 2 years ago

Overall, it looks good to me. Just curious, and maybe I missed this part: has any testing been done to make sure that a missing "paper_id" key will not crash anything downstream? Maybe it is part of the pytest run?

ca16 commented 2 years ago

@Mosqidiot I put some notes about the testing I did around this for the old reader/scholarphi reader here: https://github.com/allenai/scholar/issues/31778#issuecomment-1104382940. Things seemed okay to me (though @kyleclo thinks we might not need to care about that now that we have the new reader, so maybe it doesn't matter at all).
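To make the concern concrete, a downstream consumer that tolerates a missing `paper_id` could look roughly like the sketch below. This is not code from either reader: the function name and the entity layout (a `data` dict that may or may not contain `paper_id`) are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_match(entities: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate entities that have a `paper_id` from those that do not.

    Uses `.get` throughout so that a missing key never raises a KeyError.
    """
    matched: List[Dict] = []
    unmatched: List[Dict] = []
    for entity in entities:
        paper_id = entity.get("data", {}).get("paper_id")
        if paper_id:
            matched.append(entity)
        else:
            unmatched.append(entity)
    return matched, unmatched
```

A consumer that should not surface unmatched entries can use only the first list; one that wants to show bib entry text can read the second.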

For the new reader, making this change alone will not affect what the frontend gets from s2airs. I've split the remaining work into roughly two parts:

  1. Get the rest of the backend pipeline from here through s2airs set up. As part of that, ensure that the s2airs API does not include citations/bib entries that are unmatched. (https://github.com/allenai/scholar/issues/31779)
  2. Add bib entry text and bib entries/citations missing matches to s2airs API output, after coordinating with new reader frontend people. (https://github.com/allenai/scholar/issues/31811)

ca16 commented 2 years ago

FYI @kyleclo @andrewhead