allenai / mmda

multimodal document analysis
Apache License 2.0

Refactor grobid sections #281

Closed geli-gel closed 8 months ago

geli-gel commented 8 months ago

Solution to https://github.com/allenai/scholar/issues/38452

Refactored how Grobid sections, paragraphs, and sentences are annotated onto the doc, to reduce SpanGroup overlap errors.

This involved refactoring the "sections" part of the code to generate SpanGroups only from sentences and headings (since those are the BoxGroups Grobid provides), and using tuples of [optional[heading], [list of paragraphs[list of sentences]]] to build the hierarchical section/paragraph SpanGroups, instead of trying to build huge box lists for each piece as originally written. This made it easier to pinpoint the source of SpanGroup overlap errors.
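A rough sketch of the tuple idea, for readers who want to see the shape of the data. This is not the code in the PR: `Span`, `Section`, `enclosing`, and `build_hierarchy` are hypothetical names, and spans are plain character offsets rather than mmda objects. It only shows how paragraph and section spans can be derived from sentence and heading spans instead of from large box lists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a character span (end is exclusive)."""
    start: int
    end: int


# One section = (optional heading span, list of paragraphs, each a list of sentence spans).
Section = Tuple[Optional[Span], List[List[Span]]]


def enclosing(spans: List[Span]) -> Span:
    """Smallest span that covers every span in the list."""
    return Span(min(s.start for s in spans), max(s.end for s in spans))


def build_hierarchy(sections: List[Section]) -> Tuple[List[Span], List[Span]]:
    """Derive paragraph and section spans from sentence and heading spans only."""
    paragraph_spans: List[Span] = []
    section_spans: List[Span] = []
    for heading, paragraphs in sections:
        # A paragraph is just the extent of its sentences; empty paragraphs are skipped.
        current = [enclosing(sentences) for sentences in paragraphs if sentences]
        paragraph_spans.extend(current)
        # A section is the extent of its heading (if any) plus its paragraphs.
        parts = ([heading] if heading is not None else []) + current
        if parts:
            section_spans.append(enclosing(parts))
    return paragraph_spans, section_spans
```

Because every higher-level span is computed from the sentence spans beneath it, an overlap between two sections or paragraphs can be traced back to the specific sentences that caused it.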

A necessary update was to make _box_groups_to_span_groups keep track of which tokens were already allocated to previous sentences. We loop through by Grobid paragraph tags, and sentences in different paragraphs were sometimes claiming the same tokens.
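A minimal sketch of that bookkeeping, assuming boxes are plain (x0, y0, x1, y1) tuples and tokens are identified by index. `overlaps` and `allocate_tokens` are hypothetical helpers, not the actual _box_groups_to_span_groups implementation; the point is only that a token claimed by an earlier sentence is skipped by later ones.

```python
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def overlaps(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def allocate_tokens(sentence_boxes: List[Box], token_boxes: List[Box]) -> List[List[int]]:
    """Assign token indices to sentences in order; each token goes to at most one sentence."""
    allocated: Set[int] = set()
    result: List[List[int]] = []
    for sentence_box in sentence_boxes:
        claimed = [
            i
            for i, token_box in enumerate(token_boxes)
            if i not in allocated and overlaps(sentence_box, token_box)
        ]
        allocated.update(claimed)  # later sentences can no longer claim these tokens
        result.append(claimed)
    return result
```

With this first-come-first-served rule, a stray token that sits under two sentence boxes ends up only in the earlier sentence.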

Example of a previously failing (now passing) PDF. The Grobid XML has a ".":

image

the actual PDF does not:

image

we end up with overlapping spangroups A:

image

and B:

image

So, it seems that both Grobid and PDFPlumber are possibly mistaking the dot of the "i" in the line below as a "." in the line in question.

Grobid splits into a new "paragraph" there, but both sentence boxes grab that ".".


Another update to _box_groups_to_span_groups that prevents SpanGroup overlaps of a different type (when MergeSpans merges token spans encompassing already allocated tokens) was also added in.
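To illustrate this second kind of overlap, here is a hedged sketch using token indices. `merge_token_runs` and its `max_gap` parameter are hypothetical and do not reproduce mmda's MergeSpans; the sketch only shows the guard: two runs are merged across a gap only when no token in that gap is already allocated elsewhere.

```python
from typing import Iterable, List, Set, Tuple


def merge_token_runs(
    claimed: Iterable[int], allocated: Set[int], max_gap: int = 1
) -> List[Tuple[int, int]]:
    """Merge claimed token indices into (start, end) runs, end exclusive.

    A gap of up to max_gap tokens is bridged, unless bridging it would
    swallow a token that another group has already been allocated.
    """
    runs: List[Tuple[int, int]] = []
    for i in sorted(set(claimed)):
        if runs:
            start, end = runs[-1]
            gap = range(end, i)  # token indices a merge would absorb
            if len(gap) <= max_gap and not any(g in allocated for g in gap):
                runs[-1] = (start, i + 1)
                continue
        runs.append((i, i + 1))
    return runs
```

For claimed tokens 3, 4, 6, 7, the runs merge into one span when token 5 is free, and stay split around it when token 5 belongs to another group.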


Also added a fix for the "missing attribute: 'coords'" error, which was contributing to the overall failure rate.
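The PR description does not say how the fix works. One defensive pattern, sketched here under assumptions, is to skip sentence elements that carry no coords attribute instead of raising. `sentence_coords` is a hypothetical helper, the sample XML has no namespaces (real Grobid TEI does), and the "page,x,y,width,height" values separated by ";" are an assumed coords layout.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def sentence_coords(xml_text: str) -> List[List[Tuple[float, ...]]]:
    """Collect parsed coords for each <s> element, skipping elements without coords."""
    root = ET.fromstring(xml_text)
    collected: List[List[Tuple[float, ...]]] = []
    for sentence in root.iter("s"):
        raw = sentence.attrib.get("coords")
        if not raw:
            # No coords on this sentence: skip it rather than fail the whole document.
            continue
        boxes = [
            tuple(float(value) for value in part.split(","))
            for part in raw.split(";")
            if part
        ]
        collected.append(boxes)
    return collected
```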


Ran this on a list of past test PDFs, as well as recently failed PDFs from spp prod logs, and they all pass (the snippet from my Jupyter notebook below includes my personal debugging comments):

sha = '74da5d99e7d951f0dc9c3111186b22544a18bff5' # spangroups overlapping at paragraphs -- passes!
sha = '43659b55f75e3b2ea626bfc8eeea80afa3798c97' # spangroups overlapping at sections -- passes!
sha = 'ade545fda5015a8aac957a69a126da55451ff016' # spangroups overlapping at sections -- passes!
sha = '59e4c0ecfdcbaa651ca2c40625817bb83a9af4c3' # spangroups overlapping at sections -- passes!
sha = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54' # overlap at (27919, 28096, 9), (27952, 27953, 10) -- passes!
sha = '1d1d7702cc4aaa3f66c29d4eb5ac023091d601e0' # this one's effed up no paragraphs -- successful but no sentences YET does have mentions ??? -- passes!!!!

sha = '121e30c48546e671dc5e16c694c5e69b392cf8fb' # OG experimental paper (3 pager), wondering if takase et al ref is part of sentence... should be! -- yep, it is. Passes!!
sha = 'e5910c027af0ee9c1901c57f6579d903aedee7f4' # test paper for test_grobid_augment_existing...

# sha = '32ff296b592d9cb69c88e239c8e80c7cc5cb3207' # this one has weird stuff for in-between text deciphering -- passes though!
# sha = '2423065e82ffbeb15353517cd8ceed9b168f039d' # a successful one -- has nothing of the sort (GOOD) -- still works after sections refactor! -- great
# sha = 'd55e9255deeb98ca2db55cd2e9bfac22774a2c32' # messed up weird mention not found in section -- 
# sha = '7535981e48c5cccd4d101895b2a350f114d25f5f' # ok maybe better same as above
# sha = 'b936dc63ad9a1380537b0bcc889c92b6af00431e' # sentence doesn't have coords? let's see... -- passes!
# sha = '6ae0afceaaa55ac6d4ec9b5b321f9aa1334b0429' # coords.. -- passes
# sha = '0706021a12b2d74eb5f9fd2f5dc187581a8c66a5' # i jus wanna see "after discussing the risks and benefits" section if 30 is there
# sha = '63929df1d44cec7b407d063f222fcc64e3de2ad3' # dikken et al 2012 section 0 -- only 2012 is there  --- is it the mention? or the section sentence? -- NOICE. added pad_x on sentences
# sha = '51c96902345101a9f2108749ad96d869e595548d' # is it the same in grobid xml, missing numbers/years? of refs?
# sha = '6bb4b89a1dd3bb03a3a2523a2e7867c1bb73a52a' # i jus wana picture -- no i also want grobid boxes drawn (BAD)
# sha = '304e2a42e897aa728d394e2d1e60ea26f4f1c101' # is abstract just ","?? -- no.
# sha = '8dd9ac4f26bee54cf1ee85c50fd63a1f44555fd1' # let's see a recently failed one -- WORKS!! IRL it's a straight up UGLY PDF.
# sha = '2c7f2e6f481873f72c9477e6d5447d1715668da6' 
# sha = 'dfbd16a81af6763d77696a620263295c2ea230f4'
# sha = 'b1e7e7df5aa502a2922ede6325e9aae2f14f6b71'
# sha = '383cfcef25477da08c86f96f5abcd7f796a1b51e'
# sha = '8a408271a1a2226163a579499905a0f4752b5085'
# sha = 'd5be1893584b41f6b567a6ac1a4d7676d80b0b98' # works! previously failed? i think?

TODO:

geli-gel commented 8 months ago

Chris reminds me that _box_groups_to_span_groups is used by anything that calls .annotate with BoxGroups.

tt verify on figure_table_predictors passes ✅. The test: https://github.com/allenai/mmda/blob/c28a17f50c0ed68f3d973398a0e1969b6a148797/src/ai2_internal/figure_table_predictors/integration_test.py#L88

(mmda) angelez@ip-10-0-0-231 mmda % tt verify
Usage: tt verify [OPTIONS]
Try 'tt verify --help' for help.

Error: Missing option '--config-file' / '-c'.
(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml    

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 8
Using selected option: figure_table_predictors
...
 => => naming to docker.io/library/figure_table_predictors__timo-server                                                                                                                                                       0.0s
============================= test session starts ==============================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: anyio-3.7.1
collected 12 items

test_entrypoint.py .....                                                 [ 41%]
integration_tests/test_runner.py ..                                      [ 58%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 12 passed, 1 warning in 15.76s ========================

tt verify on bibentry_detection_predictor (checked because of https://github.com/allenai/mmda/blob/cab36b6532125a28c19831d63f7e7a9321633701/src/ai2_internal/bibentry_detection_predictor/interface.py#L105 ) passes ✅. The test: https://github.com/allenai/mmda/blob/5d853d7d0aa6932f77cf2303d73f33d7aaa1d4a3/src/ai2_internal/bibentry_detection_predictor/integration_test.py#L147

(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml

Choose a variant by name or number: 
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 7
Using selected option: bibentry_detection_predictor
...
 => => naming to docker.io/library/bibentry_detection_predictor__timo-server                                                                                                                                                  0.0s
x ./
x ./archive/
x ./archive/config.yaml
x ./archive/model_final.pth
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: hydra-core-1.3.2, anyio-3.7.1
collected 13 items

test_entrypoint.py .....                                                 [ 38%]
integration_tests/test_runner.py ...                                     [ 61%]
server/test_invocation_sampler.py .....                                  [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46
  /usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46: DeprecationWarning: LINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use BILINEAR or Resampling.BILINEAR instead.
    def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):

../../../usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62
  /usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
    warnings.warn('No PKG-INFO found for package: %s' % self.package_name)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 13 passed, 2 warnings in 797.48s (0:13:17) ==================

As for the SpanGroup overlap errors that arose in VILA (these actually came from LayoutParser, but no longer do as of this change: https://github.com/allenai/mmda/pull/236/files#): we used to call .annotate(blocks=[layoutparser BoxGroups]), which would trigger _box_groups_to_span_groups. This comment shows what LayoutParser BoxGroups look like and explains why they cause SpanGroup overlaps: https://github.com/allenai/scholar/issues/36351#issuecomment-1584986899 . This change would mask those errors, and we'd end up with spanless SpanGroups that have BoxGroups. No errors, but probably not the best result anyway. ❌

It might be better to just have boxes, which I think is what our dream "Entity" allows (these make more sense as just boxes). However, someone annotating BoxGroups onto a doc and expecting SpanGroups might not want this result.