Closed: geli-gel closed this 8 months ago
Chris reminds me that box_groups_to_span_groups is used by anything that uses .annotate with boxgroups.
tt verify on figure_table_predictors passes ✅. The test: https://github.com/allenai/mmda/blob/c28a17f50c0ed68f3d973398a0e1969b6a148797/src/ai2_internal/figure_table_predictors/integration_test.py#L88
(mmda) angelez@ip-10-0-0-231 mmda % tt verify
Usage: tt verify [OPTIONS]
Try 'tt verify --help' for help.
Error: Missing option '--config-file' / '-c'.
(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml
Choose a variant by name or number:
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 8
Using selected option: figure_table_predictors
...
=> => naming to docker.io/library/figure_table_predictors__timo-server 0.0s
============================= test session starts ==============================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: anyio-3.7.1
collected 12 items
test_entrypoint.py ..... [ 41%]
integration_tests/test_runner.py .. [ 58%]
server/test_invocation_sampler.py ..... [100%]
=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62
/usr/local/lib/python3.8/site-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
warnings.warn('No PKG-INFO found for package: %s' % self.package_name)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 12 passed, 1 warning in 15.76s ========================
tt verify on bibentry_detection_predictor (checked because of https://github.com/allenai/mmda/blob/cab36b6532125a28c19831d63f7e7a9321633701/src/ai2_internal/bibentry_detection_predictor/interface.py#L105) passes ✅. The test: https://github.com/allenai/mmda/blob/5d853d7d0aa6932f77cf2303d73f33d7aaa1d4a3/src/ai2_internal/bibentry_detection_predictor/integration_test.py#L147
(mmda) angelez@ip-10-0-0-231 mmda % tt verify -c src/ai2_internal/config.yaml
Choose a variant by name or number:
1. ivila-row-layoutlm-finetuned-s2vl-v2
2. layout_parser
3. bibentry_predictor
4. bibentry_predictor_mmda
5. citation_mentions
6. citation_links
7. bibentry_detection_predictor
8. figure_table_predictors
9. dwp-heuristic
10. svm-word-predictor
> 7
Using selected option: bibentry_detection_predictor
...
=> => naming to docker.io/library/bibentry_detection_predictor__timo-server 0.0s
x ./
x ./archive/
x ./archive/config.yaml
x ./archive/model_final.pth
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.4.3, pluggy-1.3.0
rootdir: /opt/ml/code
plugins: hydra-core-1.3.2, anyio-3.7.1
collected 13 items
test_entrypoint.py ..... [ 38%]
integration_tests/test_runner.py ... [ 61%]
server/test_invocation_sampler.py ..... [100%]
=============================== warnings summary ===============================
../../../usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46
/usr/local/lib/python3.8/dist-packages/detectron2/data/transforms/transform.py:46: DeprecationWarning: LINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use BILINEAR or Resampling.BILINEAR instead.
def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):
../../../usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62
/usr/local/lib/python3.8/dist-packages/pkginfo/installed.py:62: UserWarning: No PKG-INFO found for package: workingdir
warnings.warn('No PKG-INFO found for package: %s' % self.package_name)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 13 passed, 2 warnings in 797.48s (0:13:17) ==================
As for the SpanGroup overlap errors that arose in VILA (and actually came from LayoutParser, but no longer do as of this change: https://github.com/allenai/mmda/pull/236/files#): you can see we used to call .annotate(blocks=[layoutparser BoxGroups]), which would activate _box_groups_to_span_groups.
This comment shows what LayoutParser BoxGroups look like (and explains why they cause SpanGroup overlaps):
https://github.com/allenai/scholar/issues/36351#issuecomment-1584986899
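For anyone who wants to see what a "SpanGroup overlap" means in isolation, here is a minimal sketch. It does not use mmda; `Span` as a `(start, end)` tuple and the `find_overlapping_groups` helper are hypothetical names made up for illustration. It assumes each group is just a list of character spans and reports which pairs of groups share at least one character.

```python
from typing import List, Tuple

# Hypothetical stand-in for a character span: (start, end), end exclusive.
Span = Tuple[int, int]


def find_overlapping_groups(groups: List[List[Span]]) -> List[Tuple[int, int]]:
    """Return sorted pairs of group indices whose spans share any character."""
    # Tag every span with the index of the group it belongs to, sorted by start.
    tagged = sorted(
        (start, end, group_index)
        for group_index, spans in enumerate(groups)
        for start, end in spans
    )
    conflicts = set()
    active: List[Tuple[int, int]] = []  # (end, group_index) of spans still open
    for start, end, group_index in tagged:
        # Drop spans that ended at or before this span's start.
        active = [(e, g) for e, g in active if e > start]
        for _, other_group in active:
            if other_group != group_index:
                pair = (min(other_group, group_index), max(other_group, group_index))
                conflicts.add(pair)
        active.append((end, group_index))
    return sorted(conflicts)


# Group 0 and group 1 both cover characters 8-9; group 2 only touches group 1's end.
groups = [[(0, 10)], [(8, 20)], [(20, 30)]]
overlapping = find_overlapping_groups(groups)
```

With the example above, only the first two groups are reported as conflicting, since end offsets are treated as exclusive.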
This change would mask those errors, and we'd end up with Spanless SpanGroups with BoxGroups. No errors, but probably not the best result anyway. ❌
Might be better to just have boxes, which I think is what our dream "Entity" allows (these make more sense as just boxes). However, someone annotating boxgroups onto a doc and expecting SpanGroups might not want this result.
Solution to https://github.com/allenai/scholar/issues/38452
Refactored the way Grobid sections/paragraphs/sentences are annotated onto the doc to reduce SpanGroup overlap errors.
This involved refactoring the "sections" part of the code to generate SpanGroups only from sentences and headings (since those are the boxgroups Grobid provides), and using tuples of [optional[heading], [list of paragraphs[list of sentences]]] to build the hierarchical section/paragraph SpanGroups instead of trying to make huge box lists for each piece, as originally written. This made it easier to pinpoint the source of SpanGroup overlap errors.
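A rough sketch of the tuple shape described above, not the actual code in this PR. The `Span` dataclass, the type aliases, and the `paragraph_spans` / `section_spans` helpers are hypothetical names for illustration. The assumption is that headings and sentences have already been resolved to spans, so paragraph and section spans can be derived from them rather than from their own box lists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a character span (end is exclusive)."""
    start: int
    end: int


# (optional heading, paragraphs); each paragraph is a list of sentences, and
# each heading or sentence is already resolved to a list of spans.
Sentence = List[Span]
Paragraph = List[Sentence]
Section = Tuple[Optional[List[Span]], List[Paragraph]]


def paragraph_spans(paragraph: Paragraph) -> List[Span]:
    """A paragraph's spans are the spans of its sentences, in order."""
    return [span for sentence in paragraph for span in sentence]


def section_spans(section: Section) -> List[Span]:
    """A section's spans are its heading spans (if any) plus all paragraph spans."""
    heading, paragraphs = section
    spans = list(heading) if heading else []
    for paragraph in paragraphs:
        spans.extend(paragraph_spans(paragraph))
    return spans


# One section: a heading, a two-sentence paragraph, and a one-sentence paragraph.
example_section: Section = (
    [Span(0, 12)],
    [
        [[Span(13, 40)], [Span(41, 70)]],
        [[Span(71, 95)]],
    ],
)
example_spans = section_spans(example_section)
```

Because every higher-level span is built from sentence and heading spans, an overlap at the paragraph or section level can be traced back to the specific sentence that caused it.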
A necessary update was to make _box_groups_to_span_groups keep track of which tokens were already allocated to previous sentences: we loop through by Grobid paragraph tags, and token allocations sometimes overlapped between paragraphs.
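The bookkeeping idea can be sketched roughly like this. It is an illustration under simplifying assumptions, not the real _box_groups_to_span_groups: boxes are plain `(left, top, right, bottom)` tuples on a single page, and `overlaps` and `assign_tokens` are hypothetical helpers. The point is only that the `allocated` set is carried from one paragraph's call to the next.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical box: (left, top, right, bottom), all on the same page.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assign_tokens(
    token_boxes: List[Box],
    group_boxes: List[Box],
    allocated: Optional[Set[int]] = None,
) -> Tuple[List[List[int]], Set[int]]:
    """Give each group box the overlapping tokens that nobody has claimed yet.

    `allocated` is updated in place and also returned, so the caller can pass
    it into the next call (for example, the next paragraph).
    """
    allocated = set() if allocated is None else allocated
    assignments: List[List[int]] = []
    for group_box in group_boxes:
        claimed = [
            index
            for index, token_box in enumerate(token_boxes)
            if index not in allocated and overlaps(token_box, group_box)
        ]
        allocated.update(claimed)
        assignments.append(claimed)
    return assignments, allocated


tokens = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10)]
paragraph_1 = [(0, 0, 15, 10)]   # sentence box that spills onto token 1
paragraph_2 = [(12, 0, 30, 10)]  # sentence box that also touches token 1

first, seen = assign_tokens(tokens, paragraph_1)
second, seen = assign_tokens(tokens, paragraph_2, seen)
```

In this example token 1 goes to the first paragraph only. Without passing `seen` into the second call, both paragraphs would claim it.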
Also added another update to _box_groups_to_span_groups that prevents a different type of SpanGroup overlap (when MergeSpans merges token spans that encompass already-allocated tokens).
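A simplified sketch of that second kind of overlap and one way to guard against it. This is not mmda's MergeSpans; `merge_token_spans` and its parameters are hypothetical. It assumes tokens are indexed in reading order and merges a group's claimed tokens into as few spans as possible, but starts a new span whenever a token owned by another group sits in between.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical character span: (start, end), end exclusive.
Span = Tuple[int, int]


def merge_token_spans(
    token_spans: List[Span],
    claimed: List[int],
    allocated_elsewhere: Set[int],
) -> List[Span]:
    """Merge spans of claimed tokens, never across a token owned by another group."""
    merged: List[Span] = []
    previous: Optional[int] = None
    for index in sorted(claimed):
        start, end = token_spans[index]
        # A merge is blocked if any token between the previous claimed token
        # and this one already belongs to a different group.
        blocked = previous is not None and any(
            between in allocated_elsewhere for between in range(previous + 1, index)
        )
        if merged and not blocked:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
        previous = index
    return merged


token_spans = [(0, 5), (6, 10), (11, 15), (16, 20)]
# This group claims tokens 0, 2 and 3; token 1 already belongs to another group.
guarded = merge_token_spans(token_spans, [0, 2, 3], allocated_elsewhere={1})
unguarded = merge_token_spans(token_spans, [0, 2, 3], allocated_elsewhere=set())
```

Without the guard, the merged span runs from token 0 through token 3 and swallows token 1, which is exactly the overlap described above. Note that this sketch still merges across gap tokens that no group owns, which may or may not match the real merging threshold.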
Also added a fix for the "missing attribute: 'coords'" error, which was also contributing to the overall failure rate.
Ran this on a list of past test PDFs, as well as recently failed PDFs from spp prod logs, and they all pass (the snippet from the jupyter notebook includes my personal debugging comments...)
TODO: