allenai / mmda

multimodal document analysis
Apache License 2.0
158 stars 18 forks source link

Kylel/2023/hardening codebase #259

Open kyleclo opened 1 year ago

kyleclo commented 1 year ago

Big PR. A lot of it is actually just auto linting/formatting. Here are main changes:

  1. For Box, I added some assertions to make sure problematic Boxes are caught at creation time:

        if w < 0 or h < 0:
            raise ValueError(f"Width and height must be non-negative, got {w} and {h}")
        if page < 0:
            raise ValueError(f"Page must be non-negative, got {page}")
        if l < 0 or t < 0:
            raise ValueError(f"Left and top must be non-negative, got {l} and {t}")

    This has been helpful for debugging issues & the library assumes "nice" Boxes.

  2. For Box, I added another assertion for the is_overlap() method:

    def is_overlap(
        self, other: "Box", x: float = 0.0, y: float = 0, center: bool = False
    ) -> bool:
         ...
        if self.page != other.page:
            return False
  3. For Span, I added a merge_boxes: bool = True option to small_spans_to_big_span(). This should preserve the default behavior. But gives us the option to not do that and just create a big Span that has no larger Box. I would prefer that Not-Merging is actually the default behavior, but I think that might break too many things, so maybe in the future.

  4. For Box and Span, I added utility method cluster_boxes() and cluster_spans(). It's based on overlap.

  5. For Box, I also added utility method shrink(). It's not used anywhere now, but it's helpful for debugging.

  6. For BoxGroup and SpanGroup, I've added a new flag in the constructor: allow_overlap: Optional[bool] = False. This should preserve default behavior, which is that it disallows Box or Span within itself to have overlaps. For example, this should catch if a single BoxGroup has duplicate Boxes or a single SpanGroup has duplicate Spans. Otherwise, classes behave as they originally did.

  7. For Indexers, I added a new BoxGroupIndexer class that behaves similarly to SpanGroupIndexer. It's not used, but it was helpful for debugging.

  8. Besides the above library changes, almost everything else I added was a missing test.