allenai / mmda

multimodal document analysis
Apache License 2.0
158 stars 18 forks source link

cleanup Annotation class: remove `uuid`, pull `id` out of `metadata`, remove `dataclasses`, add `getter/setter` for `text` and `type`, make `Metadata()` take args #155

Closed kyleclo closed 2 years ago

kyleclo commented 2 years ago

@soldni im tryin to migrate off dataclass, but the tests are failing at:

ERROR tests/test_eval/test_metrics.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_internal_ai2/test_api.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_parsers/test_grobid_header_parser.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_parsers/test_override.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_parsers/test_pdf_plumber_parser.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_predictors/test_bibentry_predictor.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_predictors/test_dictionary_word_predictor.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_predictors/test_span_group_classification_predictor.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_predictors/test_vila_predictors.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_types/test_document.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_types/test_indexers.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_types/test_json_conversion.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_types/test_metadata.py - TypeError: add_deprecated_field only works on dataclasses
ERROR tests/test_types/test_span_group.py - TypeError: add_deprecated_field only works on dataclasses

not sure how best to handle

soldni commented 2 years ago

@kyleclo I think you can remove all store_field_in_metadata decorators and simply use getters and setters.

For example, for type:

class SpanGroup(Annotation):
    def __init__(
        self,
        spans: List[Span],
        type: Optional[str] = None,
        text: Optional[str] = None,
        id: Optional[int] = None,
        doc: Optional['Document'] = None,
        metadata: Optional[Metadata] = None,
    ):
        super().__init__(id=id, doc=doc, metadata=metadata)

        self.spans = spans
        self.text = text
        self.id = id

    @property
    def text(self) -> str:
        maybe_text = self.metadata.get("text", None)
        if maybe_text is None:
            return " ".join(self.symbols)
        return maybe_text

    @text.setter
    def text(self, text: Union[str, None]) -> None:
        self.metadata.text = text

    @property
    def id(self) -> Union[int, None]:
        return self.metadata.get('id', None)

    @id.setter
    def id(self, id: Union[int, None]) -> None:
        self.metadata.id = id

Just make sure to call super() before assigning self.text and self.id, otherwise self.metadata won't be available!

Then we can remove the super frightening decorator in metadata.py.... so happy we don't have to maintain!

kyleclo commented 2 years ago

@cmwilhelm i'm unsure how to handle this test failing in tests/test_internal_api/test_api.py:

class TestApi(unittest.TestCase):
    def test_vanilla_span_group(self) -> None:
        sg_ann = mmda_ann.SpanGroup.from_json({
            'spans': [{'start': 0, 'end': 1}],
            'metadata': {'text': 'hello', 'id': 1}
        })

        sg_api = mmda_api.SpanGroup.from_mmda(sg_ann)

        self.assertEqual(sg_api.text, 'hello')
        self.assertEqual(sg_api.id, 1)
        self.assertEqual(sg_api.attributes.dict(), {})

the problem is, with the change to how .id is being handled in the library, a typical Document object no longer serializes .to_json() in a manner where the .id value goes into metadata; it's written to JSON at a top-level ID.

so one way to adjust this would be to rewrite this test:

        sg_ann = mmda_ann.SpanGroup.from_json({
            'spans': [{'start': 0, 'end': 1}],
            'id': 1,
            'metadata': {'text': 'hello', 'id': 3}
        })

which would allow id=1 to be passed properly to the constructor, thereby allowing the self.assertEqual(sg_api.id, 1) to pass.

can make this change, but wanted to check this is acceptable

soldni commented 2 years ago

@cmwilhelm i'm unsure how to handle this test failing in tests/test_internal_api/test_api.py:

class TestApi(unittest.TestCase):
    def test_vanilla_span_group(self) -> None:
        sg_ann = mmda_ann.SpanGroup.from_json({
            'spans': [{'start': 0, 'end': 1}],
            'metadata': {'text': 'hello', 'id': 1}
        })

        sg_api = mmda_api.SpanGroup.from_mmda(sg_ann)

        self.assertEqual(sg_api.text, 'hello')
        self.assertEqual(sg_api.id, 1)
        self.assertEqual(sg_api.attributes.dict(), {})

the problem is, with the change to how .id is being handled in the library, a typical Document object no longer serializes .to_json() in a manner where the .id value goes into metadata; it's written to JSON at a top-level ID.

so one way to adjust this would be to rewrite this test:

        sg_ann = mmda_ann.SpanGroup.from_json({
            'spans': [{'start': 0, 'end': 1}],
            'id': 1,
            'metadata': {'text': 'hello', 'id': 3}
        })

which would allow id=1 to be passed properly to the constructor, thereby allowing the self.assertEqual(sg_api.id, 1) to pass.

can make this change, but wanted to check this is acceptable

lol I wrote this test 😅 change looks good to me, but let's make sure @comorado is aware of this change too, since he's using the api too.

kyleclo commented 2 years ago

@comorado pls see above

soldni commented 2 years ago

I didn't see any updates to ai2_internal/api.py. Which is probably 'cause it lacks tests. The changes as are I think will be breaking there, because

@kyleclo you probably want to rebase over main before adding tests, since PR #156 fixed some issues with api / added more tests.

kyleclo commented 2 years ago

thanks, already on it. got almost everything worked out, just one thing, which is whether what should happen when people try to call span_group.metadata.<key> and the key doesnt exist. it currently errors out; should we default to None or is that too much potential for silent failing for everyone's tastes?

@cmwilhelm @soldni @comorado

kyleclo commented 2 years ago

@cmwilhelm pls check if changes oki