allenai / mmda

multimodal document analysis

Add attributes to API data classes #150

Closed · soldni closed this 1 year ago

soldni commented 1 year ago

This PR adds metadata to API data classes.

API data classes and mmda types differ in a few significant aspects:

soldni commented 1 year ago

@cmwilhelm Per @yoganandc's suggestion, I'm now mapping MMDA's metadata field to the attributes field in the Annotation Store.

This should fit the design of the Annotation Store better. All previous considerations about attributes being typed still apply.
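
For concreteness, a hedged illustration of what that mapping looks like for a single annotation (the values and surrounding field names are made up, not the Annotation Store's actual schema):

# Illustrative only: a plain dict standing in for an MMDA annotation's metadata,
# and where its contents would land on the API side.
mmda_metadata = {"confidence": 0.93, "section": "References"}

api_annotation = {
    "id": 42,
    "type": "bib_entry",
    "attributes": dict(mmda_metadata),  # MMDA `metadata` becomes `attributes`
}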

Let me know what you think!

soldni commented 1 year ago

I was hoping for something much simpler...

can we just have a

from typing import Any, Dict, Optional
from pydantic import BaseModel

class Annotation(BaseModel):
  id: Optional[str]
  type: Optional[str]
  attributes: Dict[str, Any] = {}

class BoxGroup(Annotation):
  ...

class SpanGroup(Annotation):
  ...

and then in to_mmda()

pydantic_annotation.attributes = dict(mmda_annotation.metadata)

The tricky part in this is that, in mmda, we made the decision to store id and type in the metadata. In fact, mmda_annotation.id, mmda_annotation.type, and mmda_annotation.text are just aliases to mmda_annotation.metadata.id, mmda_annotation.metadata.type, and mmda_annotation.metadata.text (the aliases are actually implemented through descriptors).
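
Roughly, the aliasing works like this (an illustrative sketch only, not the actual mmda implementation; MetadataAlias is a made-up name and SimpleNamespace stands in for mmda's Metadata object):

from types import SimpleNamespace

class MetadataAlias:
    """Descriptor that forwards reads and writes of a field to obj.metadata."""
    def __init__(self, field: str):
        self.field = field

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj.metadata, self.field, None)

    def __set__(self, obj, value):
        setattr(obj.metadata, self.field, value)

class SpanGroup:
    id = MetadataAlias("id")
    type = MetadataAlias("type")
    text = MetadataAlias("text")

    def __init__(self):
        self.metadata = SimpleNamespace()

sg = SpanGroup()
sg.id = 3                          # actually stored on sg.metadata
assert sg.metadata.id == 3 and sg.id == 3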

The reason for that decision was that neither id nor type is actually required in mmda, so we moved them off the top level but kept the aliases for backward compatibility. So, when converting to a pydantic object that conforms to SPP APIs, we have to remove them from metadata.
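
A minimal sketch of that removal, assuming the simple Annotation pydantic model sketched above (to_api_annotation is a hypothetical helper, not an existing mmda function):

from typing import Any, Dict

def to_api_annotation(mmda_annotation) -> "Annotation":
    meta: Dict[str, Any] = dict(mmda_annotation.metadata)
    return Annotation(
        id=meta.pop("id", None),      # lift the aliased fields out of metadata...
        type=meta.pop("type", None),
        attributes=meta,              # ...and pass everything else through as attributes
    )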

yoganandc commented 1 year ago

> So, when converting to a pydantic object that conforms to SPP APIs, we have to remove them from metadata.

👍 A little more logic, but it still fits the overall structure I'm proposing, right?

cmwilhelm commented 1 year ago

@yoganandc The complexity here, I think, is @soldni responding to my direct request to have some sort of documented schema for however a given model uses metadata. If model authors are going to dump their principal inference data into it, we need a way to specify the shape of that data; otherwise no one will be able to make sense of it later.
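
One possible shape for such a schema (purely illustrative; PredictorAttributes and BibEntryAttributes are hypothetical names, not an agreed-upon design):

from typing import Optional
from pydantic import BaseModel, Field

class PredictorAttributes(BaseModel):
    """Base class; each predictor subclasses it to declare the fields it emits."""

class BibEntryAttributes(PredictorAttributes):
    title: Optional[str] = None
    doi: Optional[str] = None
    year: Optional[int] = None

class Annotation(BaseModel):
    id: Optional[str] = None
    type: Optional[str] = None
    attributes: PredictorAttributes = Field(default_factory=PredictorAttributes)

# A bib-parsing predictor would then return annotations whose attributes
# shape is documented by its own subclass:
ann = Annotation(
    id="sg-1",
    type="bib_entry",
    attributes=BibEntryAttributes(title="Some Paper", year=2017),
)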