allenai / mmda

multimodal document analysis
Apache License 2.0
158 stars 18 forks

adding relations #158

Open kyleclo opened 2 years ago

kyleclo commented 2 years ago

This PR substantially extends the library by adding a new Annotation type called Relation. A Relation is a link between two Annotations (e.g. a Citation linked to its Bib Entry). The two linked Annotations are called key and value.

A few things needed to change to support Relations:

Annotation Names

Relations store references to Annotation objects, but we don't want Relation.to_json() to also call .to_json() on those objects. We only want to store minimal identifiers of the key and value, something short like bib_entry-5 or sentence-13. We call these short strings names.

To do this, we added an optional attribute field: str to the Annotation class, which stores the name of the field the Annotation belongs to. It's automatically populated when you run Document.annotate(new_field=list_of_annotations); each of the input annotations will have the new field name stored under .field.

We also added a method name() that returns the name of a particular Annotation object, which is unique at the document level. A name is an instance of a minimal class (AnnotationName) that basically stores .field and .id.

In short, now after you annotate a Document with annotations, you can do stuff like:

doc.tokens[15].name   ==   AnnotationName(field='tokens', id=15)
str(annotation_name)  ==   'tokens-15'
AnnotationName.from_str('tokens-15')  ==  AnnotationName(field='tokens', id=15)
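To make the round trip above concrete, here is a minimal sketch of what such a name class could look like. It is a hypothetical stand-in, not the class from this PR: the frozen dataclass and the choice to split on the last hyphen are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationName:
    """Hypothetical stand-in for the name class described above."""

    field: str
    id: int

    def __str__(self) -> str:
        # Current convention from the PR description: <field>-<integer id>
        return f"{self.field}-{self.id}"

    @classmethod
    def from_str(cls, s: str) -> "AnnotationName":
        # Assumption: split on the LAST hyphen, so a field name that itself
        # contains a hyphen would still parse.
        field, id_ = s.rsplit("-", 1)
        return cls(field=field, id=int(id_))


name = AnnotationName(field="tokens", id=15)
```

Because the string format lives in one place, changing it later only touches __str__ and from_str.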

Lookups based on names

To support reconstructing a Relation object from the names of its key and value, we need the ability to look up the Annotations involved. We introduce a new method to enable this:

annotation_name = AnnotationName.from_str('paragraphs-99')
a = document.locate_annotation(annotation_name)  # returns the specific Annotation object
assert a.id == 99
assert a.field == 'paragraphs'
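As a rough illustration of how such a lookup could work, here is a self-contained sketch. The Document, Annotation and AnnotationName classes below are simplified, hypothetical stand-ins (the internal _fields dict and the linear scan are assumptions), not the library's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AnnotationName:
    field: str
    id: int

    @classmethod
    def from_str(cls, s: str) -> "AnnotationName":
        field, id_ = s.rsplit("-", 1)
        return cls(field=field, id=int(id_))


@dataclass
class Annotation:
    id: int
    field: Optional[str] = None  # filled in by Document.annotate()


class Document:
    def __init__(self) -> None:
        # Assumption: annotations are stored per field name
        self._fields: Dict[str, List[Annotation]] = {}

    def annotate(self, **fields: List[Annotation]) -> None:
        for field_name, annotations in fields.items():
            for annotation in annotations:
                annotation.field = field_name  # record which field it belongs to
            self._fields[field_name] = list(annotations)

    def locate_annotation(self, name: AnnotationName) -> Annotation:
        # Find the annotation in the named field whose id matches
        for annotation in self._fields.get(name.field, []):
            if annotation.id == name.id:
                return annotation
        raise KeyError(f"No annotation named {name.field}-{name.id}")


document = Document()
document.annotate(paragraphs=[Annotation(id=i) for i in range(100)])
a = document.locate_annotation(AnnotationName.from_str("paragraphs-99"))
```

The lookup only succeeds if the Document has already been annotated with the field in question.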

To and from JSON

Finally, we need some way to serialize to JSON and reconstruct from JSON. For serialization, now that we have Names, the JSON is quite minimal:

{'key': <name_of_key>, 'value': <name_of_value>, ...other stuff that all Annotation objects have,  like Metadata...}

Reconstructing a Relation from JSON is trickier because a Relation is meaningless without a Document object. The Document must also already store the Annotations that the Relation refers to, so that the lookup based on these Names can succeed.

The API for this is similar, but you must also pass in the Document object:

relation = Relation.from_json(my_relation_dict, my_document_containing_necessary_fields)
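The round trip can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: names are plain strings here rather than AnnotationName objects, the Document is a toy lookup table, and the other fields that all Annotations carry (such as metadata) are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotation:
    field: str
    id: int

    @property
    def name(self) -> str:
        # Simplification: the name is a plain '<field>-<id>' string
        return f"{self.field}-{self.id}"


class Document:
    """Toy document that can look annotations up by name (hypothetical)."""

    def __init__(self, annotations: List[Annotation]) -> None:
        self._by_name = {a.name: a for a in annotations}

    def locate_annotation(self, name: str) -> Annotation:
        return self._by_name[name]


@dataclass
class Relation:
    key: Annotation
    value: Annotation

    def to_json(self) -> Dict[str, str]:
        # Store only the names, not the serialized key/value annotations
        return {"key": self.key.name, "value": self.value.name}

    @classmethod
    def from_json(cls, relation_dict: Dict[str, str], doc: Document) -> "Relation":
        # The Document is required to resolve names back into objects
        return cls(
            key=doc.locate_annotation(relation_dict["key"]),
            value=doc.locate_annotation(relation_dict["value"]),
        )


citation = Annotation(field="citations", id=3)
bib_entry = Annotation(field="bib_entries", id=5)
doc = Document([citation, bib_entry])

relation_dict = Relation(key=citation, value=bib_entry).to_json()
restored = Relation.from_json(relation_dict, doc)
```

Note that the restored Relation points at the same Annotation objects held by the Document, rather than at copies.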

kyleclo commented 1 year ago

@soldni

The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?

Without the class, we would need to code somewhere how IDs are constructed in the library. For now, it's field_name - integer_id, but it's possible in the future this will need to be extended.

We also need some way to parse this ID so we can look up that specific element within a Document. I don't want field, id = obj.split('-') scattered throughout the code, since that gets hard to maintain if we ever change the format. The class gives us .field and .id to use here instead.

soldni commented 1 year ago

@kyleclo Sounds good! added two small suggestions to improve it, but otherwise ok to merge!