output MMIF format

keighrim commented 5 months ago

This thread is for discussing the output representation of R-F bindings in MMIF syntax and vocab.

wricketts commented 5 months ago

@keighrim - I had some floating questions about RFB and general MMIF structure:

1. When calling mmif[<annotation_id>] on a mmif object, I noticed that it only works if it's the annotation's .long_id. Using the regular .id gives a KeyError. Is this intentional?
2. Maybe I'm answering my own question, but I also noticed that the .id number is not globally unique, only unique within a view. As in, v_1 and v_2 can both have a TextDocument td_1 in them. Is there an implicit assumption that this td_1 should correspond to the same document across views? If so, are there any built-in enforcements/guards for that assumption, or are CLAMS app developers supposed to write logic that complies with it?
3. The RFB is implemented to return an empty CSV if no roles/fillers are identified in the input (due to noise), or if the parser fails. @haydenmccormick suggested that we add a runtime parameter to control whether or not the app should generate an annotation when the CSV content is empty. I thought this sounded reasonable, but had a concern related to the two questions above.

If, for example, docTR's td_1 was too noisy and the user opts to have RFB omit empty CSVs, then it's possible that RFB's td_1 could correspond to docTR's td_2 (or a higher number), which is not super intuitive. Ultimately, the number mismatch won't prevent us from tracing the relation because we have alignments, but it could be less "user-friendly" not to have a global 1-to-1 mapping between id and document.

Do we care about this?
If we do, should we leave out this runtime param and have RFB always return an annotation for each OCR TextDocument?
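
The numbering drift in the scenario above can be sketched with a toy model. This is plain Python, not the mmif-python SDK: `bind_roles`, the `skip_empty` flag (a stand-in name for the proposed runtime parameter), the dict layout, and the sample CSV text are all made up for illustration.

```python
# Toy model of the numbering drift described above. Plain Python only;
# nothing here is the mmif-python API, and `skip_empty` is a hypothetical
# name for the proposed runtime parameter.

def bind_roles(ocr_docs, skip_empty):
    """Return (documents, alignments) for a pretend RFB view.

    ocr_docs maps OCR TextDocument long ids (e.g. "v_1:td_1") to CSV text;
    an empty string stands for a noisy input or a parser failure.
    """
    documents = {}   # RFB-local id -> CSV content
    alignments = []  # (source long id, RFB-local id)
    for source_id, csv_text in ocr_docs.items():
        if skip_empty and not csv_text:
            continue  # omit the annotation entirely
        local_id = f"td_{len(documents) + 1}"
        documents[local_id] = csv_text
        alignments.append((source_id, local_id))
    return documents, alignments


# Made-up sample input: the first OCR document yields nothing.
ocr_docs = {
    "v_1:td_1": "",
    "v_1:td_2": "Role,Filler\nDirector,Jane Doe",
}

docs, aligns = bind_roles(ocr_docs, skip_empty=True)
print(aligns)  # [('v_1:td_2', 'td_1')]
```

With `skip_empty=True` the RFB-side td_1 points back to the OCR-side td_2, so the numbers no longer line up, but the alignment pair still records which source document each output came from.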

keighrim commented 5 months ago

Regarding q1, I have started a new issue (https://github.com/clamsproject/mmif/issues/228) to make id unambiguous. The problem is that once we start to force long_id everywhere, that will break future apps on past MMIFs (or past apps that generate past MMIFs).

You are right that annotation ids without a view-id prefix are implicitly "scoped" to the view they reside in. That said, an annotation id can be reused to refer to different objects as long as the "scope" is different. Thus, having v1:td2 aligned to v2:td1 is totally fine and we don't care.
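
As a rough illustration of the scoping described above, here is a minimal sketch using plain dicts. It is not the mmif-python implementation; `resolve`, the `views` layout, and the placeholder strings are hypothetical. It only shows why a short id needs a view to be meaningful while a long id does not.

```python
# Toy model of view-scoped annotation ids. Plain dicts only; this is not
# the mmif-python implementation, and `resolve` is a hypothetical helper.

views = {
    "v1": {"td1": "OCR text A", "td2": "OCR text B"},
    "v2": {"td1": "CSV derived from v1:td2"},
}


def resolve(views, annotation_id, current_view=None):
    """Look up an annotation by long id ("view:id") or by short id.

    A short id is only meaningful relative to `current_view`, which plays
    the role of the implicit scope.
    """
    if ":" in annotation_id:
        view_id, short_id = annotation_id.split(":", 1)
    elif current_view is not None:
        view_id, short_id = current_view, annotation_id
    else:
        raise KeyError(f"{annotation_id!r} is ambiguous without a view scope")
    return views[view_id][short_id]


print(resolve(views, "v1:td1"))                  # long id: unambiguous
print(resolve(views, "td1", current_view="v2"))  # short id within v2's scope
```

Here "td1" names two different objects depending on the view, which is the reuse the comment above calls totally fine; only the long form identifies one object on its own.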

All that put together, I don't think it's a good idea to produce an "empty" text document when the RFB parsing fails. It doesn't add any information, while adding space and time complexity to handling the MMIF outputs (storage-wise and json.load-wise).

clamsproject / app-role-filler-binder

output MMIF format #1