clamsproject / mmif

MultiMedia Interchange Format
Apache License 2.0
5 stars 1 forks source link

MMIF json-scheme must invalidate empty values for required fields #196

Closed keighrim closed 1 year ago

keighrim commented 1 year ago

Because

As the title says, the current json scheme DOES take empty strings as valid values for required fields. This is an expected behavior of the json schema (http://json-schema.org/draft/2020-12/json-schema-validation.html#name-required), but for some fields, I think we should add additional validation rules to reject empty strings. For example (related to #153), I think the app name in view metadata that points to a producer must not b empty.

Done when

We review all of the required fields in the current mmif.json and decide whether to allow empty string values, respectively. For those we decide to disallow empty values, we then add additional gears for validation using http://json-schema.org/draft/2020-12/json-schema-validation.html#name-validation-keywords-for-str.

Additional context

No response

keighrim commented 1 year ago

Here's a list of current required fields, and whether to allow empty values for them.

field allow empty? notes
root.metadata.mmif mmif spec version
root.documents[]
root.views[] 👌
root.views[].metadata.app producer name
root.views[].metadata.contains 👌 oneOf (contains, error, warnings)
root.views[].metadata.warnings oneOf (contains, error, warnings)
root.views[].metadata.error.message oneOf (contains, error, warnings)
root.views[].id
root.views[].annotations[] 👌
root.views[].annotations[].@type
root.views[].annotations[].properties.id
text.@value probably 👌

Any thoughts?

keighrim commented 1 year ago

If root.views[].annotations[] can be empty, then root.views[].metadata.contains can also be empty, I guess. Updated the table.

keighrim commented 1 year ago

@marcverhagen pointed out that source document list shouldn't be allowed to be empty. Also, this change of schema might break some existing data, so we will put off merging #198 until the release of 1.0.0.

marcverhagen commented 1 year ago

I thought we were talking about 2.0.0. But the actual version number is not important, as long as we wait till after the end of our current clams-melon phase.

keighrim commented 1 year ago

I was thinking of a way to resolve https://github.com/clamsproject/mmif-python/issues/205, and realized that if the change proposed in #198 breaks anything, I think it should break early so that we don't need to deal with 20 different broken pieces in the future, instead of dealing with maybe a couple of broken pieces now.

Honestly, though, I think the only piece this will "break" as of now is the "gold" MMIF files in the annotation that do not have app values (for my quick, unsorted thought on how fix those files, see https://github.com/clamsproject/mmif/issues/153#issuecomment-1485513488 ), which are supposed to be input to the NER evaluation code, but as @Angela-Lam hasn't been reported any progress on the NER evaluation code so far, I guess the files are not used anywhere at the moment. So if we don't fix the files and the schema now, we probably need to deal with those files and NER evaluation code in the future, and I don't see any reason to postpone this "bug" fix that will only be snowballing the technical debt.