equinor / fmu-dataio

FMU data standard and data export with rich metadata in the FMU context
https://fmu-dataio.readthedocs.io/en/latest/
Apache License 2.0

DOC: Agree on and describe usage of key metadata fields #794

Open · perolavsvendsen opened 1 month ago

perolavsvendsen commented 1 month ago

(This links to the larger topic of both developing the data model and documenting/communicating it.)

While work is ongoing to limit the amount of freedom given during data export, some degree of freedom will most likely remain, particularly for custom/non-standard data. This means that in some instances we will still have to deal with fields being used both properly and improperly.

I think the documentation should contain a description/definition of the most important (free-text) fields and how they are to be used.

We are now seeing several examples of these fields being used differently between applications and contexts, which is becoming a problem (and will become a major one down the road).

Examples:

| Context 1 | Context 2 | Context 3 |
| --- | --- | --- |
| In the provided examples for export of horizons extracted from the structural model (the main structural prediction of the model), `data.name` is used for the horizon name (links to masterdata) and `data.tagname` for larger categories of export, e.g. `"extract_from_structuralmodel"` or similar. | In the Sumo aggregation service, when aggregating `parameters.txt` (for faster reading), `data.name` is `"parameters"` for all files, while `data.tagname` is each individual parameter, resulting in 500+ unique tagnames for a case. | In export of various seismic cubes from Drogon, `data.name` is always `"seismic"`, while `data.tagname` is a composite of cube type and vertical domain, e.g. `"amplitude_depth"`, `"relai_depth"`. |
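
To make the divergence concrete, here is a minimal sketch of the `data.name` / `data.tagname` pairs each context produces. The horizon and parameter names are hypothetical; only the pattern matters:

```python
# Illustrative metadata fragments only; the surrounding structure is elided.

# Context 1: structural model horizon export.
# data.name carries the horizon name (links to masterdata).
context_1 = {"data": {"name": "TopVolantis", "tagname": "extract_from_structuralmodel"}}

# Context 2: Sumo aggregation of parameters.txt.
# data.name is constant; data.tagname carries each parameter name,
# giving 500+ unique tagnames per case. "KVKH_CHANNEL" is hypothetical.
context_2 = {"data": {"name": "parameters", "tagname": "KVKH_CHANNEL"}}

# Context 3: Drogon seismic cube export.
# data.name is constant; data.tagname is a composite of cube type and
# vertical domain.
context_3 = {"data": {"name": "seismic", "tagname": "amplitude_depth"}}
```

Three different kinds of semantics are being routed through the same two keys, so a consumer cannot interpret `data.tagname` without knowing which context produced the object.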

It is likely that some of this variation exists for backwards-compatibility reasons, and because `data.name` and `data.tagname` are used in the filename on disk in addition to the metadata. However, we should take care not to inherit these things. These patterns will lead to applications putting very different logic on the same metadata field depending on the context. At some point these contexts will grow together, and this will be very chaotic.

I think a possible "fix" for this would be to clearly describe how the most critical metadata fields are intended to be used. This is not clear from the names alone ("tagname" means nothing, "name" is ambiguous).
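
As a sketch of what such definitions could look like if embedded directly where the schema is generated (the descriptions below are hypothetical placeholders, not the actual fmu-dataio model):

```python
from pydantic import BaseModel, Field

class Data(BaseModel):
    """Hypothetical excerpt of the data block, for illustration only."""

    name: str = Field(
        description=(
            "Name of the data object itself, e.g. a horizon name. "
            "Should link to masterdata where applicable."
        )
    )
    tagname: str | None = Field(
        default=None,
        description=(
            "Short additional identifier. Must not encode information "
            "that belongs in other fields (content, vertical domain, etc.)."
        ),
    )
```

Field descriptions defined this way are carried into the generated JSON schema, so the documentation and the schema stay in sync by construction.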

[needs discussion and refinement]

mferrera commented 1 month ago

Based on discussion around simplified exports, we'd like to NOT expose the ability for users to set these fields via simplified exports, if possible. This implies we'd need to establish a regime for defaulting them, which in turn makes this issue a dependency for the simplified exports.
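
One possible shape for such a defaulting regime is a per-product lookup that the simplified export functions consult instead of taking user input. A minimal sketch, with purely hypothetical product keys and defaults:

```python
# Hypothetical sketch: field defaults keyed by the product being exported.
_FIELD_DEFAULTS: dict[str, dict[str, str]] = {
    "structure_depth_surface": {"tagname": "extract_from_structuralmodel"},
    "aggregated_parameters": {"name": "parameters"},
}

def resolve_defaults(product: str) -> dict[str, str]:
    """Return the defaulted metadata fields for a given product."""
    try:
        return _FIELD_DEFAULTS[product]
    except KeyError:
        raise ValueError(f"No field defaults defined for product: {product!r}")
```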

Relevant items for this comment:

mferrera commented 1 month ago

Current thought: we should (eventually) get rid of tagname altogether. It being a multipurpose catch-all is, as mentioned, a relic of the filesystem-centric viewpoint. I claim that now it is also obscuring weaknesses in the data model.

A goal could be to describe the data that tagname represents elsewhere in the schema, such that we could construct the tagname from that data rather than have it be a free-text field. If we can do that, we can move forward much more easily.
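
As a minimal sketch of that idea, assuming the relevant information lives in structured fields such as cube type and vertical domain (the field names here mirror the seismic example above and are otherwise hypothetical):

```python
# Hypothetical: derive the tagname deterministically from structured fields
# instead of accepting free text.
def construct_tagname(content: str, vertical_domain: str) -> str:
    """E.g. ("amplitude", "depth") -> "amplitude_depth"."""
    return f"{content}_{vertical_domain}"

assert construct_tagname("amplitude", "depth") == "amplitude_depth"
```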

We are taking a more in-depth survey of these items. A big challenge is still #734. We need to see if it's possible to enumerate property attributes.

This, by extension, means the scope of simplified functions is getting broader. That's not ideal from an implementation perspective, but good in that we are taking these fundamental issues head-on. We could still implement these functions, or a subset of them, using non-finalized iterations of default values for these fields.

cf https://github.com/equinor/atlas/issues/42#issuecomment-2396173422 🔒

perolavsvendsen commented 1 month ago

Yes, sounds like a good approach to me. The fact that we are generating files on disk according to the old standard, in addition to creating rich metadata, seems to be a pretty defining choice. It is important, but it is also a pretty effective mechanism for letting end-users keep all their existing dependencies on data on disk, hence slowing down the whole transition to the API.

I think it's worth discussing at some point whether we should revisit this assumption, i.e. simply stop storing data on disk for some (all?) of the defined "products" when we get there. It needs a proper discussion, and I'm not entirely convinced that we are there yet. But it could be that we will never get there if we don't do something slightly more disruptive than what we are doing now (which is also causing lots of complexity and accumulating brand-new technical debt).