Mismatch between pydantic model used for fmu.context and FmuContext enum

tnatt commented 6 months ago

fmu.context.stage generated by ExportData.generate_metadata() can have all of these enums:

https://github.com/equinor/fmu-dataio/blob/7cce4160afb3e75d1ab3655a07bd6926ba370ddc/src/fmu/dataio/_definitions.py#L81-L88

But pydantic model used for schema only supports these Literals: https://github.com/equinor/fmu-dataio/blob/7cce4160afb3e75d1ab3655a07bd6926ba370ddc/src/fmu/dataio/datastructure/meta/meta.py#L327-L334

Hence users will get a validation error when e.g. using preprocessed or case_symlink_realization, like in the script below

edata = dataio.ExportData(
        config=rmsglobalconfig,  # read from global config
        fmu_context="case_symlink_realization",
        name="mymap",
        content="depth",
        is_observation=True,
    )

    filepath = edata.export(regsurf, return_symlink=True)

    myfile = Path(filepath)
    meta_file = myfile.parent / f".{myfile.name}.yml"
    with open(meta_file) as f:
        Root.model_validate(yaml.safe_load(f))

pydantic_core._pydantic_core.ValidationError: 1 validation error for Root
surface.fmu.context.stage
  Input should be 'case', 'iteration' or 'realization' [type=literal_error, input_value='case_symlink_realization', input_type=str]

tnatt commented 6 months ago

What is the desired behavior here @perolavsvendsen and @jcrivenaes ? Keep the schema definition as is and add pydantic model_validators to convert the input to a valid literal that makes sense? Or update allowed literals in the pydantic model?

And of course there might be something that I don't understand fully, as this preprocessed and case_symlink_realization is still a bit fuzzy to me 😄 maybe we don't use these fields when we come to the validation part for these types.. So feel free to fill in my gaps 🙂

perolavsvendsen commented 6 months ago

Perhaps natural, but the schema matches what is actually being used. fmu.context.stage across all data in Sumo (prod) right now:

        {
          "key": "iteration",
          "doc_count": 9645097
        },
        {
          "key": "realization",
          "doc_count": 3123830
        },
        {
          "key": "case",
          "doc_count": 36206
        }

(Answer returned in 0.15 seconds, try to do that on /scratch... 😎)

I guess the preprocess jobs e.g. in Drogon currently use fmu.context.stage: case?

The point of the attribute is to flag where in the workflow a certain data has been generated. But I guess it can be questioned if it is needed or not. Also, relative terms (like "preprocessed") is tricky (pre what?) (a post-process can be a pre-process and vice versa).

This information is also implicitly available through the presence of other tags under fmu, e.g.:

Stage: case

fmu:
  case: ...

Stage: iteration

  fmu:
   case: ...
   iteration: ...

Stage: realization

fmu:
  case: ...
  iteration: ...
  realization: ...

So there is an implied "hierarchy" here: case -> iteration -> realization

Perhaps we for now simply keep this in synch with the presence of fmu.case (always present), fmu.iteration and fmu.realization?

Something like:	stage	`fmu.context.stage`	`fmu` contains
`case`	"case"	`case`	Data created on case level - not part of a specific iteration or realization.
`iteration`	"iteration"	`case`, `iteration`	Data created on iteration (ensemble) level, not part of a specific realization. Example: An aggregated surface
`realization`	"realization"	`case`, `iteration`, `realization`	Data created on realization level (the "standard" data export)

perolavsvendsen commented 6 months ago

Relevant: https://github.com/equinor/fmu-dataio/issues/293

tnatt commented 6 months ago

Ok, investigated a bit to see how the magic happens 🙂 and seems that this preprocessed in the metadata file will be overridden to "case" by the ERT Workflow that does SUMO_UPLOAD on preprocessed data https://github.com/equinor/fmu-drogon/blob/main/ert/bin/scripts/wf_copy_preprocessed_dataio.py. And the glue_metadata_preprocessed function in dataio takes care of updating the necessary metadata https://github.com/equinor/fmu-dataio/blob/7cce4160afb3e75d1ab3655a07bd6926ba370ddc/src/fmu/dataio/_utils.py#L422-L438

Anyway, I support not touching the allowed literals in the pydantic model. They are very clear in their definitions 👍 But we should maybe change it to having a model_validator setting this fmu.context.stage based on the presence of the other fields case, iteration, realization... which can ensure more robustness between the given fmu.context.stage and the rest.

For the fmu.context used inside internally in dataio I'll create a new pydantic model for that one for now and place it in datastructure._internal since it has more allowed fields.

Also I will investigate how the fmu_context=case_symlink_realization is treated, it might also be handled in a special way in dataio and its fmu_context overridden to a standard fmu.context that the schema allows 🙂

equinor / fmu-dataio

Mismatch between pydantic model used for fmu.context and FmuContext enum #544