equinor / fmu-dataio

FMU data standard and data export with rich metadata in the FMU context
https://fmu-dataio.readthedocs.io/en/latest/
Apache License 2.0

Basic support for using Everest #408

Open perolavsvendsen opened 7 months ago

perolavsvendsen commented 7 months ago

As a user of Everest, I would like to have the data produced made available through an API (i.e. Sumo), so that I can use it in any application, anywhere.

Stub, to be further described.

@roliveira, @tup1985

tup1985 commented 7 months ago
  1. Creating a provider tailored to Everest:

    • [ ] track the folder structure of the simulation output (case -> batch -> geo realization -> simulation). Note: various optimization outputs are stored inside the simulation folder, alongside the Eclipse simulation folder.
    • [ ] track the optimization output (gradients and objective function values)
    • [ ] track case metadata (random seed, perturbation magnitudes of specific control parameters, optimization algorithm, optimization algorithm options, perturbation number, forward models/jobs, etc.). Note: case metadata is available from the Everest config file. Note: this should include run-specific inputs relevant to the jobs (i.e. the pricing file).
  2. Expanding metadata definitions to accommodate Everest data:

    • [ ] account for the various entities mentioned above
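
As a rough illustration of the folder-tracking item above, here is a minimal Python sketch that recovers the case -> batch -> geo realization -> simulation levels from an output path. The folder names (`batch_<n>`, `geo_realization_<n>`, `simulation_<n>`), the `EverestRunContext` type and the `parse_run_context` helper are all assumptions made for this example; they are not part of fmu-dataio, and the pattern would need adjusting to the actual Everest output layout.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: hypothetical folder naming, with the case folder directly
# above the batch folder. Adjust to the real Everest output layout.
_RUN_PATH = re.compile(
    r"(?P<case>[^/]+)"
    r"/batch_(?P<batch>\d+)"
    r"/geo_realization_(?P<realization>\d+)"
    r"/simulation_(?P<simulation>\d+)"
)


class EverestRunContext(NamedTuple):
    """Hypothetical container for the four levels named in the list above."""

    case: str
    batch: int
    geo_realization: int
    simulation: int


def parse_run_context(path: str) -> Optional[EverestRunContext]:
    """Return the run context encoded in a POSIX-style path, or None if absent."""
    match = _RUN_PATH.search(path)
    if match is None:
        return None
    return EverestRunContext(
        case=match["case"],
        batch=int(match["batch"]),
        geo_realization=int(match["realization"]),
        simulation=int(match["simulation"]),
    )
```

A provider could call something like this on the current working directory of a forward-model job to decide which levels to write into the metadata.
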
perolavsvendsen commented 7 months ago

Starting with the data definitions here (come back to the actual coding of this later?):

Perhaps we add the fmu.simulation tag. This would enable us to identify data objects across multiple simulations within the same realization. This may also make sense for non-Everest workflows, which frequently also have more than one simulation. Today this is cumbersome, since we are left with essentially using data.name for identifying these.

The fmu block in the metadata gives the FMU context to produced data objects, e.g. which realization/iteration they belong to. This is what currently does not expand to the Everest use case.

The current "pattern" inside the fmu metadata block looks something like this:

fmu:
  model
  case
  iteration
  realization | aggregation

(simplified example, each block will expand further with more information, see examples.)

...and the presence/absence of these indicates which context a data object exists in. Examples:

A data object produced inside a specific realization will have:

fmu:
  model
  case
  iteration
  realization

A data object produced across all realizations in an iteration:

fmu:
  model
  case
  iteration

A data object on "case" level (not belonging to a specific iteration or realization, e.g. pre-processed data):

fmu:
  model
  case

...and so on.
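
The presence/absence rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fmu_context_level` is a hypothetical helper, not an fmu-dataio function, and it treats the `fmu` block as a plain dictionary whose keys are the block names listed in the pattern.

```python
from typing import Any, Mapping

# Most specific level first; "aggregation" is the alternative to "realization".
_LEVELS = ("realization", "aggregation", "iteration", "case")


def fmu_context_level(fmu: Mapping[str, Any]) -> str:
    """Return the most specific context level present in an fmu block."""
    for level in _LEVELS:
        if level in fmu:
            return level
    raise ValueError("fmu block contains no known context level")


# A data object produced inside a specific realization
in_realization = {"model": {}, "case": {}, "iteration": {}, "realization": {}}
# A data object on "case" level, e.g. pre-processed data
on_case = {"model": {}, "case": {}}

print(fmu_context_level(in_realization))
print(fmu_context_level(on_case))
```
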


Following this logic, adding a "simulation" tag:

fmu:
  model
  case
  iteration
  realization
  simulation:
    name: My Simulation
    id: N/A
    uuid: <uuid4>
    simulator: My Simulator
    etc: etc
    etc: etc

@daniel-sol will it make sense from a SIM2SUMO perspective to populate fmu.simulation? I guess when reading simulation data, it will be useful to have something more tangible than just data.name when identifying e.g. SMRY-data from more than 1 simulation per realization. For instance DST-runs that run in addition to the main simulation.

daniel-sol commented 7 months ago

@perolavsvendsen: Yes, for SIM2SUMO, or for any other file produced at a given level, I think the fmu.simulation tag would make sense. I guess the idea is that we are separating this from a conventional FMU run with the provider. But is that expressed anywhere in the metadata? I don't think it would be ideal if you had to guess, from the mere presence of an fmu.simulation tag, that you are now in an Everest context. How will this be expressed?

When it comes to sim2sumo and the data.name tag, this is derived directly from the name of the data file for a reservoir simulator run, so you would automatically get that anyway. The way it is set up, it removes the realization number, which the current convention includes in the data file name. This means the unique separator between objects from different realizations is fmu.realization.id. So, for distinguishing between several perturbations within the same realization in an Everest context, fmu.simulation would be the unique identifier.
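
To make the identification point concrete, here is a small Python sketch that groups metadata records by realization and simulation instead of relying on data.name. It assumes the proposed fmu.simulation block exists and carries a `name` key, as discussed in this thread; the `group_by_simulation` helper and the sample records are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Metadata = Dict[str, Any]


def group_by_simulation(
    objects: Iterable[Metadata],
) -> Dict[Tuple[int, str], List[Metadata]]:
    """Group metadata records by (fmu.realization.id, fmu.simulation.name)."""
    groups: Dict[Tuple[int, str], List[Metadata]] = defaultdict(list)
    for meta in objects:
        fmu = meta["fmu"]
        # ASSUMPTION: fmu.simulation is the tag proposed in this issue.
        key = (fmu["realization"]["id"], fmu["simulation"]["name"])
        groups[key].append(meta)
    return dict(groups)


# Two objects share data.name but come from different simulations
# within the same realization (illustrative values only).
records = [
    {"data": {"name": "summary"}, "fmu": {"realization": {"id": 0}, "simulation": {"name": "main"}}},
    {"data": {"name": "summary"}, "fmu": {"realization": {"id": 0}, "simulation": {"name": "dst"}}},
]
groups = group_by_simulation(records)
```

With data.name alone these two records would be indistinguishable; keyed on the simulation they fall into separate groups.
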

perolavsvendsen commented 7 months ago

The fmu.simulation tag would be populated irrespective of conventional FMU or Everest. It would allow us to get away from using data.name as a de facto identifier, which seems very wobbly.

The idea here would be to populate fmu.simulation and start using that to identify a specific simulation within a realization. And this would hopefully also scale nicely to the Everest use case.

data.name can remain as is, but I think we should avoid using it for logic.

I would suggest following the same convention as we have done for the other tags under fmu:

fmu:
  simulation:
    name: MySimulation
    uuid: [hash of something, e.g. realization.uuid + simulation.name]

Then we would have a unique ID for the simulation, instead of assuming that the name is the identifier. (I can easily break that by exporting something else with the same name?)
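
One possible way to derive such an ID, sketched in Python. This uses the standard library's name-based `uuid.uuid5` (SHA-1), with the realization uuid as namespace and the simulation name as the name. That is a stand-in for "hash of something", not necessarily how fmu-dataio derives its other uuids, and `simulation_uuid` is a hypothetical helper.

```python
import uuid


def simulation_uuid(realization_uuid: str, simulation_name: str) -> str:
    """Derive a deterministic uuid from realization.uuid + simulation.name."""
    # ASSUMPTION: the realization uuid is a valid uuid string and is used
    # as the namespace for a name-based (version 5) uuid.
    namespace = uuid.UUID(realization_uuid)
    return str(uuid.uuid5(namespace, simulation_name))


# Illustrative realization uuid only.
realization = "00000000-0000-4000-8000-000000000001"
main_id = simulation_uuid(realization, "MySimulation")
dst_id = simulation_uuid(realization, "MyDstRun")
```

The same realization and name always give the same uuid, so re-exporting yields a stable identifier. Note that this alone does not prevent a different export from reusing the same simulation name within a realization; it only moves the identity from data.name to a dedicated tag.
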