IEA-Task-43 / digital_wra_data_standard

IEA Task 43: pre-construction energy estimate data standard repository
BSD 3-Clause "New" or "Revised" License

[SCHEMA] - include a VMM or reanalysis measurement location type? #214

Closed stephenholleran closed 5 months ago

stephenholleran commented 1 year ago

The conversation below took place at the February 2022 users workshop.

image

Steve Clark didn't create an issue so here it is.

I think the data model can easily handle reanalysis or VMM data. A measurement location type might be useful to distinguish from met masts, lidars, etc.?

abohara commented 1 year ago

@stephenholleran Agreed. Is a virtual_met_mast measurement location type sufficient ?

kersting commented 1 year ago

@stephenholleran and @abohara I think that this is a complex problem and needs its own data schema. The metadata related to reanalysis data is complex. Here are just some examples related to that complexity.

  1. Reanalysis is time zone dependent. One could use logger main config to capture that but it does not seem to be the best place to do this.
  2. Reanalysis data can have different types. For example, it can be raw ERA-5 or downscaled ERA-5. The time series without the assumptions of the modeling will be incomplete. It is also important to know who is the organization that created the time series which does not fit in the plant table as it is because maybe company X is the owner but company Y created the series.

There are other complications but this is just a short example of the difficulties to try to merge that into the current schema.

Another thing to keep in mind is the fact that netCDF is a popular format in this environment and it already carries a good amount of metadata. Can we use that data format to our advantage to come up with a concise way of having a complete metadata schema along with the time series?

If this is really necessary perhaps it is time for us to bring the main providers of these datasets and have a discussion about it.

abohara commented 1 year ago

@kersting Thanks for your comments and perspective.

My thought was that if we did receive a VMM csv (or other tabular data) then the data model could be used to describe basic information about each column (e.g. wind speed at 80m) and pass on that very basic set of information. However, I agree with you that we may need another table (or many tables) specifically to describe the more detailed "metadata" around reanalysis.

Is there a pre-existing structure that you consistently see in other files that we may be able to use as a template?

stephenholleran commented 1 year ago

Hi @abohara , @kersting,

Fair points @kersting.

As the WRA Data Model stands, and with the exception of a new measurement_station_type_id value, the example below shows how a VMM with a wind speed and a wind direction could be incorporated.

To answer your points @kersting I think it is fair enough to use the logger_main_config to capture the timezone offset. I agree it is not great, but I think it is fine.

Your other point about the different types (raw ERA5 or downscaled ERA5): I think we could possibly distinguish between the two by incorporating an is_calculated field on the measurement point or something similar. The question of whether a value is calculated or not has come up previously in terms of floating lidar correction on wind speed and on solar measurements. This could be a nice way to solve a few different issues.

Looking at the below example, what else would you like to see captured to describe a VMM?

My feeling is that the below example captures a VMM pretty well from a first cut perspective.

Cheers,


Below is an example VMM according to the WRA Data Model, except for measurement_station_type_id.

{
  "author": "Vortex System",
  "organisation": "Vortex",
  "date": "2023-04-25",
  "version": "1.2.0-2023.01",
  "measurement_location": [
    {
      "name": "Vortex ERA5 downscaled",
      "latitude_ddeg": 53,
      "longitude_ddeg": -5.5,
      "measurement_station_type_id": "virtual_met_mast",
      "logger_main_config": [
        {
          "logger_oem_id": "Other",
          "logger_serial_number": "---",
          "date_from": "2000-01-01T00:00:00",
          "date_to": null,
          "offset_from_utc_hrs": -5
        }
      ],
      "measurement_point": [
        {
          "name": "WS100m",
          "measurement_type_id": "wind_speed",
          "height_m": 100,
          "height_reference_id": "ground_level",
          "logger_measurement_config": [
            {
              "measurement_units_id": "m/s",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WS100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        },
        {
          "name": "WD100m",
          "measurement_type_id": "wind_direction",
          "height_m": 100,
          "logger_measurement_config": [
            {
              "measurement_units_id": "deg",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WD100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
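
As a rough illustration of how a consumer might read a document shaped like the example above, here is a minimal Python sketch. It assumes only the keys shown in the example; the `describe_columns` helper and the trimmed `EXAMPLE` string are hypothetical and not part of the data model or any Task 43 tooling.

```python
import json

# Trimmed copy of the example above: only the keys read below are kept.
EXAMPLE = """
{
  "measurement_location": [
    {
      "name": "Vortex ERA5 downscaled",
      "measurement_station_type_id": "virtual_met_mast",
      "measurement_point": [
        {
          "name": "WS100m",
          "measurement_type_id": "wind_speed",
          "height_m": 100,
          "logger_measurement_config": [
            {"column_name": [{"column_name": "WS100m", "statistic_type_id": "avg"}]}
          ]
        },
        {
          "name": "WD100m",
          "measurement_type_id": "wind_direction",
          "height_m": 100,
          "logger_measurement_config": [
            {"column_name": [{"column_name": "WD100m", "statistic_type_id": "avg"}]}
          ]
        }
      ]
    }
  ]
}
"""


def describe_columns(document):
    """Return one (column, measurement type, height, statistic) tuple per data column."""
    rows = []
    for location in document["measurement_location"]:
        for point in location["measurement_point"]:
            for config in point["logger_measurement_config"]:
                for column in config["column_name"]:
                    rows.append((
                        column["column_name"],
                        point["measurement_type_id"],
                        point["height_m"],
                        column["statistic_type_id"],
                    ))
    return rows


columns = describe_columns(json.loads(EXAMPLE))
for row in columns:
    print(row)
```

The point of the sketch is that a VMM file described this way can be read with exactly the same loop as a met mast or lidar file, which is the main argument for keeping it in the one data model.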

kersting commented 1 year ago

@stephenholleran and @abohara I think that the only downside is putting the VMM settings in the logger_main_config. It is not optimal but it can be done. There are other settings such as timestamp_is_end_of_period and averaging_period_minutes that would need to be incorporated as well and those are doable. The is_calculated field is a nice solution for the downscaled issue. I do think that we can adapt the data model to account for VMM time series by using logger_main_config but I'm not sure it is the best path. For example, think of VMM providers in the industry. Their products are not confined only to time series. There are options for exports of wrg files, extreme gust, icing, etc. Do VMM series belong to that category of products generated by reanalysis data or do VMM series fit better with measured data? Wouldn't it be better to create a data model that describes datasets that are derived from reanalysis data? I don't have an answer for that question but I think it is an important debate for us to have. Perhaps we can discuss this in the next meeting.

stephenholleran commented 1 year ago

Hi @kersting,

I would consider reanalysis data (MERRA-2 and ERA5) and timeseries virtual met mast data at a particular geographic location to fit within our data model.

Other downscaling products like wind maps, wrgs that cover an "area" would be completely different and have a different data model I would expect. (That data model could be used for scanning lidars, thinking out loud.)

I think it is reasonable to expand our data model to include reanalysis and VMM timeseries data. It wouldn't take much. Expanding on my previous suggestion of adding "virtual_met_mast" to measurement_station_type_id, we could distinguish by also including "reanalysis".

I agree the logger_main_config is not the greatest place to capture the details, but I don't think it is that bad either, and with a tutorial it is easily explainable. It is a quick MVP for VMM.

cc @abohara

kersting commented 1 year ago

@stephenholleran I agree that by pushing the boundaries of logger_main_config, and with some additional tweaks, we are able to describe a VMM. I am afraid that we're mixing two very different concepts under the same data model. I'd rather have this in a separate model. I also agree that this is low-hanging fruit, meaning few tweaks are needed to have a brand new type of time series supported by the model. I wonder if it would make sense to survey the resource assessment community at the next workshop and ask whether people would be supportive of mixing those different concepts under the same umbrella. Also, I don't want to make this complicated so I'm flexible in accepting the proposal if a survey would make things too complicated.

oriollacave commented 1 year ago

Hi, I have just a few comments. I do not fully know all the previous agreements and steps but will give my opinion. This is just for time series; spatial grids would need a separate discussion.

We are here defining things as measurement and logger but we are willing to include synthetic data (virtual met mast as you say, which I don't like at all, though it's fine). I would then prefer to refer to it in general as meteorological_data, for example, and add an extra field such as source or generation_type which would take the value measurement or synthetic. I would then define special tags for the type of measurement (sodar, met mast, lidar, satellite, simulation, ...). Each would then have different metadata for details about the equipment/simulation.

The other thing is that for synthetic data it would be interesting to have details about the simulation. We typically include this small but important info in our headers: final resolution, reanalysis source, model used.

We typically also separate variables and heights, so I don't understand why there is WS100m when we are also labeling the height as 100m. It helps to have variables separated for automation purposes. What if we have a variable with multiple heights? Can we set the height to an array-like field?


{
  "author": "Vortex System",
  "organisation": "Vortex",
  "date": "2023-04-25",
  "version": "1.2.0-2023.01",
  "location": [
    {
      "name": "Vortex ERA5 downscaled",
      "latitude_ddeg": 53,
      "longitude_ddeg": -5.5,
      "source": "virtual_met_mast",
      "main_config": [
        {
          "logger_oem_id": "Other",
          "logger_serial_number": "---",
          "date_from": "2000-01-01T00:00:00",
          "date_to": null,
          "offset_from_utc_hrs": -5,
          "model": "WRF",
          "reanalysis": "ERA5",
          "resolution": "3km"
        }
      ],
      "measurement_point": [
        {
          "name": "WS",
          "measurement_type_id": "wind_speed",
          "height_m": [100, 110, 120],
          "height_reference_id": "ground_level",
          "logger_measurement_config": [
            {
              "measurement_units_id": "m/s",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_names": ["WS_100m", "WS_110m", "WS_120m"],
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        },
        {
          "name": "WD",
          "measurement_type_id": "wind_direction",
          "height_m": 100,
          "logger_measurement_config": [
            {
              "measurement_units_id": "deg",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WD_100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

oriollacave commented 1 year ago

Another thing is that this is exclusively for ASCII files, but we are starting to disseminate in netCDF and Zarr formats, which do not have "columns". OK, those files have their own metadata, but having JSON/YAML beforehand can help a lot, for example in reducing querying and/or download times. Maybe it's another thread but I would like to know how this is being treated. Thanks all for your work and time!

kersting commented 1 year ago

@oriollacave excellent remarks so thank you for sharing your opinions. I have some comments about what you posted.

  1. I agree that spatial grids need a separate discussion. I think the spirit here is whether we can use the data model for measurements and extend it to other kinds of data as a quick win. In my opinion this whole subject needs a separate data model but I see how we could quickly adapt our current data model to this reality.
  2. Synthetic data means a lot of things. It could be a time series generated via Markov chain processes, bootstrapping, MCP, etc. Therefore I think virtual met mast is more specific. I'd challenge you to provide a better name that is not as general as synthetic data.
  3. I totally agree with you about the details of the simulation and this is one of the reasons why I think that, for this kind of data, having its own model makes more sense.
  4. In the data model we use measurement points as a list of dictionaries. From my point of view, for measurement data this allows more flexibility, as measurement points may have completely different configurations, which would make it harder to bundle them together. Perhaps for synthetic data, your proposal would work.
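
To make the list-of-dictionaries point concrete, the array-like proposal can be mechanically unrolled into the per-height measurement points the data model already uses. The sketch below is only an illustration: `expand_heights` is a hypothetical helper, and the generated point names (`WS_100m` and so on) are an assumed naming convention, not something the data model prescribes.

```python
def expand_heights(name, measurement_type_id, heights_m, column_names,
                   statistic_type_id="avg"):
    """Turn one multi-height variable into one measurement point per height."""
    if len(heights_m) != len(column_names):
        raise ValueError("need exactly one column name per height")
    points = []
    for height, column in zip(heights_m, column_names):
        points.append({
            # Assumed naming convention for the unrolled points.
            "name": f"{name}_{height}m",
            "measurement_type_id": measurement_type_id,
            "height_m": height,
            "logger_measurement_config": [
                {"column_name": [{"column_name": column,
                                  "statistic_type_id": statistic_type_id}]}
            ],
        })
    return points


points = expand_heights("WS", "wind_speed", [100, 110, 120],
                        ["WS_100m", "WS_110m", "WS_120m"])
for point in points:
    print(point["name"], point["height_m"])
```

Each resulting point can then carry its own configuration, which is the flexibility needed for measured data, while a provider of synthetic data could still generate the points from a compact list of heights.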

@stephenholleran or @abohara anything to add here?

@oriollacave would you be open for a meeting to talk about this topic?

abohara commented 1 year ago

@kersting @oriollacave thanks for the good comments and discussion.

  1. Re: Spatial grid - We are tracking the lat/lon for each point location. Is that insufficient to cover the grid use case for WRA purposes ?

I agree on the need to communicate some basic details about the simulation. I can see it being communicated for the entire VMM generation process or at each measurement point level (or both).

  1. For tracking simulation params / settings: I can see an additional table like logger_main_config, e.g. simulation_main_config, that describes variables related to the overall simulation settings. Fields like final resolution, reanalysis source, model used, etc. could be in this table. I think re-purposing logger_main_config would likely require a lot of modifications that would pull it away from its original purpose of tracking a physical logger. I agree with @oriollacave and @kersting on this.

  2. I am not sure I understand your concerns @kersting about the list vs dict, but perhaps it was based on the misinterpretation of the data model by @oriollacave ?

@oriollacave Your json example in the post is different than what @stephenholleran posted above. His example does address some of your concerns:

@oriollacave I apologize if I misunderstood your intentions & recommendations here.

Overall, I agree with your sentiments, that this may not be the most "perfect" way to communicate the entirety of a "simulation", but the portion that is needed for resource assessment can fit into our model "sufficiently" to meet just the RA needs. For RA end user, having the data in one consistent format regardless of the source in my view has benefits that may outweigh some of the drawbacks ( which mostly seem like naming oddities at the moment ).

stephenholleran commented 1 year ago

Thanks @oriollacave for your great constructive comments. Thanks too @abohara and @kersting.

I'll try and categorize the issues and summarize:

Spatial data: Definitely a different topic for discussion. This thread is just dealing with timeseries data at a particular geographical location.

Separate variables and heights: I don't think this is an issue as @kersting had pointed out that for measured wind speed values there are a lot of other statistic types and data columns all associated with the one wind speed measurement at 100m. Therefore we keep the measurement points separate so everything associated with that measurement point is gathered in the one place. I think my example may have been misleading as there was just the one measurement point.

Source, or as we have it, Measurement Station Type: As @abohara said, we already have a field to capture 'source' and this is 'measurement_station_type_id'. The data model's current list of "measurement_station_type_id" values is mast, lidar, sodar, floating_lidar and solar. The intention is to add reanalysis and virtual_met_mast to distinguish datasets that come from reanalysis and downscaling models. What these are named is up for debate.

How to capture simulation properties: OK, so we need to capture some simulation properties like model_used, final_spatial_resolution, reanalysis_source. As @abohara alluded to, these are probably better in a separate table so as not to clog up the logger_main_config table. This can be in the same vein as the vertical_profiler_properties and mast_properties. I would still use the logger_main_config for timezone and start and end dates.


Before getting into the detail of a simulation properties table, I would first like to confirm that we would all be happy to make these changes to the WRA Data Model to capture reanalysis and downscaled data? I definitely am.

Thanks all!

abohara commented 1 year ago

+1 on

Before getting into the detail of a simulation properties table, I would first like to confirm that we would all be happy to make these changes to the WRA Data Model to capture reanalysis and downscaled data? I definitely am.

oriollacave commented 1 year ago

Hi, I agree with most of this, as I'm also understanding the schema better, so thanks for the positive comments. I agree with moving forward with what @stephenholleran proposed.

Virtual met mast is fine and the better option (I don't like it but it is accepted, well understood and marketing always wins).

Anyway, just a question and two comments/proposals.

Apologies if I'm confused or missing something. I think @stephenholleran can move forward and use whatever from this extra information makes sense.

@oriollacave would you be open for a meeting to talk about this topic?

sure! just let me know a date

stephenholleran commented 1 year ago

Hi @oriollacave,

Sorry for the slow response, it has been a long week with lots of IEA Task and data standards/sharing things going on all at once.

sure! just let me know a date

Our next scheduled meeting is the 22nd but that will be fully discussing the workshop which will be on the 29th. The meeting after that, on the 6th July, will be a debrief of the workshop. The next scheduled meeting would then be the 20th July at 4 pm Irish time. (I can't do the 13th as we have a company day.)

The 20th July is a bit away but it'll probably come around fast enough. Would that be ok for everyone?

stephenholleran commented 7 months ago

Hi @oriollacave, @abohara, @kersting,

I am finally getting around to actually doing something with this.

First off I have added reanalysis and virtual_met_mast to the measurement_station_type enum. I think this is fine and we have concluded on that.

Second thing is the additional table to capture the properties of either reanalysis or virtual met mast configuration. I haven't made any changes to the actual schema yet, I have just created a sample of what it might look like. You can check out this draft PR #246 to see the changes. I'll paste the extra table below for ease of discussion.

    "main_config": [{
      "reanalysis_source": "era5",
      "final_resolution_m": "11000",
      "model_used": "ECMWF IFS",
      "offset_from_utc_hrs": 0,
      "averaging_period_minutes": 60,
      "timestamp_is_end_of_period": false,
      "date_from": "2000-01-01T00:00:00",
      "date_to": null,
      "notes": "This is an ERA5 reanalysis dataset produced by ECMWF.",
      "update_at": "2023-11-24T18:13:00"
    }],

It would be good if we were able to discuss this on a call. The next scheduled one is the 7th Dec at 4pm Irish time. Let me know if you can make it?

Some points for discussion:

  1. I'm not too sure what to call this table. We can be general and call it main_config similar to logger_main_config or we can be more specific and call it model_config or simulation_config?
  2. The 3 new fields are reanalysis_source, final_resolution_m and model_used. Are these the best names? All are optional.
  3. I have duplicated the offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period from the logger_main_config table. This way we just drop that table when it is a reanalysis or VMM. This can be restricted in the JSON Schema too, if we want that?
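
For anyone wondering how the three duplicated time fields would be used in practice, here is a minimal Python sketch of normalising a timestamp to the start of its averaging period in UTC. It assumes that offset_from_utc_hrs means "local time minus UTC" (so UTC = local - offset); `period_start_utc` is a hypothetical helper, not part of the schema or any tooling.

```python
from datetime import datetime, timedelta, timezone


def period_start_utc(timestamp, offset_from_utc_hrs,
                     averaging_period_minutes, timestamp_is_end_of_period):
    """Return the start of the averaging period as a timezone-aware UTC datetime."""
    local = datetime.fromisoformat(timestamp)
    if timestamp_is_end_of_period:
        # The stamp marks the end of the period, so step back one period.
        local -= timedelta(minutes=averaging_period_minutes)
    # Assumption: offset_from_utc_hrs = local time minus UTC.
    utc = local - timedelta(hours=offset_from_utc_hrs)
    return utc.replace(tzinfo=timezone.utc)


# Hourly ERA5-style record stamped at the start of the period, already in UTC.
start_a = period_start_utc("2000-01-01T00:00:00", 0, 60, False)
# Hourly record stamped at the end of the period, five hours behind UTC.
start_b = period_start_utc("2000-01-01T01:00:00", -5, 60, True)
print(start_a.isoformat(), start_b.isoformat())
```

Because the same three fields exist in logger_main_config, the same function would work for measured and modelled data alike, whichever table they are read from.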

Feel free to respond here.

kersting commented 7 months ago

@stephenholleran thanks for looking into this. I'll be in the next meeting on the 7th and we can discuss more. Here are some of my remarks.

oriollacave commented 7 months ago

Looks good to me too. Resolution is not always in m, so expressing it this way might be only approximate. Not a big deal, just so you know. I'm available on the 7th Dec at 4pm Irish time.

As an off-topic aside, are you thinking of adding an md5 or sha hash and filename tags to link to the data? That's something important for us.
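
To illustrate what such a link could look like, here is a minimal Python sketch that computes a SHA-256 digest of a data file and pairs it with the filename. The `data_file` entry and its `filename` and `sha256` keys are purely hypothetical; nothing like this exists in the schema discussed in this thread.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large time series do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a tiny made-up CSV written to a temporary folder.
payload = b"timestamp,WS100m\n2000-01-01T00:00:00,7.5\n"
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "vmm_timeseries.csv")
    with open(path, "wb") as handle:
        handle.write(payload)
    # Hypothetical metadata entry linking the JSON description to the data file.
    data_file = {"filename": os.path.basename(path),
                 "sha256": sha256_of_file(path)}

print(data_file)
```

A consumer could recompute the digest after download and compare it with the stored value to confirm that the metadata and the data file belong together.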

stephenholleran commented 6 months ago

Below is an extract from our Dec 7th, 2023 bi-weekly call. https://github.com/IEA-Task-43/digital_wra_data_standard/discussions/129#discussioncomment-7778423

  1. CJ (Natural Power) shared a document from IEC 61400-15 where they discussed reference datasets, back around 2018 and earlier. They have a definition for VMM. They also list properties that should be included for either a reanalysis or VMM dataset. We pretty much cover the reanalysis properties but not all the VMM properties. These include, e.g., nudging method, PBL schemes, number of nests, domain size, orography, land use, etc. The feeling was that although these are important for comparing downscaling models, these details are generally available on the providers' websites or in scientific papers and do not necessarily need to be included in every single dataset. Similarly for lidar data, there are lots of parameters/technology details that are important for comparing different lidars; however, these are not required by a wind analyst when performing an energy yield assessment and so are not included in the WRA Data Model. Therefore, we will leave these out for now. As we will have a new table to capture model settings (this is the big change) we can easily add any or all of these in the future quite quickly.
  2. On the new table name, provisionally called main_config, it was decided that this may be confusing for a user as it is not specific enough. We already have logger_main_config, which a user knows they need to fill out when they have loggers, but they may also think they need to fill out something in main_config as it is not specific. Therefore, we decided we should be more specific and rename it.
  3. It was decided to rename main_config to model_config as opposed to simulation_config or synthetic_config. Model is more generic for this type of data. An AI generated dataset is not a simulation, so model_config is more flexible for future use in capturing AI generated datasets. The results of the modelling are synthetic data but you don't have a synthetic configuration.
  4. SH to continue the iterations on this and continue the discussion on the GitHub issue log.

stephenholleran commented 5 months ago

Below is an extract from our 18th Jan 2024 call regarding this issue. https://github.com/IEA-Task-43/digital_wra_data_standard/discussions/129#discussioncomment-8172178 Unfortunately not many people turned up so it would be good if we could continue the discussion here and come to some conclusions.

  1. Should we include all available reanalysis datasets or just MERRA-2 and ERA5? Am I missing any? https://github.com/IEA-Task-43/digital_wra_data_standard/pull/246/files#diff-da1380042a0574c16546fc2a08f9f6ff97b3817ff19bf0b88ef0e846101a2824R1023 Keep the full list so it is complete.
  2. I think reanalysis_source should just be reanalysis as it is not a "source" when the data is an actual reanalysis dataset. The definition would still include "source". "The name of the reanalysis dataset that is the source or the result of this model." No consensus. Post question in issue log.
  3. For final_resolution_m what does a user put in when this changes between the equator and the poles, e.g. for a reanalysis dataset? It also changes between N-S and E-W.
  4. Do we need the word final? Is it not obvious that the resolution we are talking about is the resolution of the model that actually resulted in the dataset that is being described in the implemented data model? I would be inclined to put the word horizontal in there instead. No decision on the word 'final'. grid_resolution may be a better term, or even horizontal_grid_resolution? Being specific that it is a horizontal resolution would be preferred.
  5. Units for final_resolution_m: 'm', 'km' or 'decimal degrees'? 'm' might make more sense for a modelled dataset but for reanalysis this could be left out.
  6. Definition: "The final horizontal resolution, in meters, of the model used to create this dataset. If a reanalysis dataset this can be an approximation or left out."
  7. I think offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period should all be definitions as they are now duplicated in both the 'model_config' and 'logger_main_config' tables? Yes, this is good.
  8. I think it is possible in the schema to limit using only one of these tables. That is, a user can use either the 'model_config' or the 'logger_main_config' but not both. Do we want to implement this limit? Not critical but would be useful. It could be included in future. If not implemented this time around, add an issue to track it.
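
The either/or rule from the last point could be prototyped outside the JSON Schema before deciding how to encode it there. Below is a minimal Python sketch under stated assumptions: modelled station types (reanalysis, virtual_met_mast) are expected to use model_config and everything else logger_main_config. `MODELLED_TYPES` and `config_table_errors` are hypothetical names, and the error messages are made up for illustration.

```python
# Station types that describe modelled rather than measured data (assumed set).
MODELLED_TYPES = {"reanalysis", "virtual_met_mast"}


def config_table_errors(location):
    """Return a list of problems with the config tables of one measurement location."""
    modelled = location.get("measurement_station_type_id") in MODELLED_TYPES
    if modelled:
        expected, other = "model_config", "logger_main_config"
    else:
        expected, other = "logger_main_config", "model_config"
    errors = []
    if not location.get(expected):
        errors.append(f"{expected} is missing")
    if location.get(other):
        errors.append(f"{other} should not be present")
    return errors


ok_location = {
    "measurement_station_type_id": "reanalysis",
    "model_config": [{"offset_from_utc_hrs": 0}],
}
both_tables = dict(ok_location, logger_main_config=[{"offset_from_utc_hrs": 0}])
print(config_table_errors(ok_location))
print(config_table_errors(both_tables))
```

If the rule proves useful, the same logic could later be expressed in the schema itself, for example with a conditional on measurement_station_type_id, and the tracking issue mentioned above would cover that work.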

@oriollacave it would be great to get your input here especially!! cc @abohara, @kersting

stephenholleran commented 5 months ago

Hey @oriollacave, would love to get your input on the above queries? I can have a call with you directly if you prefer? Let me know. Thanks,

oriollacave commented 5 months ago

HI! This is clearly on my TODO list in urgent folder, just below SUPER URGENT. I will work on it beginning next week.

oriollacave commented 5 months ago

My OPINIONS below.

Below is an extract from our 18th Jan 2024 call regarding this issue. #129 (comment) Unfortunately not many people turned up so it would be good if we could continue the discussion here and come to some conclusions.

  1. Should we include all available reanalysis datasets or just MERRA-2 and ERA5? Am I missing any? https://github.com/IEA-Task-43/digital_wra_data_standard/pull/246/files#diff-da1380042a0574c16546fc2a08f9f6ff97b3817ff19bf0b88ef0e846101a2824R1023 Keep the full list so it is complete.

This looks good for me. We know this will change with time.

  2. I think reanalysis_source should just be reanalysis as it is not a "source" when the data is an actual reanalysis dataset. The definition would still include "source". "The name of the reanalysis dataset that is the source or the result of this model." No consensus. Post question in issue log.

reanalysis alone is understood. Otherwise we would have to set initial conditions and boundary conditions.

  3. For final_resolution_m what does a user put in when this changes between the equator and the poles, e.g. for a reanalysis dataset? It also changes between N-S and E-W.

True. This may be approximate but from my experience, more useful than degrees.

  4. Do we need the word final? Is it not obvious that the resolution we are talking about is the resolution of the model that actually resulted in the dataset that is being described in the implemented data model? I would be inclined to put the word horizontal in there instead. No decision on the word 'final'. grid_resolution may be a better term, or even horizontal_grid_resolution? Being specific that it is a horizontal resolution would be preferred.

horizontal_grid_resolution is perfect.

  5. Units for final_resolution_m: 'm', 'km' or 'decimal degrees'? 'm' might make more sense for a modelled dataset but for reanalysis this could be left out.

for reanalysis this is important too.

  6. Definition: "The final horizontal resolution, in meters, of the model used to create this dataset. If a reanalysis dataset this can be an approximation or left out."

So this would lead to:

"The final horizontal grid resolution, in meters, of the model used to create this dataset."

  7. I think offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period should all be definitions as they are now duplicated in both the 'model_config' and 'logger_main_config' tables? Yes, this is good.
  8. I think it is possible in the schema to limit using only one of these tables. That is, a user can use either the 'model_config' or the 'logger_main_config' but not both. Do we want to implement this limit? Not critical but would be useful. It could be included in future. If not implemented this time around, add an issue to track it.

As there is only one dataset per JSON file, only one of the two tables should be available. That way no discrepancies could occur. Does this make sense?

@oriollacave it would be great to get your input here especially!! cc @abohara, @kersting

stephenholleran commented 5 months ago

Hi @oriollacave,

Thanks a million for your responses. All great, just one clarifying question from me now.

For the "horizontal_grid_resolution_m" definition I think we should still include that for reanalysis datasets this can be an approximation. We can drop the optional bit. Something like

"The final horizontal grid resolution, in meters, of the model used to create this dataset. If a reanalysis dataset, this can be an approximation e.g. '50000' for MERRA-2 as it is about 50 km in the latitudinal direction."

https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/
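
For a sense of how approximate a single value in meters is, here is a small Python sketch that converts a grid spacing in degrees to meters on a spherical Earth. It uses the MERRA-2 grid of 0.5 degrees latitude by 0.625 degrees longitude; the `grid_spacing_m` helper and the choice of a 6,371 km mean radius are assumptions for illustration only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius, spherical approximation


def grid_spacing_m(dlat_deg, dlon_deg, latitude_deg):
    """Return (north-south, east-west) grid spacing in meters at a given latitude."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = dlat_deg * metres_per_degree
    # East-west spacing shrinks with the cosine of latitude.
    east_west = dlon_deg * metres_per_degree * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew_equator = grid_spacing_m(0.5, 0.625, 0.0)
_, ew_53 = grid_spacing_m(0.5, 0.625, 53.0)
print(round(ns), round(ew_equator), round(ew_53))
```

On this approximation the north-south spacing comes out at roughly 56 km everywhere, which is consistent with quoting "about 50 km", while the east-west spacing shrinks noticeably towards the poles. That is why the definition allows an approximation for reanalysis datasets.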

oriollacave commented 5 months ago

Agreed. Maybe I was not clear. I said that having the horizontal_grid_resolution_m for reanalysis is important too.

stephenholleran commented 5 months ago

This is now merged into the 'dev' branch and will soon be in a new release.