Closed stephenholleran closed 5 months ago
@stephenholleran Agreed. Is a virtual_met_mast measurement location type sufficient?
@stephenholleran and @abohara I think that this is a complex problem and needs its own data schema. The metadata related to reanalysis data is complex. Here are just some examples related to that complexity.
There are other complications, but this is just a short example of the difficulty of trying to merge that into the current schema.
Another thing to keep in mind is that netCDF is a popular format in this environment and it already carries a good amount of metadata. Can we use that data format to our advantage to come up with a concise way of having a complete metadata schema along with the time series?
If this is really necessary, perhaps it is time for us to bring in the main providers of these datasets and have a discussion about it.
@kersting Thanks for your comments and perspective.
My thought was that if we did receive a VMM csv (or other tabular data) then the data model could be used to describe basic information about each column (e.g. wind speed at 80 m) and pass on that very basic set of information. However, I agree with you that we may need another table (or many tables) specifically to describe the more detailed "metadata" around reanalysis.
Is there a pre-existing structure that you consistently see in other files that we may be able to use as a template?
Hi @abohara , @kersting,
Fair points @kersting.
As the WRA Data Model stands, except for measurement_station_type_id, the below example is how a VMM which has a wind speed and a wind direction could be incorporated.
To answer your points @kersting, I think it is fair enough to use the logger_main_config to capture the timezone offset. I agree it is not great, but I think it is fine.
On your other point about the different types (raw ERA5 or downscaled ERA5), I think we could possibly distinguish between the two by incorporating an is_calculated field on the measurement point or something similar. This question of calculated or not calculated has come up previously in terms of floating lidar correction on wind speed and on solar measurements. This could be a nice solution to solve a few different issues.
Looking at the below example, what else would you like to see captured to describe a VMM?
My feeling is that the below example captures a VMM pretty well from a first cut perspective.
Cheers,
Below is an example VMM according to the WRA Data Model, except for measurement_station_type_id.
{
  "author": "Vortex System",
  "organisation": "Vortex",
  "date": "2023-04-25",
  "version": "1.2.0-2023.01",
  "measurement_location": [
    {
      "name": "Vortex ERA5 downscaled",
      "latitude_ddeg": 53,
      "longitude_ddeg": -5.5,
      "measurement_station_type_id": "virtual_met_mast",
      "logger_main_config": [
        {
          "logger_oem_id": "Other",
          "logger_serial_number": "---",
          "date_from": "2000-01-01T00:00:00",
          "date_to": null,
          "offset_from_utc_hrs": -5
        }
      ],
      "measurement_point": [
        {
          "name": "WS100m",
          "measurement_type_id": "wind_speed",
          "height_m": 100,
          "height_reference_id": "ground_level",
          "logger_measurement_config": [
            {
              "measurement_units_id": "m/s",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WS100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        },
        {
          "name": "WD100m",
          "measurement_type_id": "wind_direction",
          "height_m": 100,
          "logger_measurement_config": [
            {
              "measurement_units_id": "deg",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WD100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
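To illustrate how a consumer might read such a file, here is a minimal Python sketch that walks the nested structure of the example above and lists one flat record per data column. The JSON is a trimmed copy of the example (one measurement point only), and the function name `describe_columns` and the record keys are illustrative assumptions, not part of the WRA Data Model.

```python
import json

# Trimmed copy of the example above (one measurement point only).
vmm_json = """
{
  "measurement_location": [
    {
      "name": "Vortex ERA5 downscaled",
      "measurement_station_type_id": "virtual_met_mast",
      "measurement_point": [
        {
          "name": "WS100m",
          "measurement_type_id": "wind_speed",
          "height_m": 100,
          "logger_measurement_config": [
            {
              "measurement_units_id": "m/s",
              "column_name": [
                {"column_name": "WS100m", "statistic_type_id": "avg"}
              ]
            }
          ]
        }
      ]
    }
  ]
}
"""


def describe_columns(metadata):
    """Return one flat record per data column described in the metadata.

    Hypothetical helper for illustration; the record keys are our own choice.
    """
    records = []
    for location in metadata["measurement_location"]:
        for point in location["measurement_point"]:
            for config in point["logger_measurement_config"]:
                for column in config["column_name"]:
                    records.append({
                        "column": column["column_name"],
                        "type": point["measurement_type_id"],
                        "height_m": point["height_m"],
                        "units": config["measurement_units_id"],
                        "statistic": column["statistic_type_id"],
                        "station_type": location["measurement_station_type_id"],
                    })
    return records


columns = describe_columns(json.loads(vmm_json))
for record in columns:
    print(record)
```

The same loop works unchanged for a mast or lidar file, which is the practical argument for keeping VMM data in the same structure.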
@stephenholleran and @abohara I think that the only downside is putting the VMM settings in the logger_main_config. It is not optimal but it can be done. There are other settings such as timestamp_is_end_of_period and averaging_period_minutes that would need to be incorporated as well, and those are doable. The is_calculated field is a nice solution for the downscaled issue. I do think that we can adapt the data model to account for VMM time series by using logger_main_config, but I'm not sure it is the best path. For example, think of VMM providers in the industry. Their products are not confined only to time series. There are options for exports of wrg files, extreme gust, icing, etc. Do VMM series belong to that category of products generated by reanalysis data, or do VMM series fit better with measured data? Wouldn't it be better to create a data model that describes datasets that are derived from reanalysis data? I don't have an answer for that question but I think it is an important debate for us to have. Perhaps we can discuss this in the next meeting.
Hi @kersting,
I would consider reanalysis data (MERRA-2 and ERA5) and timeseries virtual met mast data at a particular geographic location to fit within our data model.
Other downscaling products like wind maps, wrgs that cover an "area" would be completely different and have a different data model I would expect. (That data model could be used for scanning lidars, thinking out loud.)
I think it is reasonable to expand our data model to include reanalysis and VMM timeseries data. It wouldn't take much. Expanding on my previous suggestion of adding "virtual_met_mast" to measurement_station_type_id, we could distinguish by also including "reanalysis".
I agree the logger_main_config is not the greatest place to capture the details, but I don't think it is that bad either, and with a tutorial it is easily explainable. It is a quick MVP for VMM.
cc @abohara
@stephenholleran I agree that by pushing the boundaries of logger_main_config, and with some additional tweaks, we are able to describe a VMM. I am afraid that we're mixing two very different concepts under the same data model. I'd rather have this in a separate model. I also agree that this is low-hanging fruit, meaning few tweaks to have a brand new time series supported by the model. I wonder if it would make sense to survey the resource assessment community at the next workshop and ask whether people would be supportive of mixing those different concepts under the same umbrella. Also, I don't want to make this complicated, so I'm flexible in accepting the proposal if a survey would make things too complicated.
Hi, I have just a few comments. I do not fully know all the previous agreements and steps but will give my opinion. This is just for time series; spatial grids would need a separate discussion.
We are here defining things as measurement and logger, but we are willing to include synthetic data (virtual met mast as you say, which I don't like at all, though it's fine). I would then prefer to refer in general to meteorological_data, for example, and add an extra field such as source or generation_type which would take the value measurement or synthetic. I would then define special tags for the measurement type (sodar, met mast, lidar, satellite, simulation, ...). Each would then have different metadata for details about equipment/simulation.
The other thing is that for synthetic data it would be interesting to have details about the simulation. We typically include in our headers this very small but important info: final resolution, reanalysis source, model used.
We typically also separate variables and heights, so I don't understand why there is WS100m when we are also labeling height 100m. It helps to have the variable separated for automation purposes. What if we have a variable with multiple heights? Can we set the height to an array-like field?
{
  "author": "Vortex System",
  "organisation": "Vortex",
  "date": "2023-04-25",
  "version": "1.2.0-2023.01",
  "location": [
    {
      "name": "Vortex ERA5 downscaled",
      "latitude_ddeg": 53,
      "longitude_ddeg": -5.5,
      "source": "virtual_met_mast",
      "main_config": [
        {
          "logger_oem_id": "Other",
          "logger_serial_number": "---",
          "date_from": "2000-01-01T00:00:00",
          "date_to": null,
          "offset_from_utc_hrs": -5,
          "model": "WRF",
          "reanalysis": "ERA5",
          "resolution": "3km"
        }
      ],
      "measurement_point": [
        {
          "name": "WS",
          "measurement_type_id": "wind_speed",
          "height_m": [100, 110, 120],
          "height_reference_id": "ground_level",
          "logger_measurement_config": [
            {
              "measurement_units_id": "m/s",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_names": ["WS_100m", "WS_110m", "WS_120m"],
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        },
        {
          "name": "WD",
          "measurement_type_id": "wind_direction",
          "height_m": 100,
          "logger_measurement_config": [
            {
              "measurement_units_id": "deg",
              "date_from": "2000-01-01T00:00:00",
              "date_to": null,
              "column_name": [
                {
                  "column_name": "WD_100m",
                  "statistic_type_id": "avg"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
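For comparison, the multi-height idea above can also be expressed in the current structure by generating one measurement point per height. The Python sketch below is only an illustration: the helper name `expand_heights`, its parameters, and the `WS_100m` column naming pattern are assumptions taken from this example, not part of the data model.

```python
def expand_heights(variable, measurement_type_id, heights_m, units):
    """Build one measurement point per height.

    Hypothetical helper: the "<variable>_<height>m" naming pattern is an
    assumption borrowed from the example above.
    """
    points = []
    for height in heights_m:
        column = f"{variable}_{height}m"
        points.append({
            "name": column,
            "measurement_type_id": measurement_type_id,
            "height_m": height,  # one numeric height per measurement point
            "height_reference_id": "ground_level",
            "logger_measurement_config": [{
                "measurement_units_id": units,
                "date_from": "2000-01-01T00:00:00",
                "date_to": None,
                "column_name": [
                    {"column_name": column, "statistic_type_id": "avg"}
                ],
            }],
        })
    return points


points = expand_heights("WS", "wind_speed", [100, 110, 120], "m/s")
print([p["name"] for p in points])
```

A provider could keep its compact internal description (variable plus list of heights) and expand it like this only when exporting metadata.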
The other thing is that this is exclusively for ASCII files, but we are starting to disseminate in netCDF and zarr formats, which do not have "columns". OK, those files have their own metadata, but having json/yaml up front can help a lot, for example in reducing querying and/or download times. Maybe it's another thread, but I would like to know how this is being treated. Thanks all for your work and time!
@oriollacave excellent remarks so thank you for sharing your opinions. I have some comments about what you posted.
@stephenholleran or @abohara anything to add here?
@oriollacave would you be open for a meeting to talk about this topic?
@kersting @oriollacave thanks for the good comments and discussion.
I agree on the need to communicate some basic details about the simulation. I can see it being communicated for the entire VMM generation process or at each measurement point level (or both).
For tracking simulation params / settings: I can see an additional table like logger_main_config, e.g. simulation_main_config, that describes variables related to the overall simulation settings. Fields like final resolution, reanalysis source, model used etc. could be in this table. I think re-purposing logger_main_config would likely require a lot of modifications that would pull it away from its original purpose of tracking a physical logger. I agree with @oriollacave and @kersting on this.
I am not sure I understand your concerns @kersting about the list vs dict, but perhaps it was based on the misinterpretation of the data model by @oriollacave ?
@oriollacave Your json example in the post is different than what @stephenholleran posted above. His example does address some of your concerns:
- measurement_station_type_id: virtual_met_mast: For data source there is already a field (you can see it in his example). This is changed to mast, lidar etc. as needed
- and temp @ 20m would be two different measurements. The height_m tracks the height as a number, while the name of the measurement or column name is usually a string that each user may set using their own naming convention e.g. 100m wind speed or wspd_100. You can see the demo file - I recommend pasting it into https://jsonformatter.org/
@oriollacave I apologize if I misunderstood your intentions & recommendations here.
Overall, I agree with your sentiments, that this may not be the most "perfect" way to communicate the entirety of a "simulation", but the portion that is needed for resource assessment can fit into our model "sufficiently" to meet just the RA needs. For RA end user, having the data in one consistent format regardless of the source in my view has benefits that may outweigh some of the drawbacks ( which mostly seem like naming oddities at the moment ).
Thanks @oriollacave for your great constructive comments. Thanks too @abohara and @kersting.
I'll try and categorize the issues and summarize:
Spatial data
Definitely a different topic for discussion. This thread is just dealing with timeseries data at a particular geographical location.
Separate variables and heights
I don't think this is an issue as @kersting had pointed out that for measured wind speed values there are a lot of other statistic types and data columns all associated with the one wind speed measurement at 100m. Therefore we keep the measurement points separate so everything associated with that measurement point is gathered in the one place. I think my example may have been misleading as there was just the one measurement point.
Source or as we have it Measurement Station Type
As @abohara said we already have a field to capture 'source' and this is 'measurement_station_type_id'.
The data model's current list of "measurement_station_type_id" values is mast, lidar, sodar, floating_lidar and solar.
The intention is to add reanalysis and virtual_met_mast to distinguish datasets that come from reanalysis and downscaling models. What the names of these are is up for debate.
- reanalysis is straightforward and should be included as an option as a lot of wind analysts just work with reanalysis data and are familiar with it. I think distinguishing between data that comes out of a downscaling WRF model and reanalysis is important even though both could be considered synthetic or virtual_met_mast.
- virtual_met_mast, synthetic, modeled, simulation or ???? I would somewhat agree with @kersting that synthetic might be a bit too general. But then the same could be said for modeled, though I like this option. virtual_met_mast is used around the industry and wind analysts have an understanding of what they are getting with this. @oriollacave you guys use this term too ;) https://vortexfdc.com/windsite/virtual-met-mast/ . modeled is also pretty good as it explains that the data at this location is modelled data.
How to capture simulation properties
OK, so we need to capture some simulation properties like model_used, final_spatial_resolution, reanalysis_source.
As @abohara alluded to, these are probably better in a separate table so as not to clog up the logger_main_config table. This can be in the same vein as the vertical_profiler_properties and mast_properties. I would still use the logger_main_config for timezone and start and end dates.
Before getting into the detail of a simulation properties table, I would first like to confirm that we would all be happy to make these changes to the WRA Data Model to capture reanalysis and downscaled data? I definitely am.
Thanks all!
+1 on
Before getting into the detail of a simulation properties table, I would first like to confirm that we would all be happy to make these changes to the WRA Data Model to capture reanalysis and downscaled data? I definitely am.
Hi, I agree with most of it, as I'm also understanding the schema better, so thanks for the positive comments. I agree with moving forward with what @stephenholleran proposed.
Virtual met mast is fine and the better option (I don't like it, but it is accepted, well understood and marketing always wins).
Anyway, just a question and two comments/proposals.
Apologies if I'm confused or missing something. I think @stephenholleran can move forward and use whatever from this extra information makes sense.
@oriollacave would you be open for a meeting to talk about this topic?
sure! just let me know a date
Hi @oriollacave,
Sorry for the slow response, it has been a long week with lots of IEA Task and data standards/sharing things going on all at once.
sure! just let me know a date
Our next scheduled meeting is the 22nd but that will be fully discussing the workshop which will be on the 29th. The meeting after that, on the 6th July, will be a debrief of the workshop. The next scheduled meeting would then be the 20th July at 4 pm Irish time. (I can't do the 13th as we have a company day.)
The 20th July is a bit away but it'll probably come around fast enough. Would that be ok for everyone?
Hi @oriollacave, @abohara, @kersting,
I am finally getting around to actually doing something with this.
First off I have added reanalysis and virtual_met_mast to the measurement_station_type enum. I think this is fine and we have concluded on that.
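As a rough illustration of what the extended enum allows a consumer to do, the Python sketch below checks a location's measurement_station_type_id against the values listed in this thread and separates measured from modelled sources. The function name and the "measured"/"modelled" labels are assumptions made for this example only.

```python
# Values listed in this thread; the last two are the newly added ones.
MEASURED_TYPES = {"mast", "lidar", "sodar", "floating_lidar", "solar"}
MODELLED_TYPES = {"reanalysis", "virtual_met_mast"}


def classify_station_type(location):
    """Return "measured" or "modelled" for a measurement location.

    Hypothetical helper: raises ValueError for a value outside the enum.
    """
    station_type = location["measurement_station_type_id"]
    if station_type in MEASURED_TYPES:
        return "measured"
    if station_type in MODELLED_TYPES:
        return "modelled"
    raise ValueError(f"Unknown measurement_station_type_id: {station_type!r}")


print(classify_station_type({"measurement_station_type_id": "virtual_met_mast"}))
print(classify_station_type({"measurement_station_type_id": "mast"}))
```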
Second thing is the additional table to capture the properties of either reanalysis or virtual met mast configuration. I haven't made any changes to the actual schema yet, I have just created a sample of what it might look like. You can check out this draft PR #246 to see the changes. I'll paste the extra table below for ease of discussion.
"main_config": [{
  "reanalysis_source": "era5",
  "final_resolution_m": "11000",
  "model_used": "ECMWF IFS",
  "offset_from_utc_hrs": 0,
  "averaging_period_minutes": 60,
  "timestamp_is_end_of_period": false,
  "date_from": "2000-01-01T00:00:00",
  "date_to": null,
  "notes": "This is an ERA5 reanalysis dataset produced by ECMWF.",
  "update_at": "2023-11-24T18:13:00"
}],
It would be good if we were able to discuss this on a call. The next scheduled one is the 7th Dec at 4pm Irish time. Let me know if you can make it?
Some points for discussion:
- main_config similar to logger_main_config or we can be more specific and call it model_config or simulation_config?
- reanalysis_source, final_resolution_m and model_used. Are these the best names? All are optional.
- offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period from the logger_main_config table. This way we just drop that table when it is a reanalysis or VMM. This can be restricted in the JSON Schema too, if we want that?
Feel free to respond here.
@stephenholleran thanks for looking into this. I'll be in the next meeting on the 7th and we can discuss more. Here are some of my remarks.
- main_config because it is more compact.
- native_resolution_m property for the case of data that is downscaled and to differentiate from final_resolution_m.
Looks good to me too. Resolution is not always in m, so having it this way might be only approximate. Not a big deal, just for you to know. I'm available for 7th Dec at 4pm Irish time.
As an off topic, are you thinking of adding an md5 or sha hash and filename tags to link to the data? That's something important for us.
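To make the hash question concrete, here is a minimal Python sketch of how a metadata file could carry a checksum that ties it to its data file. It uses SHA-256 from the standard library; the function name `file_link` and the keys `filename` and `sha256` are purely hypothetical and are not fields of the WRA Data Model.

```python
import hashlib
import os
import tempfile


def file_link(path, chunk_size=65536):
    """Return a small dict tying a data file to its SHA-256 checksum.

    Hypothetical helper; the key names are assumptions for illustration.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large time series files are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {"filename": os.path.basename(path), "sha256": digest.hexdigest()}


# Demo with a tiny throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vmm_timeseries.csv")
    with open(path, "w", newline="") as f:
        f.write("timestamp,WS100m\n2000-01-01T00:00:00,7.2\n")
    link = file_link(path)

print(link["filename"], len(link["sha256"]))
```

A reader of the metadata could recompute the checksum on the downloaded file and compare it before trusting that the description matches the data.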
Below is an abstract from our Dec 7th, 2023 bi-weekly call. https://github.com/IEA-Task-43/digital_wra_data_standard/discussions/129#discussioncomment-7778423
- nudging method, PBL schemes, number of nests, domain size, orography, land use, etc. The feeling was that though these are important for comparing downscaling models, these details are generally available on the providers' websites/scientific papers and do not necessarily need to be included in every single dataset. Similar to lidar data, there are lots of parameters/technology that are important to compare different lidars, however these are not required for a wind analyst when performing an energy yield assessment and so are not included in the WRA Data Model. Therefore, we will leave these out for now. As we will have a new table to capture model settings (this is the big change) we can easily add any or all of these in the future quite quickly.
- main_config it was decided that this may be confusing for a user as it is not specific enough. We already have logger_main_config which a user knows they need to fill out when they have loggers but they may also think they need to fill out something in main_config as it is not specific. Therefore, we decided we should be more specific and rename it.
- main_config to model_config as opposed to simulation_config or synthetic_config. Model is more generic for this type of data. An AI generated dataset is not a simulation and so should be more flexible for future use to capture AI generated datasets. The results of the modelling are synthetic data but you don't have a synthetic configuration.
Below is an extract from our 18th Jan 2024 call regarding this issue. https://github.com/IEA-Task-43/digital_wra_data_standard/discussions/129#discussioncomment-8172178 Unfortunately not many people turned up so it would be good if we could continue the discussion here and come to some conclusions.
- I think reanalysis_source should just be reanalysis as it is not a "source" when the data is an actual reanalysis dataset. The definition would still include "source". "The name of the reanalysis dataset that is the source or the result of this model." No consensus. Post question in issue log.
- For final_resolution_m what does a user put in when this changes between equator and the poles e.g. for a reanalysis dataset? It also changes between N-S and E-W?
- Do we need the word final? Is it not obvious that the resolution we are talking about is the resolution of the model that actually resulted in the dataset that is being described in the implemented data model? I would be inclined to put the word horizontal in there instead? No decision on the word 'final'. grid_resolution may be a better term or even horizontal_grid_resolution? Being specific that it is a horizontal resolution would be preferred.
- Units for final_resolution_m, 'm', 'km' or 'decimal degrees'? 'm' might make more sense for a modelled dataset but for reanalysis this could be left out.
- I think offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period should all be definitions as they are now duplicated in both the 'model_config' and 'logger_main_config' tables? Yes, this is good.
@oriollacave it would be great to get your input here especially!! cc @abohara, @kersting
Hey @oriollacave, would love to get your input on the above queries? I can have a call with you directly if you prefer? Let me know. Thanks,
HI! This is clearly on my TODO list in urgent folder, just below SUPER URGENT. I will work on it beginning next week.
My OPINIONS below.
Below is an extract from our 18th Jan 2024 call regarding this issue. #129 (comment) Unfortunately not many people turned up so it would be good if we could continue the discussion here and come to some conclusions.
- Should we include all available reanalysis datasets or just MERRA-2 and ERA5? Am I missing any? https://github.com/IEA-Task-43/digital_wra_data_standard/pull/246/files#diff-da1380042a0574c16546fc2a08f9f6ff97b3817ff19bf0b88ef0e846101a2824R1023 Keep the full list so it is complete.
This looks good to me. We know this will change with time.
- I think reanalysis_source should just be reanalysis as it is not a "source" when the data is an actual reanalysis dataset. The definition would still include "source". "The name of the reanalysis dataset that is the source or the result of this model." No consensus. Post question in issue log.
reanalysis alone is understood. Otherwise we would have to set initial conditions and boundary conditions.
- For final_resolution_m what does a user put in when this changes between equator and the poles e.g. for a reanalysis dataset? It also changes between N-S and E-W?
True. This may be approximate but, from my experience, more useful than degrees.
- Do we need the word final? Is it not obvious that the resolution we are talking about is the resolution of the model that actually resulted in the dataset that is being described in the implemented data model? I would be inclined to put the word horizontal in there instead? No decision on the word 'final'. grid_resolution may be a better term or even horizontal_grid_resolution? Being specific that it is a horizontal resolution would be preferred.
horizontal_grid_resolution is perfect.
- Units for final_resolution_m, 'm', 'km' or 'decimal degrees'? 'm' might make more sense for a modelled dataset but for reanalysis this could be left out.
For reanalysis this is important too.
- Definition: "The final horizontal resolution, in meters, of the model used to create this dataset. If a reanalysis dataset this can be an approximation or left out."
So this would lead to:
"The final horizontal grid resolution, in meters, of the model used to create this dataset."
- I think offset_from_utc_hrs, averaging_period_minutes and timestamp_is_end_of_period should all be definitions as they are now duplicated in both the 'model_config' and 'logger_main_config' tables? Yes, this is good.
- I think it is possible in the schema to limit using only one of these tables. That is, a user can use either the 'model_config' or the 'logger_main_config' but not both. Do we want to implement this limit? Not critical but would be useful. It could be included in future. If not implemented this time around, add an issue to track it.
As there is only one dataset for a json, only one should be available. No discrepancies could occur. Does this make sense?
@oriollacave it would be great to get your input here especially!! cc @abohara, @kersting
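The "either model_config or logger_main_config, but not both" rule discussed above could eventually live in the JSON Schema itself. Until then, a consumer could check it in code. The Python sketch below is one possible reading of that rule under stated assumptions: the function name is hypothetical, and it treats a missing or empty table as "not used".

```python
def uses_exactly_one_config(location):
    """Check that a location uses model_config or logger_main_config, not both.

    Hypothetical check; a missing key or an empty list counts as "not used".
    """
    has_logger = bool(location.get("logger_main_config"))
    has_model = bool(location.get("model_config"))
    return has_logger != has_model  # True only when exactly one is present


vmm_location = {
    "measurement_station_type_id": "virtual_met_mast",
    "model_config": [{"reanalysis": "era5", "offset_from_utc_hrs": 0}],
}
mixed_location = {
    "measurement_station_type_id": "virtual_met_mast",
    "model_config": [{"reanalysis": "era5"}],
    "logger_main_config": [{"logger_oem_id": "Other"}],
}

print(uses_exactly_one_config(vmm_location))
print(uses_exactly_one_config(mixed_location))
```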
Hi @oriollacave,
Thanks a million for your responses. All great, just one clarifying question from me now.
For the "horizontal_grid_resolution_m" definition I think we should still include that for reanalysis datasets this can be an approximation. We can drop the optional bit. Something like
"The final horizontal grid resolution, in meters, of the model used to create this dataset. If a reanalysis dataset, this can be an approximation e.g. '50000' for MERRA-2 as it is about 50 km in the latitudinal direction."
Agree. Maybe I was not clear. I said that having the horizontal_grid_resolution_m for reanalysis is important too.
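To show why a single value in metres can only be approximate for a latitude/longitude grid, the Python sketch below converts a grid spacing in degrees to metres in the N-S and E-W directions. It assumes a spherical Earth with a mean radius of 6,371 km and a MERRA-2 grid spacing of 0.5° latitude by 0.625° longitude; the function name `grid_spacing_m` is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation (assumption)


def grid_spacing_m(dlat_deg, dlon_deg, latitude_deg):
    """Return (north_south_m, east_west_m) spacing of a lat/lon grid cell."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = dlat_deg * metres_per_degree
    # East-west spacing shrinks with the cosine of latitude.
    east_west = dlon_deg * metres_per_degree * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Assumed MERRA-2 spacing, evaluated at the equator and at 53 degrees north.
ns, ew_equator = grid_spacing_m(0.5, 0.625, 0.0)
_, ew_53 = grid_spacing_m(0.5, 0.625, 53.0)
print(round(ns), round(ew_equator), round(ew_53))
```

Under these assumptions the N-S spacing comes out at roughly 55 km everywhere, while the E-W spacing falls from about 69 km at the equator to about 42 km at 53° N, which is why the definition allows an approximation for reanalysis datasets.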
This is now merged into the 'dev' branch and will soon be in a new release.
From the February 2022 users workshop the below conversation occurred.
Steve Clark didn't create an issue so here it is.
I think the data model can easily handle reanalysis or VMM data. A measurement location type might be useful to distinguish from met masts, lidars, etc.?