Structure and naming conventions in the schema to cover data availability

jonssonchristian commented 11 months ago

I introduced input_characteristics.data_availability under the wind resource assessment object to cover the raw data availability for wind measurement data, reference meteorological data and reference operational wind farm data. Previously this was covered only for operational wind farm datasets.
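For readers unfamiliar with the quantity, here is a minimal sketch of one common reading of raw data availability: the fraction of expected records that actually hold a usable value, before any filtering. The function name, the treatment of None and NaN as missing, and the sample readings are all assumptions for illustration; the EYA DEF may define the quantity differently.

```python
import math


def raw_data_availability(values, expected_count):
    """Return the fraction of expected records that hold a usable value.

    Assumption: a record counts as missing when it is None or NaN.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    valid = sum(1 for v in values if v is not None and not math.isnan(v))
    return valid / expected_count


# Hypothetical wind speed readings; two of the six records are missing.
readings = [7.2, None, 6.8, float("nan"), 8.1, 7.9]
availability = raw_data_availability(readings, expected_count=len(readings))
print(f"raw data availability: {availability:.1%}")
```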

We need to review and discuss the structure and naming conventions for this element. We may want to refine it to make it clearer and more useful.

The term input_characteristics aims to capture details that characterise the inputs to the assessment and that are not already captured in the various metadata elements (measurement station metadata, reference meteorological dataset metadata and reference operational wind farm metadata). The idea is to group different input characteristics together in this object, just like we group results for different quantities into the results object. I did not find it easy to come up with a clear and concise term, and input_characteristics was a compromise for an initial draft. If anyone can think of a better term, please make a suggestion. The idea is that this object should capture details about the raw input datasets, before the author of the EYA has undertaken any data filtering or other processing. One idea could perhaps be just inputs, but that might be interpreted as the inputs themselves rather than metadata about the inputs.

At the moment the input_characteristics group only has a single child item, data_availability. I cannot immediately think of other statistics we might add there, but thought it makes sense to have a group that can be extended if we later decide to enrich it with more detail. Even if it remains a single item, I think it makes more sense to have a group that adds context than to have data availability datasets directly under the wind resource assessment, since we also have a group for the results.

The field name for raw data availability is currently just data_availability rather than raw_data_availability. The reason is that the group input_characteristics is defined to cover only characteristics of the raw input datasets, so it should be clear from the context that it can be nothing other than raw data availability. I am generally in favour of nesting fields into groups, where parent groups add context to clarify the interpretation of different fields, rather than having long field names that carry all the context. For example, we have results.wind_speed and results.turbulence_intensity rather than wind_speed_results and turbulence_intensity_results. Let me know if you have a different view on this.
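To make the nesting argument concrete, here is a minimal sketch of how the grouped layout might look when loaded as a Python dictionary. Only the names input_characteristics, data_availability, results, wind_speed and turbulence_intensity come from the discussion above; the shape of the leaf entries, the dataset identifiers, the values and the get_path helper are hypothetical and not taken from the actual schema.

```python
from functools import reduce

# Hypothetical fragment of a wind resource assessment object.
# The real EYA DEF structure below these group names may differ.
wind_resource_assessment = {
    "input_characteristics": {
        # The parent group supplies the "raw input" context,
        # so the field itself does not need a raw_ prefix.
        "data_availability": [
            {"dataset_id": "mast_1", "value": 0.97},
            {"dataset_id": "reference_met_1", "value": 0.99},
        ],
    },
    "results": {
        "wind_speed": [],
        "turbulence_intensity": [],
    },
}


def get_path(document, dotted_path):
    """Follow a dotted path such as 'results.wind_speed' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), document)


raw_availability = get_path(
    wind_resource_assessment, "input_characteristics.data_availability"
)
print(raw_availability[0]["dataset_id"], raw_availability[0]["value"])
```

With a flat layout the same lookup would need a longer key such as raw_data_availability, repeating context that the parent group already provides.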

We currently do not capture processed data availability. Such data are of course also relevant to EYA reporting. The question is whether we can come up with clear enough definitions and data structures to make processed data availability useful in the EYA DEF. My current view is that it seems better to leave this for a later version, when we also expand the EYA DEF to cover the details of the wind resource assessment process. However, I have no strong opinion on this and would welcome proposals for how we can incorporate it at this stage in a simple enough form.

jonssonchristian commented 7 months ago

It is considered bad practice in taxonomy/ontology development to introduce completely new terms that are not already widely used in the domain. I think the same applies here, and I would say input_characteristics is a bad choice of field name, since no one currently uses the term regularly and no one will understand what it means without reading a definition.

jonssonchristian commented 6 months ago

I had a further think about this and would suggest that dataset_statistics is a better section name to cover statistics in relation to the input datasets, such as data availability.

I propose we limit it to the (raw) data availability for now, but that section could easily be extended later to cover different measures of data coverage and quality.
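As a rough illustration of what the rename would mean for existing documents, the sketch below moves the contents of input_characteristics to dataset_statistics without touching the child fields. The helper name rename_section and the sample document are hypothetical, and the actual change proposed for the schema may differ from this.

```python
def rename_section(assessment, old="input_characteristics", new="dataset_statistics"):
    """Return a shallow copy of the assessment with the section key renamed.

    The input dictionary is left unmodified; child fields such as
    data_availability are carried over as they are.
    """
    renamed = dict(assessment)
    if old in renamed:
        renamed[new] = renamed.pop(old)
    return renamed


# Hypothetical document using the original section name.
old_document = {
    "input_characteristics": {"data_availability": []},
    "results": {"wind_speed": []},
}

new_document = rename_section(old_document)
print(sorted(new_document))
```

Further measures of data coverage or quality could later be added as siblings of data_availability under the same section.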

jonssonchristian commented 6 months ago

Pull request #57 includes a proposed change to address this issue.

IEC-61400-15 / eya_def, issue #33