implement first complete version of WG1 data format for CMLs in data transformation code

OpenSenseAction / OPENSENSE_sandbox

Collection of runnable examples with software packages for processing opportunistic rainfall sensors

BSD 3-Clause "New" or "Revised" License


implement first complete version of WG1 data format for CMLs in data transformation code #16

Closed cchwala closed 1 year ago

cchwala commented 2 years ago

Based on the final decision of the data format for CML, PWS and SML, the existing data transformation code has to be adjusted.

EDIT: The current work was very much focused on instantaneous CML data from the OpenMRG dataset, hence, this issue was renamed to have a narrower focus and finally closed.

cchwala commented 2 years ago

Something like this should work to set the correct time encoding

ds.time.attrs['unit'] = 'seconds since 1970-01-01'

eoydvin commented 2 years ago

Seems like it should be set in the encoding, like this: ds.time.encoding['units'] = "seconds since 1970-01-01 00:00:00"
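To illustrate the difference between the two suggestions above, here is a minimal sketch on a stand-in dataset. The variable name `rsl`, the timestamps, and the values are made up for illustration and are not taken from the sandbox data. Setting `units` in `encoding` only affects how the time coordinate is serialized (e.g. when writing NetCDF); the in-memory `datetime64` values stay unchanged.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in dataset; names and values are illustrative assumptions only.
time = pd.date_range("2015-06-01", periods=3, freq="1min")
ds = xr.Dataset(
    {"rsl": ("time", np.array([-45.0, -45.3, -46.1]))},
    coords={"time": time},
)

# Store the unit in the encoding rather than in attrs, so that xarray
# uses it when encoding the datetime values on write.
ds.time.encoding["units"] = "seconds since 1970-01-01 00:00:00"

print(ds.time.encoding)
```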

nblettner commented 2 years ago

There is a PR for a preliminary implementation. It includes an adjustment of variable names that are present in the example datasets and which I can clearly identify (pmin and pmax are not yet changed).

However, not all datasets fulfill the requirements. E.g. `tsl` and `rsl` are missing in some datasets. These only include total loss (I now used `tl` as the variable name), which is not in the table of variables in the white paper. Moreover, `pmin` and `pmax` are not in the list of possible variables.

The updated code also assigns attributes to the datasets. However, the dictionary for attributes of the data variables is not yet complete. I.e. part of the table of the white paper that defines variable conventions still needs to be implemented.

nblettner commented 2 years ago

PR https://github.com/OpenSenseAction/OPENSENSE_sandbox/pull/37 has been merged. WG1 data format for CMLs is mostly implemented. Now, also variables pmin and pmax are changed to rsl_min and rsl_max, respectively, and the dictionary defining attributes is complete.
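The renaming and attribute assignment described above could look roughly like the sketch below. This is not the code from the PR: the function name `apply_conventions`, the two dictionaries, and the attribute values (units, long names) are assumptions for illustration only; the authoritative table is the one in the WG1 white paper.

```python
import numpy as np
import xarray as xr

# Hypothetical subset of the naming and attribute conventions.
RENAME_MAP = {"pmin": "rsl_min", "pmax": "rsl_max"}
VARIABLE_ATTRS = {
    "rsl_min": {"units": "dBm", "long_name": "minimum received signal level"},
    "rsl_max": {"units": "dBm", "long_name": "maximum received signal level"},
}


def apply_conventions(ds):
    """Return a renamed copy of `ds` with variable attributes attached."""
    # Rename only the variables that are actually present in this dataset.
    present = {old: new for old, new in RENAME_MAP.items() if old in ds}
    ds = ds.rename(present)
    # Attach attributes to every known variable found after renaming.
    for name, attrs in VARIABLE_ATTRS.items():
        if name in ds:
            ds[name].attrs.update(attrs)
    return ds


# Made-up min-max data for demonstration.
raw = xr.Dataset(
    {
        "pmin": ("time", np.array([-50.0, -51.0])),
        "pmax": ("time", np.array([-48.0, -47.5])),
    }
)
converted = apply_conventions(raw)
print(list(converted.data_vars))
```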

Still missing is the following:

* `cml_id` in the OpenMRG dataset (so far only `sublink_id`)
* required variables `tsl` and `rsl` in the Czech data sets (now only `tl`)

cchwala commented 2 years ago

Thanks for the summary.

We will keep this issue open to track what still has to be done.

cchwala commented 1 year ago

Still missing is the following:

* `cml_id` in the OpenMRG dataset (so far only `sublink_id`)

* required variables `tsl` and `rsl` in the Czech data sets (now only `tl`)

The first point will be solved by #51.

@fenclmar: If I understand correctly, the second point about missing `tsl` and `rsl` in the Czech CML datasets cannot be fixed since the variables are missing in the raw data. Correct? If so, we leave it like that and maybe just raise a warning when transforming the data, or maybe just silently ignore this issue since it will be apparent from the resulting `xarray.Dataset` that only `tl` is there. Please comment.
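If we go for the warning, a minimal sketch could look like this. The helper name `warn_if_missing` and the message text are hypothetical, and a plain dict stands in for the dataset here to keep the example dependency-free; the membership test `name in ds` works the same way on an `xarray.Dataset`.

```python
import warnings

# Variables named as required in this thread.
REQUIRED_VARIABLES = ("tsl", "rsl")


def warn_if_missing(ds, required=REQUIRED_VARIABLES):
    """Warn about required variables absent from `ds` and return their names."""
    missing = [name for name in required if name not in ds]
    if missing:
        warnings.warn(
            f"Required variables missing from dataset: {missing}",
            UserWarning,
            stacklevel=2,
        )
    return missing


# Stand-in for a dataset that only contains total loss; values are made up.
czech_like = {"tl": [52.1, 52.4, 53.0]}
missing = warn_if_missing(czech_like)
```

Returning the list of missing names lets the calling transformation code decide whether to continue or stop.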

cchwala commented 1 year ago

@fenclmar Two other points

* We have min-max CML data from the Netherlands that can be used in the sandbox. Is the WG1 data format specification finished regarding the naming of the min-max data?
* What is the status regarding PWS and SML data? Do we also want to show example data with perfect formatting and naming conventions now for the milestone v0.1 at the Krakow 2023 meeting? If not, we can rename this issue to focus on CML and OpenMRG data and then track the other open tasks regarding the WG1 data format in a new issue.

cchwala commented 1 year ago

There was a lot of work on this in #62 and we have everything working for our most important CML dataset, the OpenMRG data.

I will rename this issue to be more specific and then close it.

We should open a new issue to discuss the next steps regarding data format, e.g. which PWS data shall be transformed as an example, how and with which code. Maybe this should be done in a separate repo then after doing #7.