Correction and/or discarding of erroneous measurements

BaptisteVandecrux commented 4 years ago

I started to list erroneous or biased measurements emerging from the comparison with PROMICE stations:

https://github.com/BaptisteVandecrux/GC-Net-evaluation/issues

Do you plan to correct these issues in the near-future? We are working, at PROMICE, on a python toolbox for the reporting/correction of this type of measurement error. We could join forces if you are interested.

Ionut:

We plan to work with MeteoIO, but please do send us the link to your toolbox and Rebecca will evaluate it (as it might be useful for reporting/correction of the realtime data for other WSL projects). So there might be potential for joining forces.

Derek:

Thank you for the report. I will try to make these corrections in the reprocessing. I am not sure yet the best way to make these station and instrument specific corrections, whether its in meteoIO custom routines or a custom python pre- or post-processor. We will also need to document these corrections. I hope that in November I can finally devote a significant amount of time to the reprocessing. Maybe we can collaborate through this process and perhaps I can also get some help from Rebecca.

BaptisteVandecrux commented 4 years ago

More thoughts on this:

1) What about separating the different processing levels into separate data products:

L0: uncorrected, unfiltered data
L1: filtered, adjusted data This would allow the data user to know exactly what they are using and that a L1 product implies the intervention of a person to set the filter's characteristics and the correction routines.

2) Rotating wind direction measurements, filtering and adjusting height, discarding a certain time period (... etc.) are tasks that we currently conduct on PROMICE stations and that we will have to conduct at GEUS in the future for GC-Net stations. At the moment it is done in a not-so-transparent way. But we are working on a processing framework that would make those adjustments readable and automatic. It would be great if we could build a common tool for this. Unfortunately, we are not so familiar with C++ at GEUS. So it will be hard to adapt custom-made routines in MeteoIO to match our future need. Python processing tools would suit us better.

BaptisteVandecrux commented 4 years ago

An attempt to flag data issues in PROMICE dataset: https://github.com/GEUS-PROMICE/PROMICE-AWS-data-issues

and some discussion about it: https://github.com/GEUS-PROMICE/PROMICE-AWS-data-issues/issues/9

bavay commented 4 years ago

@BaptisteVandecrux If you don't mind me jumping in... I have been reluctant to do so, because I don't want to give the impression that I'm trying to shove my solution down your throat, but on the other hand I want to dissipate any misunderstandings. So, I am the main developer of MeteoIO. I have been working on it almost full time for the last 12 years. It consists of ~54k lines of code (ie without the comments). The whole point of it is data preprocessing in a robust, fast and flexible way. Basically, nothing is hard-coded but you declare what you want to do in a configuration file. This means that even if you don't know any c++ (and most of our users don't have a clue about c++), you can fully use it and tailor the processing to your needs: you just declare which algorithms you want to apply on your data, per parameter (and you can stack multiple algorithms). Of course, if you need an algorithm that is not available, then you need to implement it (or have somebody do it for you, we've had several algorithms implemented by contractors). You can have a look at the algorithms that are already implemented for data corrections in the documentation. As you can see, the goal is not to have custom processing but to stack multiple generic processing algorithms to reach your goal (so far, we have only one processing algorithm that is not fully generic: the grass removal is tailored for the northern hemisphere because it assumes minimum and maximum dates for a winter season).

In contrast to custom-made processing tools like jaws, it offers you much more flexibility (in jaws the processing is hard-coded) so you can reuse MeteoIO anywhere with any kind of data. But this comes with a price: you have to decide which processing you want to apply (ie you have to write your configuration file) and keeping things more modular and generic can make the implementation more challenging. For example, MeteoIO accepts any sampling rate, including irregular sampling rate (ie the data is measured at almost random intervals), so a filter like a low pass filter had to first check if the sampling rate is appropriate for the filtering period for all new data points... Moreover, where in jaws you can automatically get some reanalysis data to help with a tilt correction, it is way less natural in MeteoIO (since nothing is hard-coded and everything should work anywhere, anytime as much as possible, it would be much better to find another way to compute the tilt, that would not depend on downloading large amount of data).

A few places use MeteoIO to standardize / filter / correct their data (as well as report data quality), I also do use it for our operational data (so we can see when a sensor starts to fail) as well as to prepare the data for our operational models or to prepare datasets for further consumption (data coming directly from research AWS and provided to the data owners). You can download the GUI for it (Inishell) and start playing to have a feeling of its strength and weaknesses (Inishell comes with a version of MeteoIO embedded).

And now, it's time for me to stop my shameless advertising rant and go back to coding...

BaptisteVandecrux commented 4 years ago

Hi Mathias, Thank you for joining the conversation! It is great to hear the voice of experience and I really don't want to be reinventing the wheel. I have seen MeteoIO working great for simple actions like filtering, resampling, conversions... etc. I completely agree that it should be used for such tasks.

However there are many site-specific, period-specific custom-made actions that we need to apply to the GC-Net (and PROMICE) data. Maybe it would be faster if I ask you directly if we can conduct them using meteoIO:

Affine transform of a specific variable between two specific dates (example)
Temporal shift of a variable within the dataset (example)
Manual flagging of suspicious measurements between two dates (example)
Manual swapping of data between variables (e.g. during a given period, move data from T1 to T2 and from T2 to T1)
Different filtering/resampling/conversion strategies for specific periods and for specific sites.

We are looking for an approach that is transparent (these processing steps need to be reported to the users since they alter/improve the original data), modular (so that we can append new custom-made processing steps when and where needed) and easy to re-run every times a new release is being processed.

Is that possible at the moment with meteoIO?

Thanks again for your help!

iosifescu commented 4 years ago

If you don't mind me joining the conversation, the main advantage of meteoIO is that it is fast - really fast. The Matlab processing took something like 15 minutes or so, and the simple meteoIO tests that I've made took a few seconds. It was so fast that initially I thought the software didn't even run properly through the data (but it did). Even after adding more complicated filters, I still have the impression will still be fast enough so you could re-process an entire station (with all historical data from the beginning) every 10 minutes, with a lot of time to spare...

iosifescu commented 4 years ago

Transparency and especially reproducibility are also important points for us in EnviDat - and meteoIO supports just that. If Derek agrees, I will encourage him to also publish the meteoIO config files, together with the L0 data (however L0 data can only be made available from 2008 onwards, as prior to that, there were old loggers, which delivered values without proper timestamps so the original data is quite hard to decipher/understand - Derek could provide you more information about that).

bavay commented 4 years ago

@BaptisteVandecrux SOrry to be late to answer, I was sick (not covid but sitll unpleasant!). So, here are my detailed answers:

Affine transform of a specific variable: this can be done by using two filters, one after the other: ADD then MULT. This could look like this (here for the air temperature): TA::FILTER1 = ADD TA::ARG1::TYPE = Cst TA::ARG1::CST = 0.250000 TA::FILTER2 = MULT TA::ARG2::TYPE = Cst TA::ARG2::CST = 1.100000
Temporal shift of a variable: this is not implemented yet but I can do it very quickly. I never thought this could be needed. In which case do you need it (out of curiosity)? and then I'll implement it! (really, it should take me less than an hour, but I would like to give an example of why it is useful in the documentation)
Manual flagging of suspicious measurements: this is currently not supported, but you can manually delete measurements (either all or selected parameters) at specific timestamps or within time periods using the SUPPR filter (either on the time variable, for all or on a specific parameter and it can also been restricted to specific stations)
Manual swapping of data between variables: this is currently implemented on a per station basis but for all timestamps with a MOVE command, for example "TA::MOVE = air_temp lutttemperature" would rename all "air_temp" and "lutttemperature" parameters as TA. I think that being able to restrict it to some specific time periods is a good idea (I've just had this need on a dataset I am currently preparing for publication).
Different filtering/resampling/conversion strategies for specific periods and for specific sites: all filters can be restricted to specific station IDs and time periods, for example (here to apply a Hellmann Shielded rain gauge undercatch correction as per (Goodison, 1999) but only on the stations DAV and WFJ and for two specific time periods): PSUM::FILTER1 = UNDERCATCH_WMO PSUM::ARG1::TYPE = Hellmannsh PSUM::ARG1::ONLY = DAV WFJ PSUM::ARG1::WHEN = 2020-07-01 - 2020-07-11T05:50 , 2020-08-01 - 2020-08-15
still on the previous point: if the input data format has changed over time, it is also possible to declare multiple input data plugins. This way, you can mix data from different sources (like from csv files and from a database) or widely different configurations of the same kind of data source (like very different CSV files that can not share even a little bit of the same configuration).

For what I see, the swapping of data between variables restricted to some specific time periods would be the most time consuming to implement. Not so much for technical reasons (I have all the supporting infrastructure to only apply a specific processing to a given set of time periods) but because it shows that instead of having these low-level steps in the Input sections, it deserves a section on its own. In the documentation I called it "raw data processing" but it does not have its own section and I now see that it should (and the syntax should then be made more similar to the filters). Not a terrible amount of work, but more than just adding one trivial feature. On the other hand, since I've just had a case that could benefit from it, I am fully convinced of its merits.

bavay commented 4 years ago

I'll be in vacation next week, but after I will move forward to restructure my "raw data processing". I'm still busy with my dataset preparation and I need once more the possibility to perform such transformations only on a given date range, so I'm now feeling that it is time to do it (I want to publish this dataset with the ini files that have been used to generate it, so I need to have clean solutions for all issues that we had with this data. Since we did a mess with a few stations, I need these kind of features... This has always been the kind of driver for MeteoIO's development: instead of implementing a hacked together script to do it, implement it in MeteoIO so it can be reused in the future).

BaptisteVandecrux commented 4 years ago

Thanks Mathias for the explanation!

It seems that you can modify MeteoIO to do the job quite easily. Also, I wasn't aware that you were directly in charge of the GC-Net dataset preparation. This discussion started because I first thought that I would need to build my own tools to process GC-Net data. It seems that it is not necessary any more and that I can just wait for the processed data.

Please see all the issues the I will be listing here: https://github.com/BaptisteVandecrux/GC-Net-evaluation/issues Most of them should be easily fixed with meteoIO routines. Feel free to contact me if you need help to understand/correct these erroneous measurements.

As with regards to PROMICE, I first thought we could use the same processing tools as for GC-Net stations. But, although it would have made my life easier (I compare PROMICE and GC-Net station measurements), the rest of the PROMICE team do not have time at the moment to re-build their framework to use meteoIO.

Thanks again for taking the time to present meteoIO, which is indeed a powerful tool!

bavay commented 4 years ago

@BaptisteVandecrux Hi! I've finally implemented (almost all) the changes I wanted to see in the handling of what I call "Input Data Editing". It is now more flexible (as any number of such editing, in any order can be defined, per station ID) and supports time restrictions (so such an editing can be restricted to some specific time periods). For example, to swap two radiation sensors (incoming and reflected) on station STB2 for two time periods: [InputEditing] STB2::EDIT1 = SWAP STB2::ARG1::DEST = ISWR STB2::ARG1::SRC = RSWR STB2::ARG1::WHEN = 2019-07-01T13:30 - 2019-08-15 , 2020-03-15 - 2020-04-01T14:52

or to delete the incoming radiation and take it from station SLF2: [InputEditing] STB2::EDIT1 = EXCLUDE STB2::ARG1::EXCLUDE = ISWR STB2::EDIT2 = MERGE STB2::ARG2::MERGE = SLF2 STB2::ARG2::PARAMS = ISWR STB2::ARG2::MERGE_STRATEGY = STRICT_MERGE STB2::ARG2::MERGE_CONFLICTS = CONFLICTS_PRIORITY

For the other example problems you had, this would be addressed by using filters: [Filters] RH::FILTER1 = ADD RH::ARG1::TYPE = Cst RH::ARG1::CST = 0.200000 RH::ARG1::WHEN = 2019-07-01T13:30 - 2019-08-15

The only thing that is still missing is the time shift; although there is already an UnDST filter, it needs some renewed attention for the same reason: I still need to figure out what is the best way to handle potential data overlap after applying a time shift with a time restriction. For example, if I shift all data by 1 day up to 2020-07-01T12:00 and already have some data between 2020-07-01T12:00 and 2020-07-02T12:00, how should I handle this? I could say that the shifted data silently overwrites the old data or I could handle this like the merge conflicts in the Input Data Editing (with a choice of options and warnings being printed on the screen). I also need to take care to other edge effects, such as requesting data from 2020-07-01T13:00 that would need to request older data and time-shift it...

bavay commented 3 years ago

I have made a commit today where I change the name of the UnDST filter into "Shift" since it can be used to perform arbitrary time shifts at any number of periods. This could be used for your data that needs to be shifted in time (the other issues are addressed by an "ADD" filter on the chosen parameter where you just add an offset to the value for any given station and time range(s)).

EnviDat / monitoring-backend

Correction and/or discarding of erroneous measurements #4