OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Discussion of additional functionalities and QC methods #23

Open maxmargraf opened 5 months ago

maxmargraf commented 5 months ago

Here is a summary of functionality from intense-qc and its extension for sub-hourly data as a basis for discussion of what could be added to pypwsqc:

Edit:

1. intense-qc
   - check individual gauges
   - check neighboring gauges
2. SubHourlyQC, to use on top of intense-qc

lepetersson commented 5 months ago

Summary of titanlib functionalities. It emphasizes spatial checks, is written in C++, and has bindings for Python and R. It was originally developed for temperature data.

Note that it deviates a bit from the checks presented in this paper. Also note that titan and titanlib are different packages; I have not yet clarified how they differ.

Strikethrough bullet points are rejected by Louise; "-->" refers to comments by Louise.

maxmargraf commented 4 months ago

Some additional ideas came up during discussions at EGU with @JochenSeidel and Andras Bardossy regarding the indicator correlation filter in cases with no or sparse primary stations (reliable rain gauges).

JochenSeidel commented 3 months ago

Another aspect that should be considered is the location information of the PWS. Netatmo PWS in particular often have identical coordinates, which indicates that both are probably located incorrectly. pandas has a function to remove duplicates, but I have not seen anything similar in xarray...

cchwala commented 3 months ago

Yes, finding and removing duplicate locations should be made easy.

cchwala commented 3 months ago

@JochenSeidel can you post the pandas code that you used? (done below with screenshot)

cchwala commented 3 months ago

Screenshot of the code, with a comment based on our discussion: the check should be done pair-wise, which the code in the screenshot is probably not doing.

[screenshot from 2024-05-14: pandas duplicate check]
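
Roughly, a per-column check like this (a sketch; the file and column names are assumptions):

```python
import pandas as pd

# hypothetical PWS metadata table with one row per station
meta = pd.read_csv("pws_metadata.csv")  # assumed columns: id, lat, lon

# per-column check: counts stations sharing a 'lat' value and stations
# sharing a 'lon' value, not stations sharing the full (lat, lon) pair
print(meta["lat"].duplicated().sum())
print(meta["lon"].duplicated().sum())
```
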
JochenSeidel commented 3 months ago

I will check if and how this could be done pairwise, i.e. for duplicate entries in both x and y columns. This function should be split into identifying duplicate locations first and then choosing whether they should be kept/further used (e.g. for multiple CML links on a tower) or discarded, e.g. in the case of Netatmo PWS, where identical coordinates point to a false location. In that case, all affected PWS should be removed using: df.drop_duplicates(keep=False)
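
A minimal sketch of that two-step split on toy data (names are for illustration only):

```python
import pandas as pd

# toy metadata; stations a and b share the exact same coordinates
meta = pd.DataFrame(
    {"id": ["a", "b", "c", "d"],
     "lat": [48.1, 48.1, 48.1, 50.2],
     "lon": [11.5, 11.5, 11.6, 12.0]}
)

# step 1: identify all rows whose (lat, lon) pair occurs more than once
dup_mask = meta.duplicated(subset=["lat", "lon"], keep=False)
duplicate_locations = meta[dup_mask]  # stations a and b

# step 2: decide per use case; for Netatmo PWS, where identical coordinates
# point to a false location, drop every member of each duplicate group
meta_clean = meta.drop_duplicates(subset=["lat", "lon"], keep=False)  # c and d remain
```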

cchwala commented 3 months ago

Sounds good.

> This function should be split into identifying duplicate locations first and then choosing whether they should be kept/further used...

You could return the indices along the id dimension, or the actual IDs, which can then be used as an index. The behavior should match pandas.DataFrame.duplicated regarding the keep argument.
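
For example, a sketch of such a helper (get_duplicate_locations is only a working name here; 'lon'/'lat' coordinates along an 'id' dimension are assumed):

```python
import pandas as pd
import xarray as xr

def get_duplicate_locations(ds: xr.Dataset, keep=False):
    # put only the station coordinates into a DataFrame and reuse
    # pandas.DataFrame.duplicated, including its `keep` semantics
    df = pd.DataFrame({"lon": ds.lon.values, "lat": ds.lat.values})
    dup = df.duplicated(keep=keep).to_numpy()
    # return the duplicated IDs, usable e.g. with ds.sel(id=...) or ds.drop_sel(id=...)
    return ds.id.values[dup]
```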

JochenSeidel commented 3 months ago

I just checked the duplicated function from pandas on a larger PWS data set (> 75,000 stations):

[screenshot: per-column duplicate counts]

The numbers differ, indicating that some PWS share only a single identical 'lat' or 'lon' value. If the operation is applied to both coordinate columns together, fewer stations are found:

[screenshot: pairwise duplicate count]
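
A toy example (invented values) of why the two per-column counts differ from each other and from the pairwise count:

```python
import pandas as pd

meta = pd.DataFrame(
    {"lat": [48.1, 48.1, 50.0, 51.2],
     "lon": [11.5, 11.5, 11.5, 13.4]}
)

print(meta["lat"].duplicated().sum())                # 1: one lat value repeats once
print(meta["lon"].duplicated().sum())                # 2: one lon value occurs three times
print(meta.duplicated(subset=["lat", "lon"]).sum())  # 1: only one full pair repeats
```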

cchwala commented 3 months ago

Ah, okay. So that means I was wrong with my concern about your usage of .duplicated(), right?

JochenSeidel commented 3 months ago

It looks like it does what it should do. I've created a small artificial test data set (attached) with some identical lat and/or lon values, and it keeps the correct ones with unique coordinates. The single duplicate info for either lat or lon does not reveal much, though, as it only counts the duplicate occurrences in the corresponding column.

Before: [screenshot]

After: [screenshot]

cchwala commented 3 months ago

👍 Very good

JochenSeidel commented 3 months ago

Attaching the file doesn't work, but I think we are safe to proceed with this.

cchwala commented 3 months ago

Should there be a function get_duplicate_locations, which would use pandas internally, or should we just list the approach using to_dataframe() and then duplicated in pandas?

xarray has a similar function, drop_duplicates, but no function that just returns the indices of the duplicates like DataFrame.duplicated does.
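
The pandas route could look roughly like this (a sketch with a toy Dataset; it builds the coordinate table directly with pandas.DataFrame instead of via to_dataframe(), and assumes 'lon'/'lat' coordinates along an 'id' dimension):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy Dataset; stations a and b share the same coordinates
ds = xr.Dataset(
    coords={
        "id": ["a", "b", "c"],
        "lon": ("id", [11.5, 11.5, 11.6]),
        "lat": ("id", [48.1, 48.1, 50.0]),
    }
)

# flag all stations whose (lon, lat) pair occurs more than once ...
dup = pd.DataFrame(
    {"lon": ds.lon.values, "lat": ds.lat.values}
).duplicated(keep=False)

# ... and keep only the stations with unique coordinates
ds_unique = ds.isel(id=np.flatnonzero(~dup.to_numpy()))  # only "c" remains
```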

JochenSeidel commented 3 months ago

The xarray function works only on dimensions, not on coordinates as in our data...

cchwala commented 3 months ago

Okay. Good to know.

I suggest that you open a new issue for the final discussion and decision regarding functionality for getting duplicated coordinates. I am not yet sure if it is better to promote (in one of our example notebooks) a one-liner based on .to_dataframe and then using pandas, or if we want to have a dedicated function. Let's discuss that in the new issue, which can then also be closed once things are done.

JochenSeidel commented 3 months ago

Final question: should I open the new discussion in poligrain or here? As duplicates might also be relevant for CML, I would go for poligrain.

cchwala commented 3 months ago

Yes, new issue in poligrain because, as you said, this will be general functionality. Please link to this issue then.