OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Discussion of additional functionalities and QC methods #23

Open maxmargraf opened 5 months ago

maxmargraf commented 5 months ago

Here is a summary of functionality from intense-qc and its extension for sub-hourly data as a basis for discussion of what could be added to pypwsqc:

Edit:

1. intense-qc
   - check individual gauges
   - check neighboring gauges
2. SubHourlyQC, to use on top of intense-qc

lepetersson commented 5 months ago

Summary of titanlib functionalities. It emphasizes spatial checks, is written in C++, and has bindings for Python and R. It was originally developed for temperature data.

Note that it deviates a bit from the checks presented in this paper. Also note that titan and titanlib are different packages; I have not yet clarified how they differ.

Strikethrough bullet points are rejected by Louise; "-->" refers to comments by Louise.

maxmargraf commented 4 months ago

Some additional ideas came up during discussions at EGU with @JochenSeidel and Andras Bardossy regarding the indicator correlation filter in cases with no or sparse primary stations (reliable rain gauges).

JochenSeidel commented 3 months ago

Another aspect that should be considered is the location information of the PWS. Netatmo PWS in particular often have identical coordinates, which indicates that both are probably located incorrectly. pandas has a function to remove duplicates, but I have not seen anything similar in xarray...

cchwala commented 3 months ago

Yes, finding and removing duplicate locations should be made easy.

cchwala commented 3 months ago

@JochenSeidel can you post the pandas code that you used? (done below with screenshot)

cchwala commented 3 months ago

Screenshot of the code, with a comment based on our discussion: the check should be done pair-wise, which the code in the screenshot is probably not doing.

[screenshot from 2024-05-14: pandas duplicate check]
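
Roughly, a per-column check like this (a sketch; the file and column names are assumptions):

```python
import pandas as pd

# hypothetical PWS metadata table with one row per station
meta = pd.read_csv("pws_metadata.csv")  # assumed columns: id, lat, lon

# per-column check: counts stations sharing a 'lat' value and stations
# sharing a 'lon' value, not stations sharing the full (lat, lon) pair
print(meta["lat"].duplicated().sum())
print(meta["lon"].duplicated().sum())
```
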
JochenSeidel commented 3 months ago

I will check if and how this could be done pairwise, i.e. for duplicate entries in both x and y columns. This function should be split into identifying duplicate locations first and then choosing whether they should be kept/further used (e.g. for multiple CML links on a tower) or discarded, e.g. in the case of Netatmo PWS, where identical coordinates point to a false location. In that case, all affected PWS should be removed using: df.drop_duplicates(keep=False)
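
A minimal sketch of that two-step split on toy data (names are for illustration only):

```python
import pandas as pd

# toy metadata; stations a and b share the exact same coordinates
meta = pd.DataFrame(
    {"id": ["a", "b", "c", "d"],
     "lat": [48.1, 48.1, 48.1, 50.2],
     "lon": [11.5, 11.5, 11.6, 12.0]}
)

# step 1: identify all rows whose (lat, lon) pair occurs more than once
dup_mask = meta.duplicated(subset=["lat", "lon"], keep=False)
duplicate_locations = meta[dup_mask]  # stations a and b

# step 2: decide per use case; for Netatmo PWS, where identical coordinates
# point to a false location, drop every member of each duplicate group
meta_clean = meta.drop_duplicates(subset=["lat", "lon"], keep=False)  # c and d remain
```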

cchwala commented 3 months ago

Sounds good.

> This function should be split into identifying duplicate locations first and then choosing whether they should be kept/further used...

You could return the indices along the id dimension, or the actual IDs, which can then be used as an index. The behavior should match pandas.DataFrame.duplicated regarding the keep argument.
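
For example, a sketch of such a helper (get_duplicate_locations is only a working name here; 'lon'/'lat' coordinates along an 'id' dimension are assumed):

```python
import pandas as pd
import xarray as xr

def get_duplicate_locations(ds: xr.Dataset, keep=False):
    # put only the station coordinates into a DataFrame and reuse
    # pandas.DataFrame.duplicated, including its `keep` semantics
    df = pd.DataFrame({"lon": ds.lon.values, "lat": ds.lat.values})
    dup = df.duplicated(keep=keep).to_numpy()
    # return the duplicated IDs, usable e.g. with ds.sel(id=...) or ds.drop_sel(id=...)
    return ds.id.values[dup]
```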

JochenSeidel commented 3 months ago

I just checked the duplicated function from pandas on a larger PWS data set (> 75,000 stations):

[screenshot: per-column duplicate counts]

The numbers differ, indicating that some PWS share only a single identical 'lat' or 'lon' value. If the operation is applied to both coordinate columns together, fewer stations are found:

[screenshot: pairwise duplicate count]
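
A toy example (invented values) of why the two per-column counts differ from each other and from the pairwise count:

```python
import pandas as pd

meta = pd.DataFrame(
    {"lat": [48.1, 48.1, 50.0, 51.2],
     "lon": [11.5, 11.5, 11.5, 13.4]}
)

print(meta["lat"].duplicated().sum())                # 1: one lat value repeats once
print(meta["lon"].duplicated().sum())                # 2: one lon value occurs three times
print(meta.duplicated(subset=["lat", "lon"]).sum())  # 1: only one full pair repeats
```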

cchwala commented 3 months ago

Ah, okay. So that means I was wrong with my concern about your usage of .duplicated(), right?

JochenSeidel commented 3 months ago

It looks like it does what it should do. I've created a small artificial test data set (attached) with some identical lat and/or lon values, and it keeps the correct ones with unique coordinates. The single duplicate info for either lat or lon does not reveal much, though, as it only counts the duplicate occurrences in the corresponding column.

Before: [screenshot]

After: [screenshot]

cchwala commented 3 months ago

👍 Very good

JochenSeidel commented 3 months ago

Attaching the file doesn't work, but I think we are safe to proceed with this.

cchwala commented 3 months ago

Should there be a function get_duplicate_locations, which would use pandas internally, or should we just list the approach using to_dataframe() and then duplicated in pandas?

xarray has a similar function, drop_duplicates, but no function that just returns the indices of the duplicates like DataFrame.duplicated does.
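
The pandas route could look roughly like this (a sketch with a toy Dataset; it builds the coordinate table directly with pandas.DataFrame instead of via to_dataframe(), and assumes 'lon'/'lat' coordinates along an 'id' dimension):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy Dataset; stations a and b share the same coordinates
ds = xr.Dataset(
    coords={
        "id": ["a", "b", "c"],
        "lon": ("id", [11.5, 11.5, 11.6]),
        "lat": ("id", [48.1, 48.1, 50.0]),
    }
)

# flag all stations whose (lon, lat) pair occurs more than once ...
dup = pd.DataFrame(
    {"lon": ds.lon.values, "lat": ds.lat.values}
).duplicated(keep=False)

# ... and keep only the stations with unique coordinates
ds_unique = ds.isel(id=np.flatnonzero(~dup.to_numpy()))  # only "c" remains
```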

JochenSeidel commented 3 months ago

The xarray function works only on dimensions, not on coordinates as in our data...

cchwala commented 3 months ago

Okay. Good to know.

I suggest that you open a new issue for the final discussion and decision regarding functionality for getting duplicated coordinates. I am not yet sure if it is better to promote (in one of our example notebooks) a one-liner based on .to_dataframe and then using pandas, or if we want to have a dedicated function. Let's discuss that in the new issue, which can then also be closed once things are done.

JochenSeidel commented 3 months ago

Final question: should I open the new discussion in poligrain or here? As duplicates might also be relevant for CML, I would go for poligrain.

cchwala commented 3 months ago

Yes, new issue in poligrain because, as you said, this will be general functionality. Please link to this issue then.