Add HI filter - Githubissues

OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)

https://pypwsqc.readthedocs.io

BSD 3-Clause "New" or "Revised" License

Add HI filter #16

Closed lepetersson closed 5 months ago

lepetersson commented 6 months ago

high influx filter function, returning data variable hi_flag with 0 (no high influx), 1 (high influx), -1 (not enough data to say) per PWS and time step
function currently builds on xarray.where --> change?
here is an idea of how the function can be restructured to align with other filters
added notebook with step-by-step explanation
the data preparation (parameters max_distance and n_stat, reproject data+calculate distance matrix+calculate reference/median) is common with the faulty zeroes filter and should probably be moved elsewhere

cchwala commented 6 months ago

🎉

cchwala commented 5 months ago

Note: We had to use git push --force because I did something wrong with the rebase and this was the quickest fix. Because of that the time stamps of the old commits were lost.

cchwala commented 5 months ago

@lepetersson We should disable the ruff rule PD003 because it produces false positives when using xarray instead of pandas, which is what the rule was written for; see e.g. https://github.com/astral-sh/ruff/issues/8846.

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (e965b76) to head (c9173f8). Report is 4 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main       #16   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            2         2
  Lines           23        29    +6
=========================================
+ Hits            23        29    +6
```

cchwala commented 5 months ago

@lepetersson One note regarding the tests that have to be added now: As we saw when processing the example data, there are flags that are 1 and flags that are -1, as expected. But, if I recall correctly, they do not occur for stations that are next to each other along the id dimension. Hence, if you want to use a subset based on time and id, but with only two or three ids to keep it small, you could also pick ids that are not adjacent. Something like ds_pws.sel(id=['ams13', 'ams88', 'ams102']) works to do that.
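As a rough illustration of that kind of subsetting, here is a minimal sketch on a synthetic dataset. The variable name `rainfall`, the 5-minute time step, the number of stations and the random values are assumptions made for this example only; they are not taken from the real example data.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for ds_pws (ids, time step and values are made up).
ids = [f"ams{i}" for i in range(1, 111)]
times = np.datetime64("2020-01-01T00:00") + np.arange(12) * np.timedelta64(5, "m")
rng = np.random.default_rng(0)

ds_pws = xr.Dataset(
    {"rainfall": (("id", "time"), rng.random((len(ids), len(times))))},
    coords={"id": ids, "time": times},
)

# Pick three stations that are not adjacent along `id`, plus a short time window.
ds_subset = ds_pws.sel(id=["ams13", "ams88", "ams102"]).isel(time=slice(0, 6))
print(dict(ds_subset.sizes))
```

Label-based selection with a list keeps the ids in the order given, so the test data stays small while still covering stations that are far apart along the id dimension.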

cchwala commented 5 months ago

The error in the test here

>        hi_array = (condition1 | condition2).astype(int)
>        hi_array.data[nbrs_not_nan < nstat] = -1
E       ValueError: invalid __array_struct__

is maybe there because you pass a numpy.array in the test function but you use xarray.DataArrays in the example notebook.

I recall that I added the hi_array.data[] statement, but was not sure if that works correctly... 😇 🙈

One solution could be to make sure that hi_array is a numpy.array when doing the indexing. You could do that with something like

hi_array = np.asarray(hi_array)

before doing hi_array.data[...] = -1.
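A minimal sketch of that idea, under some assumptions: the helper name `combine_hi_flags` and the toy inputs are hypothetical, and the two conditions are passed in as ready-made boolean arrays instead of being computed from rainfall data. One detail worth noting is that on a plain numpy array `.data` is a raw memory buffer, not an array, so this sketch indexes the converted array directly instead of going through `.data`.

```python
import numpy as np


def combine_hi_flags(condition1, condition2, nbrs_not_nan, nstat):
    """Hypothetical helper: merge two boolean conditions into 0/1/-1 flags."""
    # np.asarray accepts numpy arrays as well as xarray.DataArrays.
    cond = np.asarray(condition1) | np.asarray(condition2)
    hi_array = cond.astype(int)  # 1 = high influx, 0 = no high influx
    # Too few valid neighbours: not enough data to decide.
    hi_array[np.asarray(nbrs_not_nan) < nstat] = -1
    return hi_array


# Toy inputs with shape (id, time); all values are made up.
condition1 = np.array([[True, False, False], [False, False, False]])
condition2 = np.array([[False, False, True], [False, False, False]])
nbrs_not_nan = np.array([[6, 7, 2], [1, 8, 9]])

flags = combine_hi_flags(condition1, condition2, nbrs_not_nan, nstat=5)
print(flags)
```

Because the inputs are converted up front, the same function body should behave the same whether the test passes numpy arrays or the notebook passes xarray.DataArrays; the result is a plain numpy array in both cases.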

cchwala commented 5 months ago

Thanks @lepetersson 🍾 🍾 🍾