GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

How to handle time averaging when missing samples? #30

Open mankoff opened 4 years ago

mankoff commented 4 years ago
TIMESTAMP RECORD MinutesInYear AirPressure_Avg Temperature_Avg Temperature2_Avg RelativeHumidity_Avg WindSpeed
2016-05-01 14:30:00 51 176540 724.3578 -20.10127 -19.557 54.09529 1.062
2016-05-01 14:40:00 52 176550 724.069 -19.78748 -19.11478 51.7011 0.918
2016-05-01 14:50:00 53 176560 724.4035 -19.30627 -18.91911 50.23201 0.636

Above are the first three samples of the EGP 2016 raw file, which starts at 14:30. How should the hourly averages be computed here? Currently the three samples are averaged and reported as the hourly average.

Currently any number of samples > 0 (i.e. 1 through 6) is accepted for computing the hourly average. Daily and monthly averages have different requirements:

https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/a1508a6b06dc1ce749b0fa95c43a7879cb0993f1/IDL/AWSdataprocessing_v3.pro#L980-L986
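For illustration, a minimal pandas sketch of enforcing a minimum sample count per hourly bin, in the spirit of the IDL daily/monthly checks. The data mirror the EGP excerpt above; the `min_samples` cutoff is a hypothetical value, not a PROMICE requirement:

```python
import pandas as pd

# Toy 10-minute samples mirroring the EGP excerpt: the 14:00-15:00 hour
# only has three of its six possible samples.
idx = pd.date_range("2016-05-01 14:30", periods=3, freq="10min")
df = pd.DataFrame({"Temperature_Avg": [-20.10127, -19.78748, -19.30627]}, index=idx)

# Hourly mean, kept only where at least `min_samples` samples exist.
min_samples = 4  # hypothetical cutoff
hourly_mean = df.resample("1h").mean()
counts = df.resample("1h").count()  # count() ignores NaN samples
hourly = hourly_mean.where(counts >= min_samples)
```

With only three samples in the bin and `min_samples = 4`, the 14:00 value comes out NaN; under the current "any number > 0" rule it would be the mean of the three samples.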

PennyHow commented 1 year ago

Currently, resampling to daily and monthly L3 products is performed here in the resampleL3 function:

https://github.com/GEUS-Glaciology-and-Climate/pypromice/blob/2331c02e1e121648c06dd054d370f0b99a6a8f6d/src/pypromice/aws.py#L640

@BaptisteVandecrux has previously mentioned that resampling should only occur if the time step has 90% (?) data coverage. We could either 1. simply return nan entries if there is any nan value over a given time step, like this:

ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean(skipna=False))

This would mean that data coverage has to be 100% for resampling to occur, which might produce a lot of nan entries in the daily and monthly products.
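As a toy illustration of how strict option 1 is (invented data; `apply` is used because pandas' `Resampler.mean` has no `skipna` argument), a single missing hourly value is enough to NaN the entire daily mean:

```python
import numpy as np
import pandas as pd

# 24 hourly values with a single gap.
idx = pd.date_range("2016-05-01", periods=24, freq="1h")
s = pd.Series(20.0, index=idx)
s.iloc[12] = np.nan

# skipna=False propagates the single NaN into the daily value.
daily = s.resample("1D").apply(lambda x: x.mean(skipna=False))
```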

Or we could 2. only perform the resampling if a given time step has at least a certain number of non-nan values (i.e. no more than X nan values, 5 in the example below), with something like this:

threshold = 5
ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean() if x.isnull().sum() <= threshold else np.nan)

I just need to figure out the thresholds for hourly-to-daily and hourly-to-monthly resampling. I don't remember if @mankoff had a smarter solution for this already, though.

mankoff commented 1 year ago

I do not have a solution. I note that the pandas rolling function has a min_periods option which may be useful, but you'd have to do something else to get discrete sampling steps rather than a rolling window. The one-liner from Penny above looks good. As for cutoff values, I have no opinion, but note that they may be variable-specific. Variables with higher variability should have a higher threshold; more stable variables may only need one sample.
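For reference, a small sketch of the min_periods behaviour mentioned above; note it applies to a trailing rolling window, not to discrete resampling bins:

```python
import pandas as pd

idx = pd.date_range("2016-05-01", periods=4, freq="1h")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Windows containing fewer than min_periods observations yield NaN:
# the first two windows hold only 1 and 2 samples.
r = s.rolling("3h", min_periods=3).mean()
```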

PennyHow commented 1 year ago

I like the idea of having different thresholds per variable based on variability - we could define these thresholds in our variables.csv look-up table. However, it requires a little more thought to implement - it's not a simple one-liner.
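One possible shape for that, as a sketch: the variable names, NaN limits, and the idea of reading them from variables.csv are all assumptions here, not the actual pypromice configuration:

```python
import numpy as np
import pandas as pd

# Hypothetical per-variable NaN limits, as could be read from variables.csv.
nan_limits = {"t_u": 2, "p_u": 12}

def resample_with_limits(df, t, nan_limits, default_limit=5):
    """Resample each column, keeping a bin only if its NaN count is within that variable's limit."""
    out = {}
    for col in df.columns:
        limit = nan_limits.get(col, default_limit)
        out[col] = df[col].resample(t).apply(
            lambda x: x.mean() if x.isna().sum() <= limit else np.nan
        )
    return pd.DataFrame(out)

# Toy hourly day with 7 missing values in both columns.
idx = pd.date_range("2016-05-01", periods=24, freq="1h")
df = pd.DataFrame({"t_u": 20.0, "p_u": 1000.0}, index=idx)
df.iloc[3:10] = np.nan

daily = resample_with_limits(df, "1D", nan_limits)
# t_u exceeds its limit of 2 NaNs -> daily value dropped;
# p_u is within its limit of 12 -> daily mean kept.
```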

For now, I will implement something along the lines of what I originally outlined, and then we can revisit this later.