GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

How to handle time averaging when missing samples? #30

Open mankoff opened 4 years ago

mankoff commented 4 years ago
TIMESTAMP RECORD MinutesInYear AirPressure_Avg Temperature_Avg Temperature2_Avg RelativeHumidity_Avg WindSpeed
2016-05-01 14:30:00 51 176540 724.3578 -20.10127 -19.557 54.09529 1.062
2016-05-01 14:40:00 52 176550 724.069 -19.78748 -19.11478 51.7011 0.918
2016-05-01 14:50:00 53 176560 724.4035 -19.30627 -18.91911 50.23201 0.636

Above are the first three samples of the EGP 2016 raw file, which starts at 14:30. How should the hourly averages be computed here? Currently the three samples are averaged and reported as the hourly average.

Currently any number of samples > 0 (i.e. 1 through 6) is accepted for computing the hourly average. Daily and monthly averages have different requirements:

https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/a1508a6b06dc1ce749b0fa95c43a7879cb0993f1/IDL/AWSdataprocessing_v3.pro#L980-L986
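For illustration, a minimal pandas sketch of enforcing a minimum sample count per hourly bin, in the spirit of the IDL daily/monthly checks. The data mirror the EGP excerpt above; the `min_samples` cutoff is a hypothetical value, not a PROMICE requirement:

```python
import pandas as pd

# Toy 10-minute samples mirroring the EGP excerpt: the 14:00-15:00 hour
# only has three of its six possible samples.
idx = pd.date_range("2016-05-01 14:30", periods=3, freq="10min")
df = pd.DataFrame({"Temperature_Avg": [-20.10127, -19.78748, -19.30627]}, index=idx)

# Hourly mean, kept only where at least `min_samples` samples exist.
min_samples = 4  # hypothetical cutoff
hourly_mean = df.resample("1h").mean()
counts = df.resample("1h").count()  # count() ignores NaN samples
hourly = hourly_mean.where(counts >= min_samples)
```

With only three samples in the bin and `min_samples = 4`, the 14:00 value comes out NaN; under the current "any number > 0" rule it would be the mean of the three samples.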

PennyHow commented 1 year ago

Currently, resampling to daily and monthly L3 products is performed here in the resampleL3 function:

https://github.com/GEUS-Glaciology-and-Climate/pypromice/blob/2331c02e1e121648c06dd054d370f0b99a6a8f6d/src/pypromice/aws.py#L640

@BaptisteVandecrux has previously mentioned that resampling should only occur if the time step has 90% (?) data coverage. We could either 1. simply return nan entries if there is any nan value over a given time step, like this:

ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean(skipna=False))

This would mean that data coverage has to be 100% for resampling to occur, which might produce a lot of nan entries in the daily and monthly products.
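As a toy illustration of how strict option 1 is (invented data; `apply` is used because pandas' `Resampler.mean` has no `skipna` argument), a single missing hourly value is enough to NaN the entire daily mean:

```python
import numpy as np
import pandas as pd

# 24 hourly values with a single gap.
idx = pd.date_range("2016-05-01", periods=24, freq="1h")
s = pd.Series(20.0, index=idx)
s.iloc[12] = np.nan

# skipna=False propagates the single NaN into the daily value.
daily = s.resample("1D").apply(lambda x: x.mean(skipna=False))
```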

Or we could 2. only perform the resampling if a given time step has at least a certain number of non-nan values (i.e. no more than X nan values, 5 in the example below), with something like this:

threshold = 5
ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean() if x.isnull().sum() <= threshold else np.nan)

I just need to figure out the thresholds for hourly-to-daily and hourly-to-monthly resampling. I don't remember if @mankoff had a smarter solution for this already, though.

mankoff commented 1 year ago

I do not have a solution. I note that the pandas rolling function has a min_periods option which may be useful, but you'd have to do something else to get discrete sampling steps rather than a rolling window. The one-liner from Penny above looks good. As for cutoff values, I have no opinion, but note that they may be variable-specific. Variables with higher variability should have a higher threshold; more stable variables may only need one sample.
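For reference, a small sketch of the min_periods behaviour mentioned above; note it applies to a trailing rolling window, not to discrete resampling bins:

```python
import pandas as pd

idx = pd.date_range("2016-05-01", periods=4, freq="1h")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Windows containing fewer than min_periods observations yield NaN:
# the first two windows hold only 1 and 2 samples.
r = s.rolling("3h", min_periods=3).mean()
```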

PennyHow commented 1 year ago

I like the idea of having different thresholds per variable based on variability - we could define these thresholds in our variables.csv look-up table. However, it requires a little more thought to implement - it's not a simple one-liner.
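One possible shape for that, as a sketch: the variable names, NaN limits, and the idea of reading them from variables.csv are all assumptions here, not the actual pypromice configuration:

```python
import numpy as np
import pandas as pd

# Hypothetical per-variable NaN limits, as could be read from variables.csv.
nan_limits = {"t_u": 2, "p_u": 12}

def resample_with_limits(df, t, nan_limits, default_limit=5):
    """Resample each column, keeping a bin only if its NaN count is within that variable's limit."""
    out = {}
    for col in df.columns:
        limit = nan_limits.get(col, default_limit)
        out[col] = df[col].resample(t).apply(
            lambda x: x.mean() if x.isna().sum() <= limit else np.nan
        )
    return pd.DataFrame(out)

# Toy hourly day with 7 missing values in both columns.
idx = pd.date_range("2016-05-01", periods=24, freq="1h")
df = pd.DataFrame({"t_u": 20.0, "p_u": 1000.0}, index=idx)
df.iloc[3:10] = np.nan

daily = resample_with_limits(df, "1D", nan_limits)
# t_u exceeds its limit of 2 NaNs -> daily value dropped;
# p_u is within its limit of 12 -> daily mean kept.
```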

For now, I will implement something along the lines of what I originally outlined, and then we can revisit this later.