Open mankoff opened 4 years ago
Currently, resampling to daily and monthly L3 products is performed here in the resampleL3 function:
@BaptisteVandecrux has previously mentioned that resampling should only occur if the time step has 90% (?) data coverage. We could either (1) simply return nan entries if there is any nan value in a given time step, like this:
ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean(skipna=False))
This would mean that data coverage has to be 100% for resampling to occur, which might produce a lot of nan entries in the daily and monthly products.
Or we could (2) resample only if a given time step has enough non-nan values (i.e. no more than X nan values, 5 in the example below), with something like this:
threshold=5
ds_d = ds_h.to_dataframe().resample(t).apply(lambda x: x.mean() if x.isnull().sum() <= threshold else np.nan)
I just need to figure out the threshold for hourly-to-daily resampling and hourly-to-monthly resampling. I don't remember if @mankoff had a smarter solution for this already though.
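One way to avoid picking separate nan counts for daily and monthly products is to express the threshold as a coverage fraction. The following is only a sketch under assumptions: the function name `resample_with_coverage`, its parameters, the column name `t_air`, and the 0.9 default (taken from the tentative 90% figure above) are illustrative and not part of the existing code.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset


def resample_with_coverage(df, freq, sample_interval="1h", min_coverage=0.9):
    """Average df into bins of length freq; mask bins with too little data.

    Hypothetical helper: name, signature and defaults are assumptions.
    """
    resampler = df.resample(freq)
    mean = resampler.mean()    # nan values are skipped by default
    count = resampler.count()  # non-nan samples per bin and per column

    # Expected samples per bin = bin length / input sampling interval.
    starts = mean.index
    ends = starts + to_offset(freq)
    expected = np.asarray((ends - starts) / pd.Timedelta(sample_interval), dtype=float)

    coverage = count.div(expected, axis=0)
    return mean.where(coverage >= min_coverage)


# Two days of hourly data: day 1 misses 2 samples, day 2 misses 5.
idx = pd.date_range("2016-01-01", periods=48, freq="h")
df = pd.DataFrame({"t_air": 1.0}, index=idx)
df.iloc[0:2] = np.nan
df.iloc[24:29] = np.nan

daily = resample_with_coverage(df, "D")
```

With these inputs, day 1 has 22 of 24 samples (about 92% coverage) and keeps its mean, while day 2 has 19 of 24 (about 79%) and is set to nan. Because the expected count is derived from the bin length, the same fraction should carry over to monthly bins of different lengths, although that has not been checked against real station data.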
I do not have a solution. I note that the Pandas rolling function has an option for min_periods which may be useful, but you'd have to do something else to have discrete sampling steps, not a rolling window. The one line from Penny above looks good. As for cutoff values... I have no opinion, but note that this may be variable-specific. Variables with higher variability should have a higher threshold. More stable variables may only need one sample.
I like the idea of having different thresholds for variables based on variability - we could define these thresholds in our variables.csv look-up table. However, this requires a little more thought on how to implement - it's not a simple one-liner.
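A rough sketch of how per-variable thresholds could be applied once they are available. Everything here is an assumption for illustration: the function name `resample_per_variable`, the variable names `wspd` and `p`, and the threshold values are hypothetical, and the dictionary stands in for whatever column might eventually be added to variables.csv.

```python
import pandas as pd


def resample_per_variable(df, freq, expected_samples, min_coverage, default=0.9):
    """Average df into bins of length freq with one coverage threshold per column.

    Hypothetical helper: min_coverage maps column name -> required fraction
    of non-nan samples; columns not listed use the default.
    """
    resampler = df.resample(freq)
    mean = resampler.mean()
    coverage = resampler.count() / expected_samples

    # One threshold per column, aligned on column names.
    thresholds = pd.Series(min_coverage, dtype=float).reindex(df.columns).fillna(default)
    return mean.where(coverage.ge(thresholds, axis="columns"))


# One day of hourly data with the first 10 hours missing in both columns.
idx = pd.date_range("2016-01-01", periods=24, freq="h")
df = pd.DataFrame({"wspd": 5.0, "p": 900.0}, index=idx)
df.iloc[:10] = float("nan")

# Assumed thresholds: strict for a variable column, lenient for a stable one.
daily = resample_per_variable(df, "D", expected_samples=24,
                              min_coverage={"wspd": 0.9, "p": 0.04})
```

Here both columns have 14 of 24 samples, so `wspd` is masked under its 0.9 threshold while `p` keeps its mean under the lenient one. A fixed `expected_samples` only suits bins of constant length such as days; monthly bins would need the expected count computed per bin.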
For now, I will implement something along the lines of what I originally outlined, and then we can revisit this later.
See the first three samples of the EGP 2016 raw file. The file starts at 14:30. How should the hourly samples be computed here? Currently the 3 samples are averaged and reported as the hourly average. Currently any number of samples > 0 (i.e. 1 through 6) is acceptable to compute the hourly average. Daily and monthly averages have different requirements:
https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/a1508a6b06dc1ce749b0fa95c43a7879cb0993f1/IDL/AWSdataprocessing_v3.pro#L980-L986
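To make the hourly question concrete, here is a small sketch of the "any number of samples > 0" rule next to a stricter alternative. It assumes 10-minute raw samples (six per hour, as implied by "1 through 6"); the function name `hourly_mean`, the date, and the values are made up, and only the 14:30 start time comes from the file described above.

```python
import pandas as pd


def hourly_mean(series, min_samples=1):
    """Hourly average, masked where fewer than min_samples values are present.

    Hypothetical helper; min_samples=1 mimics the current behaviour
    described above, min_samples=6 would require a complete hour.
    """
    resampler = series.resample("h")
    return resampler.mean().where(resampler.count() >= min_samples)


# Made-up raw record starting at 14:30: 3 samples in the first hour, 6 in the next.
idx = pd.date_range("2016-05-01 14:30", periods=9, freq="10min")
raw = pd.Series([1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0], index=idx)

lenient = hourly_mean(raw, min_samples=1)  # 14:00 bin averaged from 3 samples
strict = hourly_mean(raw, min_samples=6)   # 14:00 bin masked as incomplete
```

Under the lenient rule the 14:00 bin is reported as 2.0 even though it only covers 14:30 to 14:50; under the strict rule it becomes nan and only the complete 15:00 bin survives.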