Fixing flagged data - Githubissues

mankoff commented 4 years ago

This is closely related to flagging data #18

In addition to flagging, we want to FIX data.

One example is when we see that the station was pointing in the wrong direction.

This could be flagged (no flag for this is currently defined, but perhaps we add a ROTATION flag).

A simple fix is to rotate the data so that the mean wind direction matches the historical mean. Alternatives are to rotate by the reported direction minus 180°, or by the direction measured at the next visit, etc.
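A minimal sketch of the first option, assuming wind direction is stored in degrees and that a historical mean direction is known. The function names and the numbers are hypothetical. The mean is computed on the unit circle so that directions near 0°/360° do not average to a meaningless value.

```python
import numpy as np

def circular_mean_deg(direction_deg):
    """Mean of angles in degrees, computed on the unit circle."""
    rad = np.deg2rad(np.asarray(direction_deg, dtype=float))
    mean = np.arctan2(np.nanmean(np.sin(rad)), np.nanmean(np.cos(rad)))
    return np.rad2deg(mean) % 360.0

def rotate_direction(direction_deg, offset_deg):
    """Add a rotation offset and wrap the result into [0, 360)."""
    return (np.asarray(direction_deg, dtype=float) + offset_deg) % 360.0

# Hypothetical numbers: the station reports about 100 deg, history says 280 deg.
measured = np.array([95.0, 100.0, 105.0])
historical_mean = 280.0
offset = (historical_mean - circular_mean_deg(measured)) % 360.0
fixed = rotate_direction(measured, offset)
```

The same `rotate_direction` helper would also cover the other two options; only the way `offset` is chosen changes.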

It is probably difficult to meta-program fixes via database entries (e.g. ROTATION flag, FIX value) because the mathematical operations needed to apply fixes are diverse. Some are additive (add the rotation offset), others may be multiplicative, or benefit from linear interpolation, or require complex functions of other variables to estimate a bad value, etc.

The solution here will need to include a database and code. Perhaps the database is similar to (or part of) the flags DB: What sensor(s), time(s) and station(s) have which problem(s). Perhaps a new field is a function name (possibly with one or more values defined in the fix DB). Those functions, which must be implemented in the code, are then called with the correct arguments.
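One way this "function name in the database, function in the code" idea could look is sketched below. It is only an illustration: the registry `ADJUST_FUNCTIONS`, the helper `apply_fix` and the column name `wdir` are assumptions, not existing code.

```python
import pandas as pd

# Each fix is an ordinary function of (series, value).
def add(series, value):
    return series + value

def multiply(series, value):
    return series * value

def rotate(series, value):
    return (series + value) % 360.0

# The database stores the key; the code owns the implementation.
ADJUST_FUNCTIONS = {"add": add, "multiply": multiply, "rotate": rotate}

def apply_fix(df, variable, t0, t1, function_name, value):
    """Apply one database row to a copy of df (df is indexed by time)."""
    out = df.copy()
    func = ADJUST_FUNCTIONS[function_name]  # KeyError if the name is unknown
    out.loc[t0:t1, variable] = func(out.loc[t0:t1, variable], value)
    return out

index = pd.date_range("2020-01-01", periods=4, freq="D")
df = pd.DataFrame({"wdir": [350.0, 10.0, 20.0, 30.0]}, index=index)
fixed = apply_fix(df, "wdir", "2020-01-02", "2020-01-03", "rotate", 180.0)
```

Adding a new kind of fix then means writing one function and registering it under a name; the database format does not change.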

BaptisteVandecrux commented 3 years ago

Here are some examples of flag-fix DB files (check out the ReadMe): https://github.com/GEUS-PROMICE/AWS-data/tree/main/flag-fix Any suggestions on what they should be called, on their structure, or on how they should be used?

BaptisteVandecrux commented 3 years ago

Here is a function that uses the database for the "add" function. I am planning to add cases for "multiply", "rotate", "smooth" and "custom_function_1".

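import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def adjust_data(df, site):
    """Apply the adjustments listed in metadata/flag-fix/<site>.csv to a copy of df.

    df is assumed to be indexed by time, since the periods are selected with .loc.
    """
    df_out = df.copy()
    adj_file = 'metadata/flag-fix/' + site + '.csv'
    if not os.path.isfile(adj_file):
        print('No erroneous data listed for ' + site)
        return df_out

    adj_info = pd.read_csv(adj_file)
    adj_info = adj_info.sort_values(by=['variable', 't0'])
    adj_info.set_index(['variable', 't0'], drop=False, inplace=True)

    for var in np.unique(adj_info.variable):
        if var not in df.columns:
            print(var + ' not in datafile')
            continue
        print('Adjusting ' + var)
        for t0, t1, func, val in zip(adj_info.loc[var].t0,
                                     adj_info.loc[var].t1,
                                     adj_info.loc[var].adjust_function,
                                     adj_info.loc[var].adjust_value):
            print(t0, func, val)
            # An empty t1 is read as NaN: adjust until the end of the record.
            # pd.isnull also accepts date strings, where np.isnan would raise.
            if pd.isnull(t1):
                t1 = df_out.index[-1]
            if func == 'add':
                df_out.loc[t0:t1, var] = df_out.loc[t0:t1, var].values + val

        # Plot the variable before and after adjustment.
        fig = plt.figure()
        df[var].plot(label='before adjustment')
        df_out[var].plot(label='after adjustment')
        plt.xlabel('Time')
        plt.ylabel(var)
        plt.legend()
        plt.tight_layout()
        fig.savefig('figures/' + site + '_adj_' + var + '.jpeg')
        plt.close(fig)  # release the figure once it has been saved
    return df_out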

Example for DepthPressureTransducer_Cor at KAN_L.

In KAN_L.csv

t0,t1,variable,adjust_function,adjust_value,comment,URL_graphic
2016-07-27T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-6.297000000000001,manually adjusted by bav,https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_dpt_1.png
2016-07-29T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-0.1,manually adjusted by bav,https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_dpt_1.png
2019-07-11T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-4.478,manually adjusted by bav,https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_dpt_1.png

Result: [figure not included]
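Since t1 is empty in all three rows of the KAN_L example, each "add" runs to the end of the record, so the offsets accumulate over time. The sketch below works this out for the values listed above; the helper `total_offset` is hypothetical, and the comment and URL columns are left out for brevity.

```python
import io
import pandas as pd

csv_text = """t0,t1,variable,adjust_function,adjust_value
2016-07-27T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-6.297000000000001
2016-07-29T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-0.1
2019-07-11T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,-4.478
"""
adj = pd.read_csv(io.StringIO(csv_text))
adj["t0"] = pd.to_datetime(adj["t0"], utc=True)

def total_offset(adj, when):
    """Sum the 'add' values of open-ended rows that started on or before `when`."""
    when = pd.Timestamp(when)
    active = adj[(adj["adjust_function"] == "add")
                 & adj["t1"].isna()
                 & (adj["t0"] <= when)]
    return active["adjust_value"].sum()

# Only the first row is active on 2016-07-28; all three are active in 2020.
offset_2016 = total_offset(adj, "2016-07-28T00:00:00+00:00")
offset_2020 = total_offset(adj, "2020-01-01T00:00:00+00:00")
```

With these rows, data after 2019-07-11 ends up shifted by the sum of all three values (about -10.875 m), not by -4.478 m alone.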

BaptisteVandecrux commented 3 years ago

I would like to get some feedback about the data adjustment files:

What should we call them?
Where should they be stored?
Is their format OK?
Are they handled sensibly by the adjustment script?

Current state of https://github.com/GEUS-PROMICE/PROMICE-AWS-toolbox :

adjust_data

[to do: add more adjustment functions (rotation, smoothing... etc)]

Illustration: [figure not included]

This function reads the station-specific adjustment files in metadata/flag-fix/ (one CSV file per station, e.g. KAN_L.csv), where the required adjustments are reported for each variable.

These error files have the following structure:

t0	t1	variable	adjust_function	adjust_value	comment	URL_graphic
2017-05-23 10:00:00	2017-06-10 11:00:00	DepthPressureTransducer_Cor	add	-2	manually adjusted by bav	https://raw.githubusercontent.com/GEUS-PROMICE/PROMICE-AWS-toolbox/master/figures/UPE_L_adj_DepthPressureTransducer_Cor(m).jpeg
...	...	...	...	...	...	...

with

field	meaning
t0	ISO date of the beginning of flagged period
t1	ISO date of the end of flagged period
variable	name of the variable to be flagged. [to do: '*' for all variables]
adjust_function	function that needs to be applied over the given period: add, filter_min, filter_max, rotate or smooth
adjust_value	input value to the adjustment function
comment	Description of the issue
URL_graphic	URL to illustration or Github issue thread

The file is comma-separated:

t0,t1,variable,adjust_function,adjust_value,comment,URL_graphic
2015-03-01T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,2.3,manually adjusted by bav,https://github.com/GEUS-PROMICE/PROMICE-AWS-toolbox/blob/master/Report_toc.md#s15-2-1
...
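A minimal sketch of how such a file could be loaded and checked before use. The helper `read_adjustments` and its `end_of_record` argument are assumptions for illustration; the column names are the ones listed in the table above. An empty t1 is filled with the end of the record so that every row has a closed period.

```python
import io
import pandas as pd

REQUIRED = ["t0", "t1", "variable", "adjust_function", "adjust_value"]

def read_adjustments(source, end_of_record):
    """Read an adjustment CSV, parse the dates and fill open-ended periods."""
    adj = pd.read_csv(source)
    missing = [c for c in REQUIRED if c not in adj.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    end = pd.to_datetime(end_of_record, utc=True)
    adj["t0"] = pd.to_datetime(adj["t0"], utc=True)
    # Empty t1 cells are read as NaN, become NaT, and are replaced by `end`.
    adj["t1"] = pd.to_datetime(adj["t1"], utc=True).fillna(end)
    return adj.sort_values(["variable", "t0"]).reset_index(drop=True)

text = (
    "t0,t1,variable,adjust_function,adjust_value,comment,URL_graphic\n"
    "2015-03-01T00:00:00+00:00,,DepthPressureTransducer_Cor(m),add,2.3,"
    "manually adjusted by bav,"
    "https://github.com/GEUS-PROMICE/PROMICE-AWS-toolbox/blob/master/Report_toc.md#s15-2-1\n"
)
adj = read_adjustments(io.StringIO(text), "2021-01-01")
```

Failing early on a missing column keeps a malformed file from silently skipping adjustments.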

The function adjust_data then applies the given function to the given variable in the dataframe. The adjusted variable is stored with the suffix _adj in the final dataframe, and the original data is kept.
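A small sketch of that "keep the original, store the adjusted copy" behaviour for an "add" adjustment, using the period and value from the table example above. The helper name, the short variable name `dpt` and the data values are made up for illustration.

```python
import pandas as pd

def add_adjusted_column(df, variable, t0, t1, value):
    """Return a copy of df with an extra '<variable>_adj' column."""
    out = df.copy()
    adjusted = out[variable].copy()
    adjusted.loc[t0:t1] = adjusted.loc[t0:t1] + value
    out[variable + "_adj"] = adjusted  # original column is left untouched
    return out

index = pd.to_datetime([
    "2017-05-23 09:00", "2017-05-23 10:00",
    "2017-06-10 11:00", "2017-06-10 12:00",
])
df = pd.DataFrame({"dpt": [5.0, 5.0, 5.0, 5.0]}, index=index)
out = add_adjusted_column(df, "dpt", "2017-05-23 10:00:00", "2017-06-10 11:00:00", -2)
```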

mankoff commented 3 years ago

This all looks good and I suggest you keep using it for now if it works for you. Given that this is early in the development phase, I'm assuming this will all be re-written at some point later.

I will need to call this data-fixing function early in the processing pipeline. Many of the fixed quantities (station rotation, temperature, etc.) are used in some of the first equations that derive dsr, dlr, usr, ulr, etc., so the fixes need to be applied at the beginning of the L0 to L1 processing step.

I can't comment in detail on the format and script implementation until I've spent some time using it.

BaptisteVandecrux commented 1 year ago

First implementation submitted as PR.

Remaining points that will need to be addressed:

Do we provide a quality field for the variable being adjusted that would indicate whether a value has been adjusted or not?
We need a plotting routine to be run after the processing (so it doesn't slow down NRT data upload), that describes how the data has been manipulated.
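For the first point, one possible shape of such a quality field is a boolean column next to the variable, set to True wherever an adjustment was applied. This is only a sketch: the column suffix `_adj_flag`, the helper name and the variable name `t_air` are hypothetical.

```python
import pandas as pd

def adjust_with_flag(df, variable, t0, t1, value):
    """Apply an additive adjustment and record where it was applied."""
    out = df.copy()
    flag = variable + "_adj_flag"  # hypothetical name for the quality field
    if flag not in out.columns:
        out[flag] = False
    out.loc[t0:t1, variable] = out.loc[t0:t1, variable] + value
    out.loc[t0:t1, flag] = True
    return out

index = pd.date_range("2020-01-01", periods=4, freq="D")
df = pd.DataFrame({"t_air": [1.0, 2.0, 3.0, 4.0]}, index=index)
out = adjust_with_flag(df, "t_air", "2020-01-02", "2020-01-03", 0.5)
```

A plotting routine run after the processing could then use the flag column to highlight the adjusted periods without re-reading the adjustment files.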

BaptisteVandecrux commented 10 months ago

Flags and adjustment CSVs are working quite well. I'm closing this one now.

The flagged data can be plotted using scripts like https://github.com/GEUS-Glaciology-and-Climate/PROMICE-AWS-diagnostic or visualized on https://github.com/GEUS-Glaciology-and-Climate/PROMICE-AWS-diagnostic/blob/main/plot_compilations/flags_toc.md

https://github.com/GEUS-Glaciology-and-Climate/PROMICE-AWS-diagnostic could be run automatically if needed.

Right now, suspicious data is simply removed from the level 3 data, while the level 0 data remains intact and the flagging procedure is reproducible.

GEUS-Glaciology-and-Climate / pypromice

Fixing flagged data #19
