GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

Flagging data #18

Closed mankoff closed 9 months ago

mankoff commented 3 years ago

In addition to adding the NEAD header, additional effort is needed to create the L0M files.

For example, the CEN 2019 raw data has a jump at record 44 (see the table below). Perhaps this is the installation date? More importantly, the metadata.csv has 0 for the pt_z_{coef,p_coef,factor} and pt_antifreeze values, and IceHeight_Avg is -999 throughout this file except for a few samples prior to record 44. Everything beyond record 44 suggests there is no pressure transducer at CEN, and col_Hpt in the metadata.csv file is set to 0, which means the v3 IDL code skipped this column and no Hpt (or, in v4, z_pt) processing occurred.

v4 needs a new way to know that it should not be processing the Hpt / z_pt column.

The first 50 records (0-49) and some example columns from the CEN 2019 raw data:

| TIMESTAMP | RECORD | WindSpeed | WindDirection | WindDirection_SD | ShortwaveRadiationIn_Avg | IceHeight_Avg |
| -- | -- | -- | -- | -- | -- | -- |
| 2017-05-23 10:00:00 | 0 | 17.33 | 172.5 | 0.014 | -999 | 0.936 |
| 2017-05-23 10:10:00 | 1 | 0 | 0 | 0 | -999 | -999 |
| 2017-05-23 10:20:00 | 2 | 0 | 0 | 0 | -999 | -999 |
| 2017-05-23 10:30:00 | 3 | 0 | 0 | 0 | -999 | -999 |
| 2017-05-23 10:40:00 | 4 | 0 | 0 | 0 | -999 | 0.916 |
| 2017-05-23 10:50:00 | 5 | 0 | 0 | 0 | -999 | 0.916 |
| 2017-05-23 11:00:00 | 6 | 0 | 0 | 0 | -999 | 0.916 |
| 2017-05-23 11:30:00 | 7 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 11:40:00 | 8 | 0 | 0 | 0 | -999 | 0.956 |
| 2017-05-23 11:50:00 | 9 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 12:00:00 | 10 | 0 | 0 | 0 | -999 | 0.956 |
| 2017-05-23 12:10:00 | 11 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 12:20:00 | 12 | 0 | 0 | 0 | -999 | 0.963 |
| 2017-05-23 12:30:00 | 13 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 12:40:00 | 14 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 12:50:00 | 15 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 13:00:00 | 16 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 13:10:00 | 17 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 13:20:00 | 18 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 13:30:00 | 19 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 13:40:00 | 20 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 13:50:00 | 21 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 14:00:00 | 22 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 14:10:00 | 23 | 0 | 0 | 0 | -999 | 0.929 |
| 2017-05-23 14:20:00 | 24 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 14:30:00 | 25 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 14:40:00 | 26 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 14:50:00 | 27 | 0 | 0 | 0 | -999 | 0.936 |
| 2017-05-23 15:00:00 | 28 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 15:10:00 | 29 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 15:20:00 | 30 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 15:30:00 | 31 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 15:40:00 | 32 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 15:50:00 | 33 | 0 | 0 | 0 | -999 | 0.943 |
| 2017-05-23 16:00:00 | 34 | 0 | 0 | 0 | -999 | 0.963 |
| 2017-05-23 16:10:00 | 35 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 16:20:00 | 36 | 0 | 0 | 0 | -999 | 0.949 |
| 2017-05-23 16:30:00 | 37 | 0 | 0 | 0 | -999 | 0.956 |
| 2017-06-07 16:10:00 | 38 | 0 | 0 | 0 | -999 | 0.909 |
| 2017-06-07 16:20:00 | 39 | 0 | 0 | 0 | -999 | -999 |
| 2017-06-07 16:30:00 | 40 | 0 | 0 | 0 | -999 | -999 |
| 2017-06-07 16:40:00 | 41 | 0 | 0 | 0 | -999 | 0.983 |
| 2017-06-09 09:00:00 | 42 | 0 | 0 | 0 | -999 | 1.003 |
| 2017-06-10 10:50:00 | 43 | 0 | 0 | 0 | -999 | -999 |
| 2017-06-10 11:00:00 | 44 | 0 | 0 | 0 | -999 | -999 |
| 2017-07-24 17:50:00 | 45 | 6.396 | 294.1 | 0 | 845.0682 | -999 |
| 2017-07-24 18:00:00 | 46 | 6.538 | 291 | 0 | 508.0982 | -999 |
| 2017-07-24 18:10:00 | 47 | 6.219 | 288 | 0 | 699.3785 | -999 |
| 2017-07-24 18:20:00 | 48 | 6.269 | 287.1 | 0 | 608.1984 | -999 |
| 2017-07-24 18:30:00 | 49 | 6.265 | 289.8 | 0 | 771.0673 | -999 |
mankoff commented 3 years ago

Some of these first few rows of data appear to be fully processed through the existing pipeline and appear in the final CEN_hour_v03.txt output file.

CEN_2019_raw.txt has

2017-05-23 11:00:00 6 206570 1012.915
2017-05-23 11:30:00 7 206600 1012.71
2017-05-23 11:40:00 8 206610 1012.641
2017-05-23 11:50:00 9 206620 1012.706

Where the last column is air pressure. The average of those values is 1012.743.

CEN_hour_v03.txt has

2017 5 23 11 143 6353 1012.74
mankoff commented 3 years ago

A new flags DB solves this.

The initial example entries are below. The first line sets all sensors to NaN (flag = -1 -> NaN) for the specified times at CEN. The second line sets the z_pt values to NaN at all times at CEN. The station and variable fields can contain a string representing one item, a space-delimited list of strings (e.g. CEN EGP KAN_M or z_pt p), or * for all.

| t0 | t1 | station | variable | flag | comment |
| -- | -- | -- | -- | -- | -- |
| 2017-05-23 10:00:00 | 2017-06-10 11:00:00 | CEN | * | -1 | suspicious early data, including a pressure logger which isn't at the station? |
| n/a | n/a | CEN | z_pt | -1 | not processed per v3 |
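
For illustration, here is a minimal sketch (not the processing code itself) of how such a flag DB could be applied with pandas, assuming a comma-separated file with the columns shown above and a time-indexed DataFrame per station; the helper name apply_flag_db is hypothetical:

import numpy as np
import pandas as pd

def apply_flag_db(df, station, flag_db='flags.csv'):
    """Set flagged samples to NaN for one station (flag = -1 -> NaN)."""
    flags = pd.read_csv(flag_db, parse_dates=['t0', 't1'])
    for _, row in flags.iterrows():
        # 'station' and 'variable' may be one item, a space-delimited list, or '*'
        if row['station'] != '*' and station not in row['station'].split():
            continue
        variables = list(df.columns) if row['variable'] == '*' else row['variable'].split()
        # 'n/a' parses to NaT, meaning the flag applies to the whole record
        t0 = df.index[0] if pd.isnull(row['t0']) else row['t0']
        t1 = df.index[-1] if pd.isnull(row['t1']) else row['t1']
        if row['flag'] == -1:
            df.loc[t0:t1, [v for v in variables if v in df.columns]] = np.nan
    return df
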
mankoff commented 3 years ago

Flag documentation: https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/main/flags.org Flag DB: https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/main/flags.csv

ghost commented 3 years ago

I understand the flags DB is one single file applying to the entire life of all stations at all processing levels, right? If so, this will quickly result in a very large and hard-to-manage file. Lots of lines with OOL flags will be added automatically, so updating manual flags will become difficult. What happens when a manual flag is updated, leading to new OOL checks with a different outcome? Filenames will also need to be tracked, as you correctly note in https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/issues/5#issuecomment-710025995.

A more practical way is to see the flags file as an ancillary QA file accompanying each product rather than a system-wide DB for internal use during processing. So, one Lx flag file per Lx observations file, telling which of those observations to ignore in any subsequent processing to Lx+1. This year's station_A L0 obs. file will have a matching L0 flags file, which is used when producing the L0M obs and the L0M flags file, and so on. The flag files for higher levels will still grow, but those are not supposed to be flagged manually so it doesn't matter. This ties the flags file to the level it refers to, keeps the flags file small for the lower processing levels, and updating a flag only impacts flag files for higher levels, not some other line in one global flag file.

Keeping separate files also makes it possible to manage all of our stations together while allowing us to share flags for public data but not for confidential/commercial stations.

mankoff commented 3 years ago

My idea was that this is only for manually added flags. It should grow by a few lines per station per year; the order of magnitude is 100 lines per year if we have 5 manual flags per station and 20 stations. At a minimum, each station visit adds 1 line. I agree this may become too large within a few years, and because it is unlikely that anything is ever flagged across multiple stations (or even multiple files), it makes sense to have one flag DB per station or per file.

OOL is a flag, but should never be entered manually. OOL is defined per variable (in the variables DB) and flagged by the code, not by humans.
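
As a sketch of what such an automated OOL check could look like (assuming the variables DB provides lower/upper physical limits per variable; the lo/hi column names and the variables.csv file name are placeholders):

import numpy as np
import pandas as pd

def flag_ool(df, variables_db='variables.csv'):
    """Set out-of-limits samples to NaN using per-variable limits."""
    limits = pd.read_csv(variables_db, index_col='field')
    for var in df.columns:
        if var not in limits.index:
            continue
        lo, hi = limits.loc[var, ['lo', 'hi']]
        bad = (df[var] < lo) | (df[var] > hi)
        df.loc[bad, var] = np.nan  # or record an OOL flag instead of overwriting
    return df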

If there should be a flag DB per file, perhaps the flags should be part of the header? Then we don't need to track the file (it is in the file). We don't need to track the station (it's just the file). We only need to define t0, t1, and variable for each flag. What do you think of this (see flag lines at bottom of header)?



# station_id      = EGP
# field_delimiter = ,
# nodata          = -999
# timezone        = 0
#
# hygroclip_t_offset = 0       # degrees C
# dsr_eng_coef       = 12.71   # from manufacturer to convert from eng units (1E-5 V) to  physical units (W m-2)
# usr_eng_coef       = 12.71   # from manufacturer to convert from eng units (1E-5 V) to  physical units (W m-2)
# dlr_eng_coef       = 12.71   # from manufacturer to convert from eng units (1E-5 V) to  physical units (W m-2)
# ulr_eng_coef       = 12.71   # from manufacturer to convert from eng units (1E-5 V) to  physical units (W m-2)
# pt_z_coef          = 0.359       # Pressure transducer calibration coefficient [m water @ 20 C] (TBD: T @ calib?)
# pt_z_p_coef        = 42     # Air pressure at which the pt was calibrated [hPa]
# pt_z_factor        = 24     # Unitless. Scale for logger voltage measurement range (2.5 for CR1000, 1 for CR10X)
# pt_antifreeze      = 100      # Percent antifreeze in the pressure transducer
# boom_azimuth       = 18.5       # degrees
#
# fields = time, rec, min_y, p, t_1, t_2, rh, wspd, wdir, wd_std, dsr, usr, dlr, ulr, t_rad, z_boom, z_boom_q, z_stake, z_stake_q, z_pt, t_i_1, t_i_2, t_i_3, t_i_4, t_i_5, t_i_6, t_i_7, t_i_8, tilt_x, tilt_y, gps_time, gps_lat, gps_lon, gps_alt, gps_geoid, gps_geounit, gps_q, gps_numsat, gps_hdop, t_log, fan_dc, batt_v_ss, batt_v
#
# flag1 = 2017-05-23 10:00:00, 2017-06-10 11:00:00, *, NAN # First 44 rows suspicious. Includes sensors not at station?
# flag2 = n/a, n/a, z_pt, NAN                                                    # Not processed per v3 metadata.csv files
# flag3 = 2012-06-01, 2015-10-12, CHECKME                        # rotated by ~180 degrees. See  https://user-images.githubusercontent.com/145117/92995442-99d11e00-f4b8-11ea-9498-7fc6b05e5efa.png
# flag4 = 2015-10-12, 2015-12-14, VISIT
# flag5 = 2016-05-28, 2015-05-29, VISIT
# flag6 = 2017-07-10 11:33:40, 2017-07-10 18:55:12, VISIT
# flag7 = 2018-08-11 13:55:00, 2018-08-11 16:00:00, VISIT
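
A rough sketch of how such flag lines could be pulled out of the header (the helper read_header_flags is hypothetical; lines without a variable field, e.g. the VISIT entries, are treated as applying to all variables):

import re
import pandas as pd

def read_header_flags(header_file):
    """Collect '# flagN = t0, t1[, variable], FLAG [# comment]' lines into a table."""
    rows = []
    pattern = re.compile(r'^#\s*flag\d+\s*=\s*(.*)$')
    with open(header_file) as f:
        for line in f:
            m = pattern.match(line)
            if not m:
                continue
            body = m.group(1).split('#')[0]       # drop the trailing comment
            fields = [s.strip() for s in body.split(',')]
            if len(fields) == 3:                  # no variable given -> all variables
                fields.insert(2, '*')
            t0, t1, variable, flag = fields[:4]
            rows.append({'t0': t0, 't1': t1, 'variable': variable, 'flag': flag})
    return pd.DataFrame(rows)
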
ghost commented 3 years ago

Having the flags in the header works when there are only a few flags, but there are sensors like the sonic rangers that sometimes produce a mix of good and bad (but not OOL) readings that can only be dealt with by manually flagging line by line, unless we have a flag for 'unreliable period' and let the users deal with it (they won't; it's sometimes hard even for us to interpret). These bad readings could be a significant fraction, like 50 %, over months. That's a lot to fit in the header.

Agreed, OOL is only ever flagged by the code. I was thinking of those cases where the code may have to flag lots and lots of individual measurements rather than a few long intervals, and also of those cases where a change of a manual flag triggers a different outcome of the OOL check. Anyway, if we are going for separate files, these won't be a problem.

BaptisteVandecrux commented 3 years ago

What appears in the header must be what is useful for the data user. Not just the very final user, but the user of that specific file, at that specific level of processing.

If this file is from a low-level product, it is mainly going to be inspected by GEUS people for diagnostics. Then having many flags is not an issue and the "all-in-one-file" approach may ease the diagnostics.

At higher processing levels, flags should only contain information that will help the user. For the removals/corrections that are validated by the GEUS team, the user does not need to know all of them when reading the data. They could be made available in a separate file so that users spotting an odd value would have a first place to look.

The final, public-oriented data files should be as readable as possible and report only minimal flags.

We could even have a minimalistic final, public-oriented file with limited flags, as Dirk and Robert suggested, and then a more complete file containing flags for scientific purposes. How does that sound?

BaptisteVandecrux commented 3 years ago

https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/tree/main/flags

mankoff commented 3 years ago

https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/tree/main/flags

Nice. Thank you. Are all these UNKNOWN or CHECKME - perhaps there is a reason and we can add a flag explaining it? Also, per https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/blob/main/flags.org the format has improved. 00:00:00 is not required.

BaptisteVandecrux commented 3 years ago

we can add a flag explaining it

I haven't taken note of why I discarded each period. But I'll plot these periods out to document the failures and maybe find out what went wrong.

00:00:00 is not required.

but it can still be added right?

mankoff commented 3 years ago

we can add a flag explaining it

I haven't taken note of why I discarded each period. But I'll plot these periods out to document the failures and maybe find out what went wrong.

OK we should change it all to CHECKME then. At some point anything that is flagged CHECKME will be written out with associated graphs so that we can move it from CHECKME to a more considered flag, which may be UNKNOWN, but could be something else (e.g. LOW_VOLTAGE).

00:00:00 is not required.

but it can still be added right?

Yes it's fine to keep it.

This reply sent via email... curious how it formats in GitHub.

mankoff commented 3 years ago

Q: Should the flags DBs and the headers (now small text files that are separate from the L0 data, not prepended on top of the data) be in the AWS data repository (https://github.com/GEUS-PROMICE/AWS-data), rather than in the PROMICE-AWS-processing repository?

I can picture a scenario where the code is stable, but we'd be pushing git commits regularly to this code repository because of data: as we flag values, update data, and add new header files.

Alternatively, the data repository needs to be updated regularly anyway, both because existing files grow as transmissions arrive and because new files arrive hand-carried from the field. These new and changing files involve regular git commits to https://github.com/GEUS-PROMICE/AWS-data. Because new headers go with that new data (more than with the code), and flags are currently based on the raw data, perhaps the headers and flags should be moved to that repository.

BaptisteVandecrux commented 3 years ago

Q: Should the flags DBs and the headers (now small text files that are separate from the L0 data, not prepended on top of the data) be in the AWS data repository (https://github.com/GEUS-PROMICE/AWS-data), rather than in the PROMICE-AWS-processing repository?

Not sure I understand the question, but the flag file should be in the same folder as the data, or in a "qc" subfolder. I'm in favour of separating scripts from data.

mankoff commented 3 years ago

Agreed - flags with data, not code. In that case I need to move my flags DB, and the two recent commits (https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/commit/c6588775bb78d86b7e626ce68c6f9ce2e3f67059 and https://github.com/GEUS-PROMICE/PROMICE-AWS-processing/commit/ff5d3007bba790af09979c73f186fa3873660f55) should be removed?

BaptisteVandecrux commented 3 years ago

I moved the flags folder to https://github.com/GEUS-PROMICE/AWS-data/tree/main/flags

BaptisteVandecrux commented 3 years ago

First attempt at reading flag files and removing suspicious data. It could also be used to define an extra column in the AWS file containing a good/bad flag for each time step.

Using https://github.com/GEUS-PROMICE/PROMICE-AWS-toolbox :

Does that look reasonable?

Minimal code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pytz
import os
import difflib as difflib

def load_promice(path_promice):
    '''
    Loading PROMICE data for a given path into a DataFrame.
    + adding time index
    + calculating albedo
    + (optional) calculate RH with regard to water

    INPUTS:
        path_promice: Path to the desired file containing PROMICE data [string]

    OUTPUTS:
        df: Dataframe containing PROMICE data for the desired settings [DataFrame]
    '''

    df = pd.read_csv(path_promice,delim_whitespace=True)
    df['time'] = df.Year * np.nan

    df['time'] = [datetime.datetime(y,m,d,h).replace(tzinfo=pytz.UTC) for y,m,d,h in zip(df['Year'].values,  df['MonthOfYear'].values, df['DayOfMonth'].values, df['HourOfDay(UTC)'].values)]
    df.set_index('time',inplace=True,drop=False)

    #set invalid values (-999) to nan 
    df[df==-999.0]=np.nan
    df['Albedo'] = df['ShortwaveRadiationUp(W/m2)'] / df['ShortwaveRadiationDown(W/m2)']
    df.loc[df['Albedo']>1,'Albedo']=np.nan
    df.loc[df['Albedo']<0,'Albedo']=np.nan

    # df['RelativeHumidity_w'] = RH_ice2water(df['RelativeHumidity(%)'] ,
    #                                                    df['AirTemperature(C)'])

    return df

def remove_flagged_data(df, site, var_list = ['all'], plot = True):
    '''
    Replace data within specified variables, between specified dates, by NaN.
    Reads from file "metadata/flags/<site>.csv".

    INPUTS:
        df: PROMICE data with time index
        site: string of PROMICE site
        var_list: list of the variables for which data removal should be
            conducted (default: all)
        plot: whether data removal should be plotted

    OUTPUTS:
        df_out: Dataframe with the flagged data replaced by NaN [DataFrame]
    '''    
    df_out = df.copy()
    if not os.path.isfile('metadata/flags/'+site+'.csv'):
        print('No erroneous data listed for '+site)
        return df

    flag_data = pd.read_csv('metadata/flags/'+site+'.csv')

    if var_list[0]=='all':
        var_list =  np.unique(flag_data.variable)

    print('Deleting flagged data:')
    for var in var_list:
        if var not in df_out.columns :
            var_new = difflib.get_close_matches(var, df_out.columns, n=1)
            if not var_new:
                print('Warning: '+var+' in erroneous data file but not in PROMICE dataframe')
                continue
            else:
                print('Warning: interpreting '+var+' as '+var_new[0])
                var = var_new[0]

        if plot:
            fig = plt.figure(figsize = (15,10))
            df[var].plot(color = 'red',label='bad data')

        for t0, t1 in zip(pd.to_datetime(flag_data.loc[flag_data.variable==var].t0), 
                               pd.to_datetime(flag_data.loc[flag_data.variable==var].t1)):
            print(t0, t1, var)
            df_out.loc[t0:t1, var] = np.NaN

        if plot:
            df_out[var].plot(label='good data',color='green' )
            plt.title(site)
            plt.xlabel('Year')
            plt.ylabel(var)
            var_save = var
            for c in ['(', ')', '/']:
                var_save=var_save.replace(c,'')
            var_save=var_save.replace('%','Perc')

            fig.savefig('figures/'+site+'_'+var_save+'_data_removed.png',dpi=70)
    return df_out

try:
    os.mkdir('figures')
    os.mkdir('out')
except FileExistsError:
    print('figures and output folders already exist')

path_to_PROMICE = 'C:/Users/bav/OneDrive - Geological survey of Denmark and Greenland/Code/AWS_Processing/Input/PROMICE/'

#load PROMICE dataset for a given station, all available years
PROMICE_stations = [('EGP',(75.6247,-35.9748), 2660), 
                    ('KAN_B',(67.1252,-50.1832), 350), 
                    ('KAN_L',(67.0955,-35.9748), 670), 
                   ('KAN_M',(67.0670,-48.8355), 1270), 
                   ('KAN_U',(67.0003,-47.0253), 1840), 
                   ('KPC_L',(79.9108,-24.0828), 370),
                   ('KPC_U',(79.8347,-25.1662), 870), 
                   ('MIT',(65.6922,-37.8280), 440), 
                   ('NUK_K',(64.1623,-51.3587), 710), 
                   ('NUK_L',(64.4822,-49.5358), 530),
                   ('NUK_U',(64.5108,-49.2692), 1120),
                   ('QAS_L',(61.0308,-46.8493), 280),
                   ('QAS_M',(61.0998,-46.8330), 630), 
                   ('QAS_U',(61.1753,-46.8195), 900), 
                   ('SCO_L',(72.2230,-26.8182), 460),
                   ('SCO_U',(72.3933,-27.2333), 970),
                   ('TAS_A',(65.7790,-38.8995), 890),
                   ('TAS_L',(65.6402,-38.8987), 250),
                   ('THU_L',(76.3998,-68.2665), 570),
                   ('THU_U',(76.4197,-68.1463), 760),
                   ('UPE_L',(72.8932,-54.2955), 220), 
                   ('UPE_U',(72.8878,-53.5783), 940)]

for ws in PROMICE_stations:
    site = ws[0]
    print(site)
    df = load_promice(path_to_PROMICE+site+'_hour_v03.txt')
    df_v4 = remove_flagged_data(df, site)
    df_v4.fillna(-999).to_csv('out/'+site+'_hour_v03_L3.txt', sep="\t")   
BaptisteVandecrux commented 3 years ago

I would like to get some feedback about the flagging files:

* How do we call them?
* Where are they stored?
* Is their format OK?
* Are they being handled smartly by the flagging script?

Current state of https://github.com/GEUS-PROMICE/PROMICE-AWS-toolbox :

remove_flagged_data

[to do: flag instead of remove]

Illustration:

This function reads the station-specific error files metadata/flags/<site>.csv, where the erroneous periods are reported for each variable.

These error files have the following structure:

| t0 | t1 | variable | flag | comment | URL_graphic |
| -- | -- | -- | -- | -- | -- |
| 2017-05-23 10:00:00 | 2017-06-10 11:00:00 | DepthPressureTransducer_Cor | CHECKME | manually flagged by bav | https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_error_data.png |
| ... | ... | ... | ... | ... | ... |

with

| field | meaning |
| -- | -- |
| t0 | ISO date of the beginning of the flagged period |
| t1 | ISO date of the end of the flagged period |
| variable | name of the variable to be flagged [to do: '*' for all variables] |
| flag | short flag abbreviation: CHECKME, UNKNOWN, NAN, OOL, VISIT |
| comment | description of the issue |
| URL_graphic | URL to an illustration or a GitHub issue thread |

The file is comma-separated:

t0,t1,variable,flag,comment,URL_graphic
2012-07-19T00:00:00+00:00,2012-07-30T00:00:00+00:00,SnowHeight(m),CHECKME,manually flagged by bav,https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_error_data.png
2012-07-19T00:00:00+00:00,2012-07-21T00:00:00+00:00,DepthPressureTransducer_Cor(m),CHECKME,manually flagged by bav,https://github.com/GEUS-PROMICE/AWS-data/blob/main/flags/graphics/KPC_L_error_data.png
...

The function remove_flagged_data then removes these flagged data from the dataframe.
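
A possible sketch of the "flag instead of remove" variant mentioned above, assuming the same metadata/flags/<site>.csv layout; the <var>_flag column naming is only illustrative:

import pandas as pd

def flag_data(df, site, flag_dir='metadata/flags'):
    """Keep the data but add a per-variable flag column for flagged periods."""
    flag_data = pd.read_csv(flag_dir + '/' + site + '.csv', parse_dates=['t0', 't1'])
    for var in flag_data.variable.unique():
        if var not in df.columns:
            continue
        flag_col = var + '_flag'
        df[flag_col] = 'GOOD'
        sel = flag_data.variable == var
        for t0, t1, flag in zip(flag_data.loc[sel, 't0'],
                                flag_data.loc[sel, 't1'],
                                flag_data.loc[sel, 'flag']):
            df.loc[t0:t1, flag_col] = flag  # e.g. CHECKME, UNKNOWN, VISIT
    return df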

mankoff commented 3 years ago

* How do we call them?

* Where are they stored?

* Is their format OK?

* Are they being handled smartly by the flagging script?

We're still in an active development phase, so anything here is subject to change. The names and locations are trivial to change and likely will change at some point (there is no metadata folder in the current workflow).

The format is a bit more work to change, but it may not have to. I've used this format for a while and it seems to work fine for me. If it works for you too (for now), then let's keep it. We should clarify that if t1 is only a date and not a time, a time of 00:00:00.00 is assumed, so the end date would not be included in the flagged data.
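
A quick illustration of that convention (purely an example, not project code):

import pandas as pd

# a bare date parses to midnight, so a label-based slice up to t1 stops at 00:00
t1 = pd.to_datetime('2015-12-14')                    # -> 2015-12-14 00:00:00
idx = pd.date_range('2015-12-13', '2015-12-15', freq='6h')
print(pd.Series(1, index=idx).loc[:t1])              # samples later on 2015-12-14 are kept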

BaptisteVandecrux commented 3 years ago

Glad to see that you also continued with this format. Can you point to the part of your code that creates the flag field and assigns its value based on the file content? I'm sure you do it in a smarter way than me!

mankoff commented 3 years ago

I think this is roughly the same as your implementation above, but it a) operates on xarray Datasets rather than pandas DataFrames and b) has no plotting code.

import pathlib

import numpy as np
import pandas as pd

def flag_NAN(ds):
    flag_file = "./data/flags/" + ds.attrs["station_id"] + ".csv"

    if not pathlib.Path(flag_file).is_file(): return ds # no flag file

    df = pd.read_csv(flag_file, parse_dates=[0,1], comment="#")\
           .dropna(how='all', axis='rows')

    # check format of flags.csv. Either both or neither of t0 and t1 must be defined.
    assert ((df['t0'].isna().astype(int) + df['t1'].isna().astype(int)) % 2).sum() == 0
    # for now we only process the NAN flag
    df = df[df['flag'] == "NAN"]
    if df.shape[0] == 0: return ds

    for i in df.index:
        t0, t1, avar = df.loc[i,['t0','t1','variable']]
        # set to all vars if var is "*"
        varlist = avar.split() if avar != '*' else list(ds.variables)
        if 'time' in varlist: varlist.remove("time")
        # set to all times if times are "n/a"
        if pd.isnull(t0): t0, t1 = ds['time'].values[[0,-1]]
        for v in varlist:
            ds[v] = ds[v].where((ds['time'] < t0) | (ds['time'] > t1))

        # TODO: Mark these values in the ds_flags dataset using perhaps flag_LUT.loc["NAN"]['value']

    return ds
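
Hypothetical usage, assuming a Dataset with a 'time' coordinate and a station_id attribute matching a file in ./data/flags/ (the file name below is illustrative only):

import xarray as xr

ds = xr.open_dataset('EGP_L1.nc')      # illustrative file name
ds.attrs['station_id'] = 'EGP'
ds = flag_NAN(ds)
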
PennyHow commented 1 year ago

Note this issue is closely related to #7 - Flag out of limits (OOL) rather than set to NaN

BaptisteVandecrux commented 1 year ago

A first implementation has these files hosted at: https://github.com/GEUS-Glaciology-and-Climate/PROMICE-AWS-data-issues

The reasoning is that they are:

Drawbacks:

PennyHow commented 1 year ago

Nice! I'll take a look at the branch shortly - if you feel you have taken this far enough then you are welcome to open a PR and have it officially reviewed and implemented.

With regards to the drawbacks, I think there will be some way we can incorporate the download into pypromice so that the user does not have to download the .csv files themselves.

We already do this with the AWS L3 data from the Dataverse (in pypromice.get), where we can import a hosted .csv file straight to pandas from URL:

df = pd.read_csv(URL, delimiter='\s+', header=0) 
BaptisteVandecrux commented 1 year ago

In the first implementation I am downloading the files locally (wherever pypromice is being used). So if the AWS-data-issue repository is not accessible or if pypromice is run offline, then the latest local csv files can be used.
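
A sketch of that download-with-fallback pattern (the raw URL layout and the cache folder are assumptions, not the actual implementation):

import os
import pandas as pd

def get_flags(station, cache_dir='local_flags'):
    """Refresh the station's flag file from the data-issues repo, else use the local copy."""
    url = ('https://raw.githubusercontent.com/GEUS-Glaciology-and-Climate/'
           'PROMICE-AWS-data-issues/main/' + station + '.csv')  # path within the repo assumed
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, station + '.csv')
    try:
        flags = pd.read_csv(url)
        flags.to_csv(local, index=False)   # refresh the local cache
    except Exception:
        if not os.path.isfile(local):
            return None                    # neither online nor cached copy available
        flags = pd.read_csv(local)         # fall back to the last cached copy
    return flags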

I'll start a PR so that it gets the conversation started.

PennyHow commented 1 year ago

That sounds like a good offline solution.

Yes, this will be easier to review in a PR. Then we can coordinate on the code line by line together.

BaptisteVandecrux commented 1 year ago

First implementation submitted as PR.

Remaining points that will need to be addressed:

BaptisteVandecrux commented 9 months ago

As mentioned in #19, flags and adjustment CSVs are working quite well. I'm closing this one now. Separate issues can be opened for the choice of the flags (NAN, CHECKME...) or for the plotting routines if needed.