OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Discuss structure of modules, functions and data model #7

Open cchwala opened 7 months ago

cchwala commented 7 months ago

This issue should discuss the general structure of our Python modules, the functions, and our internal data model. This will be entangled with the discussion and decisions about the implementation details of the indicator correlation functions (see #6), but the general layout of the modules can already be discussed now.

It is not yet clear to me how to logically split up the code from PWSQC and pws-pyqc. Since both have a part that flags faulty periods and a part that does bias correction, it seems logical to have separate modules for flagging and bias correction. But it is also logical to keep the code from PWSQC and pws-pyqc closely together.

First draft of module structure

flagging.py <-- not sure if this is the correct English term for adding the flags to the PWS timeseries
|── fz_filter()
|── hi_filter() 
|── station_outlier_filter() <-- maybe needs a more descriptive name that indicates what method is used
|── indicator_correlation_filter() <-- not sure about this and the next one
|── calc_indicator_correlation()
|── ...

bias_correct.py
|── quantile_mapping() <-- use what is in pws-pyqc, but not yet sure about details
|── ... <-- something from PWSQC bias correction that is done in station-outlier-filter?
|── ...
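
To get a feeling for how this split could look from a user's perspective, here is a rough usage sketch. The function names come from the draft above, but their signatures, the file names, and the reference dataset for the bias correction are made-up placeholders.

import xarray as xr

from pypwsqc import bias_correct, flagging

# PWS data as xr.Dataset with dims (id, time); file names are placeholders
ds_pws = xr.open_dataset("pws_data.nc")
ds_reference = xr.open_dataset("reference_data.nc")  # e.g. gauge-adjusted radar

# flag suspicious periods (signatures are placeholders)
ds_pws["fz_flag"] = flagging.fz_filter(ds_pws)
ds_pws["hi_flag"] = flagging.hi_filter(ds_pws)

# bias-correct against the reference dataset
ds_pws_corrected = bias_correct.quantile_mapping(ds_pws, ds_reference)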

Note that we can reference the papers and the original codebase for each function in its docstring. Hence, we do not have to hint at the origin of the methods in their function names.
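
For example (just a sketch of the numpydoc convention, not a final docstring):

def fz_filter(ds_pws):
    """Flag faulty-zero periods in PWS rainfall time series.

    References
    ----------
    Based on the FZ filter from the original PWSQC publication and
    its R code base.
    """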

Also note that finding neighboring sensors shall be done with the implementation from poligrain, see https://github.com/OpenSenseAction/poligrain/issues/15

Data model

Since we fully embrace xarray and xarray.Dataset in poligrain, it seems logical to also rely on it here. I would, however, first do some experiments when example data and an example workflow are ready. If we can write simple functions that work with 1D time series, we could just pass np.arrays and would have much more generic code. We could still use xarray.Dataset for loading and handling the data, but when passing data to the functions we would not have to rely on it and could just use the underlying numpy.arrays. But, let's do some experiments first.
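
As a sketch of what this could look like (the function, file name, and variable names are hypothetical), the filter itself would only see a plain 1D array, while the xarray.Dataset is only the outer data container:

import numpy as np
import xarray as xr


def fz_filter_1d(rainfall, n_zeros_threshold=10):
    """Sketch of a filter that works on a plain 1D np.ndarray."""
    flags = np.zeros_like(rainfall, dtype=bool)
    # ... actual filter logic would go here ...
    return flags


# the xarray.Dataset is only used for loading and selecting the data
# (file name, variable name and station id are made up for this example)
ds_pws = xr.open_dataset("pws_data.nc")
rainfall_1d = ds_pws.rainfall.sel(id="station_42").data  # plain np.ndarray
fz_flags = fz_filter_1d(rainfall_1d)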

cchwala commented 5 months ago

Here is one idea for how the inputs of filters like faulty-zero or high-influx could look.


def some_filter_function(
    ds_to_check,
    ds_others,
    max_distance,
    min_n_of_valid_neighbors,
    parameter_1,
    parameter_2,
):
    """Bla bla...

    Parameters
    ----------
    ds_to_check:
        xr.Dataset with dimensions (id, time) of all time series that shall be flagged. This can
        also be just one time series.
    ds_others:
        xr.Dataset with dimensions (id, time) of the other time series from potential neighbors of
        the data in `ds_to_check`. This can be the same `xr.Dataset` as in `ds_to_check`, but it
        can also be a different one. Within this function, the fitting neighbors will be selected.

    Returns
    -------
    ds_flags:
        xr.Dataset with dimensions (id, time) holding the flags.
    """
    # loop over PWSs
    for pws_id in ds_to_check.id.data:
        # find neighbors within `max_distance` and exclude `pws_id` if it is there
        # ...
        for id_neighbor in list_of_neighbor_ids:
            # do something with pairs of time series
            ...
    # aggregate results to xr.Dataset with dim = (id, time)
    # ...
    return ds_flags

This way it can be applied to one full dataset of PWSs. But it can also be split up and applied to one PWS by selecting that one and passing it as ds_to_check while still passing all neighbors (which will be passed as a reference and not as a copy, hence there should not be a large computational penalty). With this approach it should be easy to apply the functions to a very large dataset in an embarrassingly parallel manner, e.g. on HPC or on a large workstation with many CPUs.
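
A minimal sketch of both usage patterns (assuming the signature above; joblib is just one of several options for the parallel case, and the parameter values are arbitrary):

import xarray as xr
from joblib import Parallel, delayed

# flag all PWSs in one call
ds_flags = some_filter_function(
    ds_to_check=ds_pws,
    ds_others=ds_pws,
    max_distance=10e3,
    min_n_of_valid_neighbors=5,
    parameter_1=0.1,
    parameter_2=0.5,
)

# or embarrassingly parallel, one PWS at a time, while still passing
# the full dataset as potential neighbors
ds_flags_list = Parallel(n_jobs=-1)(
    delayed(some_filter_function)(
        ds_to_check=ds_pws.sel(id=[pws_id]),
        ds_others=ds_pws,
        max_distance=10e3,
        min_n_of_valid_neighbors=5,
        parameter_1=0.1,
        parameter_2=0.5,
    )
    for pws_id in ds_pws.id.data
)
ds_flags = xr.concat(ds_flags_list, dim="id")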

We could also extend the function with optional kwargs (defaulting to None) to allow passing in pre-calculated data, like a distance matrix and n_valid_neighbors for the neighboring IDs of ds_to_check and ds_others, because that would speed things up when applying the function in an embarrassingly parallel manner.
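
A sketch of such an extended signature (the kwarg names are just suggestions):

def some_filter_function(
    ds_to_check,
    ds_others,
    max_distance,
    min_n_of_valid_neighbors,
    parameter_1,
    parameter_2,
    distance_matrix=None,  # optional pre-computed distances between ds_to_check and ds_others
    n_valid_neighbors=None,  # optional pre-computed number of valid neighbors per id
):
    if distance_matrix is None:
        # compute distances here, e.g. with the implementation from poligrain
        ...
    # ... rest of the filter as sketched above ...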