OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Discuss structure of modules, functions and data model #7

Open cchwala opened 7 months ago

cchwala commented 7 months ago

This issue should discuss the general structure of our Python modules, the functions, and our internal data model. This will be entangled with the discussion and decisions about the implementation details of the indicator correlation functions (see #6), but the general layout of the modules can already be discussed now.

It is not yet clear to me how to logically split up the code from PWSQC and pws-pyqc. Since both have a part that flags faulty periods and a part that does bias correction, it seems logical to have separate modules for flagging and bias correction. But it is also logical to keep the code from PWSQC and pws-pyqc closely together.

First draft of module structure

flagging.py <-- not sure if this is the correct English term for adding the flags to the PWS timeseries
|── fz_filter()
|── hi_filter() 
|── station_outlier_filter() <-- maybe needs a more descriptive name that indicates what method is used
|── indicator_correlation_filter() <-- not sure about this and the next one
|── calc_indicator_correlation()
|── ...

bias_correct.py
|── quantile_mapping() <-- use what is in pws-pyqc, but not yet sure about details
|── ... <-- something from PWSQC bias correction that is done in station-outlier-filter?
|── ...
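
To get a feeling for how this split could look from a user's perspective, here is a rough usage sketch. The function names come from the draft above, but their signatures, the file names, and the reference dataset for the bias correction are made-up placeholders.

import xarray as xr

from pypwsqc import bias_correct, flagging

# PWS data as xr.Dataset with dims (id, time); file names are placeholders
ds_pws = xr.open_dataset("pws_data.nc")
ds_reference = xr.open_dataset("reference_data.nc")  # e.g. gauge-adjusted radar

# flag suspicious periods (signatures are placeholders)
ds_pws["fz_flag"] = flagging.fz_filter(ds_pws)
ds_pws["hi_flag"] = flagging.hi_filter(ds_pws)

# bias-correct against the reference dataset
ds_pws_corrected = bias_correct.quantile_mapping(ds_pws, ds_reference)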

Note that we can reference the papers and the original codebase for each function in its docstring. Hence, we do not have to hint at the origin of the methods in their function names.
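
For example (just a sketch of the numpydoc convention, not a final docstring):

def fz_filter(ds_pws):
    """Flag faulty-zero periods in PWS rainfall time series.

    References
    ----------
    Based on the FZ filter from the original PWSQC publication and
    its R code base.
    """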

Also note that finding neighboring sensors shall be done with the implementation from poligrain, see https://github.com/OpenSenseAction/poligrain/issues/15

Data model

Since we fully embrace xarray and xarray.Dataset in poligrain, it seems logical to also rely on it here. I would, however, first do some experiments when example data and an example workflow are ready. If we can write simple functions that work with 1D time series, we could just pass np.arrays and would have much more generic code. We could still use xarray.Dataset for loading and handling the data, but when passing data to the functions we would not have to rely on it and could just use the underlying numpy.arrays. But, let's do some experiments first.
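
As a sketch of what this could look like (the function, file name, and variable names are hypothetical), the filter itself would only see a plain 1D array, while the xarray.Dataset is only the outer data container:

import numpy as np
import xarray as xr


def fz_filter_1d(rainfall, n_zeros_threshold=10):
    """Sketch of a filter that works on a plain 1D np.ndarray."""
    flags = np.zeros_like(rainfall, dtype=bool)
    # ... actual filter logic would go here ...
    return flags


# the xarray.Dataset is only used for loading and selecting the data
# (file name, variable name and station id are made up for this example)
ds_pws = xr.open_dataset("pws_data.nc")
rainfall_1d = ds_pws.rainfall.sel(id="station_42").data  # plain np.ndarray
fz_flags = fz_filter_1d(rainfall_1d)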

cchwala commented 5 months ago

Here is one idea for how the inputs of filters like faulty-zero or high-influx could look.


def some_filter_function(
    ds_to_check,
    ds_others,
    max_distance,
    min_n_of_valid_neighbors,
    parameter_1,
    parameter_2,
):
    """Bla bla...

    Parameters
    ----------
    ds_to_check:
        xr.Dataset with dimensions (id, time) of all time series that shall be flagged. This can
        also be just one time series.
    ds_others:
        xr.Dataset with dimensions (id, time) of the other time series from potential neighbors of
        the data in `ds_to_check`. This can be the same `xr.Dataset` as in `ds_to_check`, but it
        can also be a different one. Within this function, the fitting neighbors will be selected.

    Returns
    -------
    ds_flags:
        xr.Dataset with dimensions (id, time) holding the flags.
    """
    # loop over PWSs
    for pws_id in ds_to_check.id.data:
        # find neighbors within `max_distance` and exclude `pws_id` if it is there
        # ...
        for id_neighbor in list_of_neighbor_ids:
            # do something with pairs of time series
            ...
    # aggregate results to xr.Dataset with dim = (id, time)
    # ...
    return ds_flags

This way it can be applied to one full dataset of PWSs. But it can also be split up and applied to one PWS by selecting that one and passing it as ds_to_check while still passing all neighbors (which will be passed as a reference and not as a copy, hence there should not be a large computational penalty). With this approach it should be easy to apply the functions to a very large dataset in an embarrassingly parallel manner, e.g. on HPC or on a large workstation with many CPUs.
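
A minimal sketch of both usage patterns (assuming the signature above; joblib is just one of several options for the parallel case, and the parameter values are arbitrary):

import xarray as xr
from joblib import Parallel, delayed

# flag all PWSs in one call
ds_flags = some_filter_function(
    ds_to_check=ds_pws,
    ds_others=ds_pws,
    max_distance=10e3,
    min_n_of_valid_neighbors=5,
    parameter_1=0.1,
    parameter_2=0.5,
)

# or embarrassingly parallel, one PWS at a time, while still passing
# the full dataset as potential neighbors
ds_flags_list = Parallel(n_jobs=-1)(
    delayed(some_filter_function)(
        ds_to_check=ds_pws.sel(id=[pws_id]),
        ds_others=ds_pws,
        max_distance=10e3,
        min_n_of_valid_neighbors=5,
        parameter_1=0.1,
        parameter_2=0.5,
    )
    for pws_id in ds_pws.id.data
)
ds_flags = xr.concat(ds_flags_list, dim="id")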

We could also extend the function with optional kwargs (defaulting to None) to allow passing in pre-calculated data, like a distance matrix and n_valid_neighbors for the neighboring IDs of ds_to_check and ds_others, because that would speed things up when applying the function in an embarrassingly parallel manner.
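
A sketch of such an extended signature (the kwarg names are just suggestions):

def some_filter_function(
    ds_to_check,
    ds_others,
    max_distance,
    min_n_of_valid_neighbors,
    parameter_1,
    parameter_2,
    distance_matrix=None,  # optional pre-computed distances between ds_to_check and ds_others
    n_valid_neighbors=None,  # optional pre-computed number of valid neighbors per id
):
    if distance_matrix is None:
        # compute distances here, e.g. with the implementation from poligrain
        ...
    # ... rest of the filter as sketched above ...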