OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Implementation details of indicator correlation functions from pws-pyqc #6

Open cchwala opened 5 months ago

cchwala commented 5 months ago

For the FZ filter and the HI filter the functions are pretty simple, just taking two time series as input and providing one time series as output. For the indicator correlation the current functions (see here) have more complex (but not complicated) input. Since we are not yet sure if and how we fully want to rely on xarray.Dataset as data model, i.e. to always supply the Dataset as function input so that metadata is automatically available based on the naming conventions of WG1, it is not yet clear how we want to, or have to, reshape these functions.

My current preference would be to have functions that take two main inputs: one time series of the PWS data to flag, and one or multiple time series from the neighboring sensors or from the reference. This is easier to experiment with and, if implemented correctly, can easily scale to a large number of PWSs. However, if multiple time series (from the reference or neighbors) are to be supplied as input, we should require the usage of an xarray.DataArray or a pandas.DataFrame, since there is the danger of getting the order of dimensions wrong when using a 2D numpy array. IMO we should wait until we have an example dataset and example workflow implemented. Then we can try different approaches in a hands-on manner.
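
To make that interface idea a bit more concrete, here is a rough, hypothetical sketch (function name and signature are placeholders, not the pypwsqc API) of a one-vs-many function where named columns carry the neighbor identity, which avoids the dimension-order problem of a plain 2D array. The binarization is a simplified quantile-based variant of the indicator correlation code shown further below.

```python
# Hypothetical interface sketch, not the pypwsqc API. Uses a simplified
# quantile-based binarization to illustrate the one-vs-many pattern.
import numpy as np
import pandas as pd


def indicator_correlation_one_vs_many(pws_series, neighbor_df, prob=0.99):
    """Return one indicator correlation value per neighbor column.

    pws_series : pandas.Series, time series of the PWS to flag
    neighbor_df : pandas.DataFrame, one column per neighboring sensor or reference
    """
    result = {}
    for col in neighbor_df.columns:
        both_valid = pws_series.notna() & neighbor_df[col].notna()
        if both_valid.sum() == 0:
            result[col] = np.nan
            continue
        a = pws_series[both_valid].to_numpy()
        b = neighbor_df[col][both_valid].to_numpy()
        # binarize both series at their own `prob` quantile and correlate the 0/1 series
        a_ind = (a >= np.quantile(a, prob)).astype(float)
        b_ind = (b >= np.quantile(b, prob)).astype(float)
        result[col] = np.corrcoef(a_ind, b_ind)[0, 1]
    return pd.Series(result)
```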

cchwala commented 5 months ago

Today @JochenSeidel and I refactored the existing code for the indicator correlation that was used in the training school. The result is in this notebook, below the heading "Test indicator correlation code".

Here is the relevant code:

```python
import numpy as np


def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """
    Calculate indicator correlation between two datasets

    Parameters
    ----------
    a_dataset: first data vector
    b_dataset: second data vector
    prob: percentile threshold

    Returns
    ----------
    indicator correlation value

    Raises
    ----------
    """
    if len(a_dataset.shape) != 1:
        raise ValueError('`a_dataset` has to be a 1D numpy.ndarray')
    if a_dataset.shape != b_dataset.shape:
        raise ValueError('`a_dataset` and `b_dataset` have to have the same shape')

    a_dataset = np.copy(a_dataset)
    b_dataset = np.copy(b_dataset)

    both_not_nan = ~np.isnan(a_dataset) & ~np.isnan(b_dataset)
    if exclude_nan:
        a_dataset = a_dataset[both_not_nan]
        b_dataset = b_dataset[both_not_nan]

    if min_valid_overlap is not None:
        if sum(both_not_nan) < min_valid_overlap:
            return np.nan
    else:
        if sum(both_not_nan) == 0:
            raise ValueError('No overlapping data. Define `min_valid_overlap` to return NaN in such cases.')

    # Get index at quantile threshold `prob`
    ix = int(a_dataset.shape[0] * prob)

    # Set values below quantile threshold `prob` to 0
    # and above to 1
    a_sort = np.sort(a_dataset)
    b_sort = np.sort(b_dataset)
    a_dataset[a_dataset < a_sort[ix]] = 0
    b_dataset[b_dataset < b_sort[ix]] = 0
    a_dataset[a_dataset > 0] = 1
    b_dataset[b_dataset > 0] = 1

    # Calculate correlation of 0 and 1 time series
    cc = np.corrcoef(a_dataset, b_dataset)[0, 1]

    return cc
```

```python
import scipy.spatial
import tqdm


def calc_indic_corr_all_stns_new(
    da_a,
    da_b,
    max_distance=50000,  # this is in meters, assuming the projection units are also meters
    prob=0.99,
    exclude_nan=True,
    min_valid_overlap=None,
):
    """
    Indicator correlation between reference and test stations

    return: indicator correlation and distance values
    """
    xy_a = list(zip(da_a.x.data, da_a.y.data))
    xy_b = list(zip(da_b.x.data, da_b.y.data))

    dist_mtx = scipy.spatial.distance.cdist(xy_a, xy_b, metric='euclidean')

    list_corr = []
    list_dist = []
    for i in tqdm.tqdm(range(len(xy_a) - 1)):
        for j in range(i + 1, len(xy_b)):
            # check if distance between stations is less than max_distance
            if dist_mtx[i, j] < max_distance:
                indi_corr = calc_indicator_correlation(
                    da_a.isel(id=i).data,
                    da_b.isel(id=j).data,
                    prob=prob,
                    exclude_nan=exclude_nan,
                    min_valid_overlap=min_valid_overlap,
                )
                list_dist.append(dist_mtx[i, j])
                list_corr.append(indi_corr)

    dist_vals = np.asarray(list_dist)
    corr_vals = np.asarray(list_corr)

    return dist_vals, corr_vals
```
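
For reference, a minimal synthetic call of the second function could look like the sketch below. The dataset construction and the variable/coordinate names (`rainfall`, `id`, `time`, `x`, `y`) are my assumptions loosely following the WG1 naming conventions mentioned above, not code from the notebook.

```python
# Minimal synthetic usage sketch (not from the notebook); names are assumptions.
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)
n_stations, n_times = 5, 1000

ds = xr.Dataset(
    data_vars={
        "rainfall": (("id", "time"), rng.gamma(0.1, 1.0, size=(n_stations, n_times))),
    },
    coords={
        "id": [f"pws_{i}" for i in range(n_stations)],
        "time": np.arange(n_times),
        "x": ("id", rng.uniform(0, 20_000, n_stations)),  # projected coordinates in meters
        "y": ("id", rng.uniform(0, 20_000, n_stations)),
    },
)

dist_vals, corr_vals = calc_indic_corr_all_stns_new(
    ds.rainfall,
    ds.rainfall,
    max_distance=50_000,
    prob=0.99,
    min_valid_overlap=100,
)
```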

We tested this code in the notebook with the Amsterdam PWS data that was used in the training school and, from visual comparison with the plot from the training school, the resulting plot with distance vs. indicator correlation looked good. The code should still be checked and, of course, tests are required. But in general, the current structure is IMO the one that we should implement in pypwsqc.

cchwala commented 2 months ago

Notes from discussion today

```python
def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """
    Calculate indicator correlation between two datasets

    Parameters
    ----------
    a_dataset: first data vector
    b_dataset: second data vector
    prob: percentile threshold
    """
```

```python
dist_mtx = scipy.spatial.distance.cdist(xy_a, xy_b, metric='euclidean')
```

This sparse distance calculation should be the default behavior of poligrain.spatial.calc_point_to_point_distances.
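
As a side note, one possible way to do such a sparse distance calculation with scipy's `cKDTree` is sketched below. This is only an illustration, not the actual implementation behind `poligrain.spatial.calc_point_to_point_distances`, and the function name is simply the placeholder used in the notes that follow.

```python
# Illustration only: one possible way to build a sparse distance matrix with scipy.
from scipy.spatial import cKDTree


def sparse_distance_calculation(xy_a, xy_b, max_distance):
    """Return a sparse (dok) matrix that only stores distances below `max_distance`."""
    tree_a = cKDTree(xy_a)
    tree_b = cKDTree(xy_b)
    # station pairs farther apart than max_distance are simply not stored
    return tree_a.sparse_distance_matrix(tree_b, max_distance)
```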


```python
dist_mtx = sparse_distance_calculation(....)

# note that I do not like the variable names yet
corr_mtx = copy_of_sparse_matrix_or_empty_sparse_matrix(....)

for i in tqdm.tqdm(range(len(xy_a) - 1)):
    for j in range(i + 1, len(xy_b)):
        # Maybe find a better way to only calculate for populated entries in the sparse matrix
        if dist_mtx[i, j] < max_distance:
            indi_corr = calc_indicator_correlation(
                da_a.isel(id=i).data,
                da_b.isel(id=j).data,
                prob=prob,
                exclude_nan=exclude_nan,
                min_valid_overlap=min_valid_overlap,
            )
            # fill the matrix instead of appending to lists
            corr_mtx[i, j] = indi_corr

return dist_mtx, corr_mtx
```
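
Reading these notes together, the inner loop could for example be driven by the populated entries of the sparse distance matrix instead of checking all i, j pairs. The sketch below is only my interpretation of the notes, not an agreed implementation; it reuses `calc_indicator_correlation` from above and the `sparse_distance_calculation` placeholder, which is assumed to return a scipy sparse matrix.

```python
# Sketch only: fill sparse matrices for exactly the station pairs that are
# stored in the sparse distance matrix.
import scipy.sparse


def calc_indic_corr_all_stns_sparse(
    da_a, da_b, max_distance=50000, prob=0.99, exclude_nan=True, min_valid_overlap=None
):
    xy_a = list(zip(da_a.x.data, da_a.y.data))
    xy_b = list(zip(da_b.x.data, da_b.y.data))

    # placeholder from the notes above, assumed to return a scipy sparse matrix
    dist_mtx = sparse_distance_calculation(xy_a, xy_b, max_distance)

    # iterate only over the populated entries of the sparse distance matrix
    dist_coo = scipy.sparse.coo_matrix(dist_mtx)
    corr_mtx = scipy.sparse.lil_matrix(dist_coo.shape)
    for i, j in zip(dist_coo.row, dist_coo.col):
        corr_mtx[i, j] = calc_indicator_correlation(
            da_a.isel(id=int(i)).data,
            da_b.isel(id=int(j)).data,
            prob=prob,
            exclude_nan=exclude_nan,
            min_valid_overlap=min_valid_overlap,
        )
    return dist_mtx, corr_mtx.tocsr()
```

One caveat: depending on the sparse format, pairs with an indicator correlation of exactly zero may not end up as explicitly stored entries, so a real implementation would need to decide how to distinguish "computed but zero" from "not computed".
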
cchwala commented 1 month ago

Info from @JochenSeidel regarding computation time:

--> 11 hours computation time

...pretty slow