OpenSenseAction / pypwsqc

Python package for quality control (QC) of data from personal weather stations (PWS)
https://pypwsqc.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Implementation details of indicator correlation functions from pws-pyqc #6

Open cchwala opened 5 months ago

cchwala commented 5 months ago

For the FZ filter and the HI filter the functions are pretty simple, just taking two time series as input and providing one time series as output. For the indicator correlation the current functions (see here) have more complex (but not complicated) input. Since we are not yet sure if and how we fully want to rely on xarray.Dataset as data model, i.e. to always supply the Dataset as function input so that metadata is automatically available based on the naming conventions of WG1, it is not yet clear how we want to, or have to, reshape these functions.

My current preference would be to have functions that take two main inputs: one time series of the PWS data to flag, and one or multiple time series from the neighboring sensors or from the reference. This is easier to experiment with and, if implemented correctly, can easily scale to a large number of PWSs. However, if multiple time series (from the reference or neighbors) are to be supplied as input, we should require the usage of an xarray.DataArray or a pandas.DataFrame, since there is the danger of getting the order of dimensions wrong when using a 2D numpy array. IMO we should wait until we have an example dataset and example workflow implemented. Then we can try different approaches in a hands-on manner.
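
To make that interface idea a bit more concrete, here is a rough, hypothetical sketch (function name and signature are placeholders, not the pypwsqc API) of a one-vs-many function where named columns carry the neighbor identity, which avoids the dimension-order problem of a plain 2D array. The binarization is a simplified quantile-based variant of the indicator correlation code shown further below.

```python
# Hypothetical interface sketch, not the pypwsqc API. Uses a simplified
# quantile-based binarization to illustrate the one-vs-many pattern.
import numpy as np
import pandas as pd


def indicator_correlation_one_vs_many(pws_series, neighbor_df, prob=0.99):
    """Return one indicator correlation value per neighbor column.

    pws_series : pandas.Series, time series of the PWS to flag
    neighbor_df : pandas.DataFrame, one column per neighboring sensor or reference
    """
    result = {}
    for col in neighbor_df.columns:
        both_valid = pws_series.notna() & neighbor_df[col].notna()
        if both_valid.sum() == 0:
            result[col] = np.nan
            continue
        a = pws_series[both_valid].to_numpy()
        b = neighbor_df[col][both_valid].to_numpy()
        # binarize both series at their own `prob` quantile and correlate the 0/1 series
        a_ind = (a >= np.quantile(a, prob)).astype(float)
        b_ind = (b >= np.quantile(b, prob)).astype(float)
        result[col] = np.corrcoef(a_ind, b_ind)[0, 1]
    return pd.Series(result)
```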

cchwala commented 5 months ago

Today @JochenSeidel and I refactored the existing code for the indicator correlation that was used in the training school. The result is in this notebook, below the heading "Test indicator correlation code".

Here is the relevant code:

```python
import numpy as np


def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """
    Calculate indicator correlation between two datasets

    Parameters
    ----------
    a_dataset: first data vector
    b_dataset: second data vector
    prob: percentile threshold

    Returns
    ----------
    indicator correlation value

    Raises
    ----------
    """
    if len(a_dataset.shape) != 1:
        raise ValueError('`a_dataset` has to be a 1D numpy.ndarray')
    if a_dataset.shape != b_dataset.shape:
        raise ValueError('`a_dataset` and `b_dataset` have to have the same shape')

    a_dataset = np.copy(a_dataset)
    b_dataset = np.copy(b_dataset)

    both_not_nan = ~np.isnan(a_dataset) & ~np.isnan(b_dataset)
    if exclude_nan:
        a_dataset = a_dataset[both_not_nan]
        b_dataset = b_dataset[both_not_nan]

    if min_valid_overlap is not None:
        if sum(both_not_nan) < min_valid_overlap:
            return np.nan
    else:
        if sum(both_not_nan) == 0:
            raise ValueError('No overlapping data. Define `min_valid_overlap` to return NaN in such cases.')

    # Get index at quantile threshold `prob`
    ix = int(a_dataset.shape[0] * prob)

    # Set values below quantile threshold `prob` to 0
    # and above to 1
    a_sort = np.sort(a_dataset)
    b_sort = np.sort(b_dataset)
    a_dataset[a_dataset < a_sort[ix]] = 0
    b_dataset[b_dataset < b_sort[ix]] = 0
    a_dataset[a_dataset > 0] = 1
    b_dataset[b_dataset > 0] = 1

    # Calculate correlation of 0 and 1 time series
    cc = np.corrcoef(a_dataset, b_dataset)[0, 1]

    return cc
```

```python
import scipy.spatial
import tqdm


def calc_indic_corr_all_stns_new(
    da_a,
    da_b,
    max_distance=50000,  # this is in meters, assuming the projection units are also meters
    prob=0.99,
    exclude_nan=True,
    min_valid_overlap=None,
):
    """
    Indicator correlation between reference and test stations

    return: indicator correlation and distance values
    """
    xy_a = list(zip(da_a.x.data, da_a.y.data))
    xy_b = list(zip(da_b.x.data, da_b.y.data))

    dist_mtx = scipy.spatial.distance.cdist(xy_a, xy_b, metric='euclidean')

    list_corr = []
    list_dist = []
    for i in tqdm.tqdm(range(len(xy_a) - 1)):
        for j in range(i + 1, len(xy_b)):
            # check if distance between stations is less than max_distance
            if dist_mtx[i, j] < max_distance:
                indi_corr = calc_indicator_correlation(
                    da_a.isel(id=i).data,
                    da_b.isel(id=j).data,
                    prob=prob,
                    exclude_nan=exclude_nan,
                    min_valid_overlap=min_valid_overlap,
                )
                list_dist.append(dist_mtx[i, j])
                list_corr.append(indi_corr)

    dist_vals = np.asarray(list_dist)
    corr_vals = np.asarray(list_corr)

    return dist_vals, corr_vals
```
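
For reference, a minimal synthetic call of the second function could look like the sketch below. The dataset construction and the variable/coordinate names (`rainfall`, `id`, `time`, `x`, `y`) are my assumptions loosely following the WG1 naming conventions mentioned above, not code from the notebook.

```python
# Minimal synthetic usage sketch (not from the notebook); names are assumptions.
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)
n_stations, n_times = 5, 1000

ds = xr.Dataset(
    data_vars={
        "rainfall": (("id", "time"), rng.gamma(0.1, 1.0, size=(n_stations, n_times))),
    },
    coords={
        "id": [f"pws_{i}" for i in range(n_stations)],
        "time": np.arange(n_times),
        "x": ("id", rng.uniform(0, 20_000, n_stations)),  # projected coordinates in meters
        "y": ("id", rng.uniform(0, 20_000, n_stations)),
    },
)

dist_vals, corr_vals = calc_indic_corr_all_stns_new(
    ds.rainfall,
    ds.rainfall,
    max_distance=50_000,
    prob=0.99,
    min_valid_overlap=100,
)
```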

We tested this code in the notebook with the Amsterdam PWS data that was used in the training school and, from visual comparison with the plot from the training school, the resulting plot with distance vs. indicator correlation looked good. The code should still be checked and, of course, tests are required. But in general, the current structure is IMO the one that we should implement in pypwsqc.

cchwala commented 2 months ago

Notes from discussion today

```python
def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """
    Calculate indicator correlation between two datasets

    Parameters
    ----------
    a_dataset: first data vector
    b_dataset: second data vector
    prob: percentile threshold
    """
```

```python
dist_mtx = scipy.spatial.distance.cdist(xy_a, xy_b, metric='euclidean')
```

This sparse distance calculation should be the default behavior of poligrain.spatial.calc_point_to_point_distances.
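
As a side note, one possible way to do such a sparse distance calculation with scipy's `cKDTree` is sketched below. This is only an illustration, not the actual implementation behind `poligrain.spatial.calc_point_to_point_distances`, and the function name is simply the placeholder used in the notes that follow.

```python
# Illustration only: one possible way to build a sparse distance matrix with scipy.
from scipy.spatial import cKDTree


def sparse_distance_calculation(xy_a, xy_b, max_distance):
    """Return a sparse (dok) matrix that only stores distances below `max_distance`."""
    tree_a = cKDTree(xy_a)
    tree_b = cKDTree(xy_b)
    # station pairs farther apart than max_distance are simply not stored
    return tree_a.sparse_distance_matrix(tree_b, max_distance)
```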


```python
dist_mtx = sparse_distance_calculation(....)

# note that I do not like the variable names yet
corr_mtx = copy_of_sparse_matrix_or_empty_sparse_matrix(....)

for i in tqdm.tqdm(range(len(xy_a) - 1)):
    for j in range(i + 1, len(xy_b)):
        # Maybe find a better way to only calculate for populated entries in the sparse matrix
        if dist_mtx[i, j] < max_distance:
            indi_corr = calc_indicator_correlation(
                da_a.isel(id=i).data,
                da_b.isel(id=j).data,
                prob=prob,
                exclude_nan=exclude_nan,
                min_valid_overlap=min_valid_overlap,
            )
            # fill the matrix instead of appending to lists
            corr_mtx[i, j] = indi_corr

return dist_mtx, corr_mtx
```
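
Reading these notes together, the inner loop could for example be driven by the populated entries of the sparse distance matrix instead of checking all i, j pairs. The sketch below is only my interpretation of the notes, not an agreed implementation; it reuses `calc_indicator_correlation` from above and the `sparse_distance_calculation` placeholder, which is assumed to return a scipy sparse matrix.

```python
# Sketch only: fill sparse matrices for exactly the station pairs that are
# stored in the sparse distance matrix.
import scipy.sparse


def calc_indic_corr_all_stns_sparse(
    da_a, da_b, max_distance=50000, prob=0.99, exclude_nan=True, min_valid_overlap=None
):
    xy_a = list(zip(da_a.x.data, da_a.y.data))
    xy_b = list(zip(da_b.x.data, da_b.y.data))

    # placeholder from the notes above, assumed to return a scipy sparse matrix
    dist_mtx = sparse_distance_calculation(xy_a, xy_b, max_distance)

    # iterate only over the populated entries of the sparse distance matrix
    dist_coo = scipy.sparse.coo_matrix(dist_mtx)
    corr_mtx = scipy.sparse.lil_matrix(dist_coo.shape)
    for i, j in zip(dist_coo.row, dist_coo.col):
        corr_mtx[i, j] = calc_indicator_correlation(
            da_a.isel(id=int(i)).data,
            da_b.isel(id=int(j)).data,
            prob=prob,
            exclude_nan=exclude_nan,
            min_valid_overlap=min_valid_overlap,
        )
    return dist_mtx, corr_mtx.tocsr()
```

One caveat: depending on the sparse format, pairs with an indicator correlation of exactly zero may not end up as explicitly stored entries, so a real implementation would need to decide how to distinguish "computed but zero" from "not computed".
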
cchwala commented 1 month ago

Info from @JochenSeidel regarding computation time:

--> 11 hours computation time

...pretty slow