cchwala opened 5 months ago
Today @JochenSeidel and I refactored the existing code for the indicator correlation that was used in the training school. The result is in this notebook, below the heading "Test indicator correlation code".
We tested this code in the notebook with the Amsterdam PWS data that was used in the training school, and from visual comparison with the plot from the training school, the resulting distance vs. indicator correlation plot looked good. The code should still be checked and, of course, tests are required. But in general, the current structure is IMO the one that we should implement in pypwsqc.
Notes from discussion today
```python
def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """
    Calculate the indicator correlation of two datasets.

    Parameters
    ----------
    a_dataset: first data vector
    b_dataset: second data vector
    prob: percentile threshold
    exclude_nan: whether to drop time steps where either series is NaN
    min_valid_overlap: minimum number of valid common time steps
    """
```
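The function body is not shown in the notes above. As a hedged sketch of what it could look like, assuming that "indicator correlation" means the Pearson correlation of the two quantile-exceedance indicator series, with `prob` as the per-series quantile threshold (that reading, and the NaN-handling details, are my assumptions, not the notebook's actual implementation):

```python
import numpy as np


def calc_indicator_correlation(a_dataset, b_dataset, prob, exclude_nan=True, min_valid_overlap=None):
    """Sketch: indicator correlation of two rainfall time series.

    Both series are converted to binary indicator series (1 where the value
    exceeds the per-series `prob` quantile, 0 otherwise); the Pearson
    correlation of the two indicator series is returned.
    """
    a = np.asarray(a_dataset, dtype=float)
    b = np.asarray(b_dataset, dtype=float)
    if exclude_nan:
        # keep only time steps where both series have valid data
        valid = ~np.isnan(a) & ~np.isnan(b)
        a, b = a[valid], b[valid]
    if min_valid_overlap is not None and len(a) < min_valid_overlap:
        return np.nan
    # indicator series: exceedance of the per-series quantile threshold
    ind_a = (a > np.quantile(a, prob)).astype(float)
    ind_b = (b > np.quantile(b, prob)).astype(float)
    if ind_a.std() == 0 or ind_b.std() == 0:
        return np.nan  # correlation undefined for a constant series
    return np.corrcoef(ind_a, ind_b)[0, 1]
```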
```python
dist_mtx = scipy.spatial.distance.cdist(xy_a, xy_b, metric='euclidean')
```
This sparse distance calculation should be the default behavior of `poligrain.spatial.calc_point_to_point_distances`.
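One way to get the sparse behavior described here would be scipy's `cKDTree`, which only returns distances below a cutoff. This is a sketch under my own assumptions, not the actual `poligrain` API, and `sparse_distance_calculation` is a placeholder name from the notes:

```python
import numpy as np
from scipy.spatial import cKDTree


def sparse_distance_calculation(xy_a, xy_b, max_distance):
    """Sketch: pairwise distances below max_distance as a sparse matrix.

    Unlike the dense cdist call above, pairs farther apart than
    max_distance are simply never computed or stored.
    """
    tree_a = cKDTree(xy_a)
    tree_b = cKDTree(xy_b)
    # returns a dok (dictionary-of-keys) sparse matrix with entries only
    # for point pairs closer than max_distance
    return tree_a.sparse_distance_matrix(tree_b, max_distance)
```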
```python
dist_mtx = sparse_distance_calculation(....)

# note that I do not like the variable names yet
corr_mtx = copy_of_sparse_matrix_or_empty_sparse_matrix(....)

for i in tqdm.tqdm(range(len(xy_a) - 1)):
    for j in range(i + 1, len(xy_b)):
        # Maybe find a better way to only calculate for populated
        # entries in the sparse matrix
        if dist_mtx[i, j] < max_distance:
            indi_corr = calc_indicator_correlation(
                da_a.isel(id=i).data,
                da_b.isel(id=j).data,
                prob=prob,
                exclude_nan=exclude_nan,
                min_valid_overlap=min_valid_overlap,
            )
            # TODO: fill the matrices instead of appending to lists
            list_dist.append(dist_mtx[i, j])
            list_corr.append(indi_corr)

return dist_mtx, corr_mtx
```
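The two open points in the comments above (fill sparse matrices instead of appending to lists, and only visit the populated entries of the sparse distance matrix) could be addressed roughly like this. All names and the inlined indicator-correlation helper are my assumptions, not the notebook's code:

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree


def _indicator_corr(a, b, prob):
    # assumed definition: Pearson correlation of the per-series
    # quantile-exceedance indicator series
    ind_a = (a > np.quantile(a, prob)).astype(float)
    ind_b = (b > np.quantile(b, prob)).astype(float)
    if ind_a.std() == 0 or ind_b.std() == 0:
        return np.nan
    return np.corrcoef(ind_a, ind_b)[0, 1]


def calc_indicator_corr_matrix(xy, data, max_distance, prob=0.99):
    """Sketch: indicator correlation for all pairs within max_distance.

    Iterates only over the populated entries of the sparse distance
    matrix (COO format) instead of the full i, j grid, and fills a
    sparse correlation matrix instead of appending to lists.
    """
    tree = cKDTree(xy)
    dist_mtx = tree.sparse_distance_matrix(tree, max_distance).tocoo()
    corr_mtx = sparse.lil_matrix(dist_mtx.shape)
    for i, j in zip(dist_mtx.row, dist_mtx.col):
        if i < j:  # upper triangle only; also skips self-pairs
            corr_mtx[i, j] = _indicator_corr(data[i], data[j], prob)
    return dist_mtx.tocsr(), corr_mtx.tocsr()
```

Whether this helps with the 11-hour runtime would need profiling; it at least avoids touching the empty part of the matrix.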
Info from @JochenSeidel regarding computation time: roughly 11 hours, so pretty slow.
For the FZ filter and the HI filter the functions are pretty simple, just taking two time series as input and providing one time series as output. For the indicator correlation, the current functions (see here) have more complex (but not complicated) input. Since we are not yet sure if and how we fully want to rely on `xarray.Dataset` as data model, i.e. to always supply the `Dataset` as function input so that metadata is automatically available based on the naming conventions of WG1, it is not yet clear how we want to, or have to, reshape these functions.

My current preference would be to have functions that take two main inputs: one time series of the PWS data to flag, and one or multiple time series from the neighboring sensors or from the reference. This is easier to experiment with and, if implemented correctly, can easily scale to a large number of PWSs. However, if multiple time series (from the reference or neighbors) are to be supplied as input, we should require the use of an `xarray.DataArray` or a `pandas.DataFrame`, since there is the danger of getting the order of dimensions wrong when using a 2D `numpy.array`. IMO we should wait until we have an example dataset and example workflow implemented. Then we can try different approaches in a hands-on manner.
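To illustrate the dimension-order danger with a minimal sketch (the station names and layout here are made up): with a plain 2D array the meaning of each axis lives only in the programmer's head, while a labeled `pandas.DataFrame` makes selecting a neighbor time series explicit by name.

```python
import numpy as np
import pandas as pd

# With a plain 2D numpy array it is ambiguous whether axis 0 is time or
# station id; a silently transposed array changes the meaning of every row.
arr = np.arange(6).reshape(3, 2)  # 3 time steps x 2 stations? or the reverse?

# A labeled container removes the ambiguity: columns are station ids,
# the index is time, and selection is by name rather than by position.
df = pd.DataFrame(
    arr,
    index=pd.RangeIndex(3, name="time"),
    columns=["pws_1", "pws_2"],
)
ts = df["pws_1"]  # unambiguous neighbor time series
```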