Clustering.cluster: handle events that can only be compared indirectly

flixha commented 2 years ago

dist_mat can contain NaNs in the following case: Template 1: traces STA1.HHZ, STA2.HHZ Template 2: trace STA1.HHZ Template 3: trace STA2.HHZ Here template 1 & 2 and 1 & 3 can be correlated, but 2 & 3 cannot be correlated because no traces match between them. But template 2 and 3 can be compared indirectly via template 1. When the distance-matrix contains NaNs, squareform in L287 always causes an error.

What does this PR do?

Replaces NaNs in dist_mat with the average of the column / row so that NaN-links between single event pairs do not make cluster() fail.

Why was it initiated? Any relevant Issues?

Clustering did not work as expected when station setups changed too much (e.g., multiple temporary deployments where only some events have overlapping station coverage from permanent stations).

any thoughts on whether it's better to use mean / median / minimum to replace the NaN-values to least affect the clustering?
should this handling of indirectly linked events be an option rather than the default?

PR Checklist

[X] develop base branch selected?
[X] This PR is not directly related to an existing issue (which has no PR yet).
[X] All tests still pass.
[X] Any new features or fixed regressions are be covered via new tests.
[X Any new or changed features have are fully documented.
[X] Significant changes have been added to CHANGES.md.

calum-chamberlain commented 2 years ago

I would be cautious of replacing the NaNs with real values as it may provide clusters where there are no clusters. From a skeptical perspective I would rather replace NaNs with ones (with one being the maximum distance possible I think). This would reduce the ability to cluster if clustering is not based on the minimum distance between events.

I think if we are changing this, then we should make it an option for the user to set, and also let them set how the clusters are made (e.g. with the method kwarg which I have been meaning to expose anyway). I would keep the default to put NaNs in the matrix, but raise a useful error message telling the user to consider how they want to replace NaNs and the linkage method - we could also require user input at that stage if NaNs are found.

I'm reticent to allow this to make clusters where there are no links without explicitly telling the user that something funky might happen (and people don't read warnings unless something breaks, so we should make it break, or require user input to make sure that the user thinks about what they are doing).

flixha commented 2 years ago

Thanks for the quick feedback. I have now made some changes so that:

nan-replacement is optional
an error with an explanation is thrown when there's nans in the dist_mat
user can choose value for nan-replacement, with a suggestion to use 1
kwargs can be supplied to clustering() to control the method, metric, and sort_order of linkage as you suggested

It's probably reasonable to replace nans with ones to avoid clustering where there is no real clustering. A lot is taken care of by excluding events where no traces match with any other events. Now that we expose the linkage methods and metrics, I feel it's good if the user can control whether to replace with e.g., 1, mean, min etc., so that this can be adjusted properly to the method and metric that the user uses for linkage-calculation.

Let me know if you have other suggestions for this and if you feel this handling is useful and makes sense like this.

Some tests are failing due the an obspy-numpy conflict, easiest may be to wait for the obspy-update. All tests pass locally with master-obspy.

calum-chamberlain commented 2 years ago

That looks great @flixha, thank you!

Good point on the obspy numpy issues - if you are happy to wait for this to be merged until the obspy release then that might be simplest, otherwise I can faff around with pinning versions until then,

flixha commented 2 years ago

@calum-chamberlain thanks! No worries, let's wait for Obspy; no need to do the extra work when the solution is around the corner.

calum-chamberlain commented 2 years ago

@flixha this is now running obspy 1.3.0, and it looks like there is a test fail in comparing clusters. when you have the chance would you be able to have a look at that test?

flixha commented 2 years ago

@flixha this is now running obspy 1.3.0, and it looks like there is a test fail in comparing clusters. when you have the chance would you be able to have a look at that test?

Hi Calum, of course; sorry for being a bit slow to answer. I'll check this tomorrow..

flixha commented 2 years ago

HI @calum-chamberlain , finally found out what caused the problem. That took longer than expected... Anyways, the problem was that because of the hacky way in which I remove a single trace from each stream, the order of the streams actually mattered. And that order depends on the order in which the files appear in glob. Now I order the template streams and then it worked on all the systems I tested. Waiting for tests on here now.

flixha commented 2 years ago

Thanks for having a look; Yes, this is good to go!

eqcorrscan / EQcorrscan