demiangomez / Parallel.GAMIT

Python wrapper to parallelize GAMIT executions
BSD 3-Clause "New" or "Revised" License

large subnetworks in pyNetwork module #101

Closed: espg closed this issue 3 weeks ago

espg commented 1 month ago

@demiangomez @eckendrick moving this discussion off of email and onto github so we can track it.

This is the original bug report from @eckendrick:

qmean = BisectingQMeans(min_clust_size=2, opt_clust_size=18, init='random', n_init=50, algorithm='lloyd', max_iter=8000, random_state=42, n_clusters=2)

I have seen these very large (>90) network sizes (just one large network per day) for our global processing for 7 out of 8 days when processing year 2022 days 001-008.

001/g2global.OSU62/monitor.log:2024-09-09 13:12:40 -> About to execute GAMIT - system count G: 94
002/g2global.OSU73/monitor.log:2024-09-09 13:13:33 -> About to execute GAMIT - system count G: 93
003/g2global.OSU73/monitor.log:2024-09-09 14:08:40 -> About to execute GAMIT - system count G: 96
004/g2global.OSU18/monitor.log:2024-09-09 14:17:34 -> About to execute GAMIT - system count G: 94
005/g2global.OSU74/monitor.log:2024-09-09 22:12:16 -> About to execute GAMIT - system count G: 94
006/g2global.OSU71/monitor.log:2024-09-09 22:32:35 -> About to execute GAMIT - system count G: 96
008/g2global.OSU93/monitor.log:2024-09-09 23:18:41 -> About to execute GAMIT - system count G: 96

My response on fixing this using min_clust_size=1:

I know that you're on a bit of a time crunch with getting things to work, so a temporary fix that will work immediately (and that I've tested) is to set the parameter min_clust_size equal to one (1). Doing this will make the opt_clust_size parameter a hard parameter, i.e., the algorithm will refuse to generate any clusters that have a membership larger than whatever opt_clust_size is set to.

This means that there will be some single stations split off in difficult cases-- but the subnetworks will still have a minimum size of 3 (and a size range of 3 to 11) when run in GAMIT, due to the over_cluster function call expanding for overlap.

I'd still like to investigate why those particular days are set up with a scenario that repeatedly has a single outlier station that fails to be split into a group > 1, and see if I can change the algorithm to more intelligently handle this edge case (i.e., forcing a multiway split instead of a binary partition)... but I'm also not sure how long it will take to do so.

@eckendrick 's follow-up testing:

I tried changing min_clust_size from 2 to 1, and kept opt_clust_size at 18

grep min_clust_size pyNetwork.py
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,

but my networks are still large. I'm currently testing for our Antarctic processing, so there are ~130 stations each day. Here are the sizes of the subnets for 2022 day 180

g2regain.OSU00/monitor.log:2024-09-11 21:23:52 -> About to execute GAMIT - system count G: 44
g2regain.OSU01/monitor.log:2024-09-11 21:22:51 -> About to execute GAMIT - system count G: 29
g2regain.OSU02/monitor.log:2024-09-11 21:22:32 -> About to execute GAMIT - system count G: 35
g2regain.OSU03/monitor.log:2024-09-11 21:21:18 -> About to execute GAMIT - system count G: 27
g2regain.OSU04/monitor.log:2024-09-11 21:19:29 -> About to execute GAMIT - system count G: 22
g2regain.OSU05/monitor.log:2024-09-11 21:21:15 -> About to execute GAMIT - system count G: 40
g2regain.OSU06/monitor.log:2024-09-11 21:19:06 -> About to execute GAMIT - system count G: 20
g2regain.OSU07/monitor.log:2024-09-11 21:20:18 -> About to execute GAMIT - system count G: 22
g2regain.OSU08/monitor.log:2024-09-11 21:20:24 -> About to execute GAMIT - system count G: 27
g2regain.OSU09/monitor.log:2024-09-11 21:20:07 -> About to execute GAMIT - system count G: 29
g2regain.OSU10/monitor.log:2024-09-11 21:21:43 -> About to execute GAMIT - system count G: 39
g2regain.OSU11/monitor.log:2024-09-11 21:19:23 -> About to execute GAMIT - system count G: 18
g2regain.OSU12/monitor.log:2024-09-11 21:19:51 -> About to execute GAMIT - system count G: 22
g2regain.OSU13/monitor.log:2024-09-11 21:20:25 -> About to execute GAMIT - system count G: 32
g2regain.OSU14/monitor.log:2024-09-11 21:20:41 -> About to execute GAMIT - system count G: 35
g2regain.OSU15/monitor.log:2024-09-11 21:20:22 -> About to execute GAMIT - system count G: 36
g2regain.OSU16/monitor.log:2024-09-11 21:21:55 -> About to execute GAMIT - system count G: 29

...and my thoughts from last week:

That's interesting, the testing I did earlier wasn't able to exceed the opt_cluster_size when min_cluster_size was set to 1... this is making me think it might actually be a bug in the over_cluster function. I didn't check passing the QMeans output to over_cluster, but it does get passed to it in our pipeline, and that function might be expanding more than expected for some reason.

Changing the over_cluster parameters might help right now, and I'll need to test that function too when I debug in more detail next week.

espg commented 1 month ago

I've had a chance to do some additional testing, and BisectingQMeans is behaving as expected and as discussed above. Here's a GAMIT run from 2022, day 002 that was giving @eckendrick trouble, using qmean = BisectingQMeans(min_clust_size=2, opt_clust_size=18); with these parameters I'm able to replicate the large station cluster:

mpl.pyplot.hist(clust.labels_, bins=clust.n_clusters)

[figure: histogram of cluster membership counts, showing the oversized cluster]

When I implement the solution that I suggested, setting qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18), the problem resolves, and we get the algorithmic guarantee described above (i.e., a hard limit on cluster size set by opt_clust_size=n when min_clust_size=1):

[figure: histogram of cluster membership counts, all below opt_clust_size]

There should still be a unit test added to enforce this behavior in the future-- but that unit test will pass with the current code base when it is added.
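
A minimal sketch of what such a unit test could look like (assuming the NetClusters import path and the (n_stations, 3) input shape used by the make_clusters code later in this thread):

import numpy as np
from NetClusters import BisectingQMeans

def test_opt_clust_size_caps_clusters_when_min_clust_size_is_one():
    # any synthetic 3-D point cloud works for checking the cap
    rng = np.random.default_rng(42)
    points = rng.normal(size=(130, 3))
    qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
                            init='random', n_init=50, algorithm='lloyd',
                            max_iter=8000, random_state=42, n_clusters=2)
    qmean.fit(points)
    # with min_clust_size=1, no cluster should reach opt_clust_size
    assert np.bincount(qmean.labels_).max() < 18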

espg commented 1 month ago

As mentioned in the first post above:

That's interesting, the testing I did earlier wasn't able to exceed the opt_cluster_size when min_cluster_size was set to 1... this is making me think it might actually be a bug in the over_cluster function. I didn't check passing the QMeans output to over_cluster, but it does get passed to it in our pipeline, and that function might be expanding more than expected for some reason.

...this appears to be what is happening. Here's the same output from the second figure after it is run through over_cluster:

>>> OC = over_cluster(clust.labels_, stations.T, metric='euclidean', neighborhood=5, overlap_points=2)
>>> np.sum(OC, axis=1)
array([12, 23, 10, 24, 22, 12, 22, 20, 16, 19, 22, 10, 21, 18, 23, 20, 12,
       20, 15, 20, 17, 19, 16, 23, 21, 15, 23, 23, 23, 16, 19, 17, 19, 17,
       25, 21, 14, 19, 21, 20, 17, 16, 18, 21, 18, 17, 22, 22, 19, 18, 23,
       22, 18, 20, 24, 15, 22, 16, 24, 19,  8, 20, 24, 21, 15, 20, 22, 19,
       25, 22, 14, 21, 21, 17, 16, 20, 16, 21, 20, 19, 19, 21, 15, 13, 23,
       14, 13, 18, 16, 19, 18, 15, 20, 14, 16, 22, 20, 17])

There are expected algorithmic guarantees from over_cluster; given that the neighborhood parameter sets how many adjacent clusters to include, and that overlap_points sets the max number of points per neighborhood to join, the output from over_cluster is capped at adding at most neighborhood * overlap_points stations to the BisectingQMeans clusters. For the above code, those parameters are 5 and 2 respectively, which means at most 10 additional points (stations) added to a guaranteed max of 17 (empirically, a max of 16 for this run). Our current code respects these limits; again, we need a unit test here, but when it is added, it will pass with the current codebase.
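
That unit test could look roughly like the following sketch; it assumes the rows of the over_cluster output are ordered by cluster label, matching the session above:

import numpy as np
from NetClusters import over_cluster

def check_over_cluster_cap(clust, points, neighborhood=5, overlap_points=2):
    OC = over_cluster(clust.labels_, points, metric='euclidean',
                      neighborhood=neighborhood,
                      overlap_points=overlap_points)
    expanded = np.sum(OC, axis=1)          # stations per expanded cluster
    original = np.bincount(clust.labels_)  # stations per input cluster
    # expansion is capped at neighborhood * overlap_points per cluster
    assert (expanded - original).max() <= neighborhood * overlap_points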

espg commented 1 month ago

So what's happening, and why are we getting large networks? One possibility is that there's an odd bug related to the termination condition of over_cluster. Right now the default parameter for neighborhood is 5, but @eckendrick is running the code on some sub-regions, so what happens when there are fewer than 5 neighborhoods?

The short answer is that I'm not sure, and we haven't tested for this case, so it might be our problem... or it might behave fine, and we might not actually have a problem at all, but just need to modify some of the parameters.

@eckendrick what is your current input into over_cluster? If the neighborhood and overlap_points are set to 5 and 2 respectively, the max addition is 10 stations... however, if they're both set to 5, then the max addition is 25. Can you check what those are set as?

As I mentioned last week--

Changing the over_cluster parameters might help right now, and I'll need to test that function too when I debug in more detail next week.

I don't know what the exact behavior of over_cluster is when neighborhood is greater than or equal to the total number of clusters out of BisectingQMeans, and will need to do some testing. However, I do know the behavior when the total number of clusters is greater than neighborhood: the code works. One thing that we can do while I write the unit tests and debug some of the edge-case behavior is to avoid the edge cases. You can do this by changing the over_cluster parameters, specifically setting neighborhood equal to 3. Given our cutoff between when we run make_clusters vs. when we don't, I expect that we'll always have at least 3 clusters, since we're clustering a minimum of 45 points. Note that we could in fact guarantee this behavior by setting min_clust_size = 1 and opt_clust_size = 15, as a 45-station submission to BisectingQMeans will definitionally require at least 3 clusters to be made.
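
For concreteness, that interim parameter combination would look like this (a sketch, reusing the call pattern from the make_clusters code later in this thread):

from NetClusters import BisectingQMeans, over_cluster

# interim workaround: hard-cap clusters at 15 stations, so a 45+ station
# submission always yields at least ceil(45 / 15) = 3 clusters...
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=15,
                        init='random', n_init=50, algorithm='lloyd',
                        max_iter=8000, random_state=42, n_clusters=2)
qmean.fit(points)  # 'points' is an (n_stations, 3) coordinate array
# ...which keeps neighborhood=3 at or below the number of clusters produced
OC = over_cluster(qmean.labels_, points, metric='euclidean',
                  neighborhood=3, overlap_points=2)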

demiangomez commented 1 month ago

We still have the network type parameter which is specified as global or regional. Maybe you can use this to set the parameters accordingly?

espg commented 1 month ago

That sounds reasonable; I'll look at adding a conditional parameter switch for that inside of pyNetwork when I code up the unit tests.
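
Roughly along these lines (a sketch only; 'net_type' stands in for the CFG parameter, whose actual name I'd need to look up):

# hypothetical conditional switch inside pyNetwork; parameter name assumed
if net_type == 'global':
    qmean_kwargs   = dict(min_clust_size=1, opt_clust_size=18)
    overlap_kwargs = dict(neighborhood=5, overlap_points=2)
else:  # regional runs produce fewer clusters, so expand less aggressively
    qmean_kwargs   = dict(min_clust_size=1, opt_clust_size=15)
    overlap_kwargs = dict(neighborhood=3, overlap_points=2)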

demiangomez commented 1 month ago

Ok, thanks. I forget the name of the parameter, but it is in the CFG for the GAMIT run. Eric can provide you with one.

espg commented 1 month ago

I need @eckendrick to clarify where and how his runs are diverging from expected behavior. From the unit tests, everything looks fine and within algorithmic expectations. Put another way, this output:

grep min_clust_size pyNetwork.py
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,

but my networks are still large. I'm currently testing for our Antarctic processing,
so there are ~130 stations each day. Here are the sizes of the subnets for 2022 day 180

g2regain.OSU00/monitor.log:2024-09-11 21:23:52 -> About to execute GAMIT - system count G: 44
g2regain.OSU01/monitor.log:2024-09-11 21:22:51 -> About to execute GAMIT - system count G: 29
g2regain.OSU02/monitor.log:2024-09-11 21:22:32 -> About to execute GAMIT - system count G: 35
g2regain.OSU03/monitor.log:2024-09-11 21:21:18 -> About to execute GAMIT - system count G: 27
g2regain.OSU04/monitor.log:2024-09-11 21:19:29 -> About to execute GAMIT - system count G: 22
g2regain.OSU05/monitor.log:2024-09-11 21:21:15 -> About to execute GAMIT - system count G: 40
g2regain.OSU06/monitor.log:2024-09-11 21:19:06 -> About to execute GAMIT - system count G: 20
g2regain.OSU07/monitor.log:2024-09-11 21:20:18 -> About to execute GAMIT - system count G: 22
g2regain.OSU08/monitor.log:2024-09-11 21:20:24 -> About to execute GAMIT - system count G: 27
g2regain.OSU09/monitor.log:2024-09-11 21:20:07 -> About to execute GAMIT - system count G: 29
g2regain.OSU10/monitor.log:2024-09-11 21:21:43 -> About to execute GAMIT - system count G: 39
g2regain.OSU11/monitor.log:2024-09-11 21:19:23 -> About to execute GAMIT - system count G: 18
g2regain.OSU12/monitor.log:2024-09-11 21:19:51 -> About to execute GAMIT - system count G: 22
g2regain.OSU13/monitor.log:2024-09-11 21:20:25 -> About to execute GAMIT - system count G: 32
g2regain.OSU14/monitor.log:2024-09-11 21:20:41 -> About to execute GAMIT - system count G: 35
g2regain.OSU15/monitor.log:2024-09-11 21:20:22 -> About to execute GAMIT - system count G: 36
g2regain.OSU16/monitor.log:2024-09-11 21:21:55 -> About to execute GAMIT - system count G: 29

...does not look problematic to me. There are no station clusters above 50. If we want smaller station clusters, the max size is bounded by this equation:

Max cluster size <= opt_clust_size + (neighborhood * overlap_points), when min_clust_size=1

With opt_clust_size & min_clust_size=1 being set in BisectingQMeans, and neighborhood & overlap_points being set in over_cluster.

I am still working on enhancements to the underlying BisectingQMeans algorithm, and will look at @demiangomez 's suggestion to change the default parameter settings that go into over_cluster depending on whether we are doing a regional or a global run. That said, this should not be a blocking issue for processing with ParallelGAMIT at the moment, given that all the algorithm parameters can be set to get the station cluster sizes that you want.

eckendrick commented 1 month ago

Hi Shane, I misunderstood your comment about the algorithm refusing to generate any clusters that have a membership larger than whatever opt_clust_size is set to:

"a temporary fix that will work immediately (and that I've tested) is to set the parameter min_clust_size equal to one (1). Doing this will make the opt_clust_size parameter a hard parameter, i.e., the algorithm will refuse to generate any clusters that have a membership larger than whatever opt_clust_size is set to."

Since I had opt_clust_size set to 18, I was expecting networks 01 and larger to have a maximum size of 18.

Eric

espg commented 1 month ago

@eckendrick can you tell me what your over_cluster parameter values for neighborhood and overlap_points were set to for those runs above?

If they were set to neighborhood=5 and overlap_points=2, then we still have a bug, since 18 + (5*2) = 28 and you have clusters in the 30s and 40s.

On the other hand, if the values are neighborhood=5 and overlap_points=5, then we might not have a bug (assuming that the OSU00 net doesn't get passed to make_clusters if it's a single cluster under the 45 station member network backbone threshold).

espg commented 1 month ago

We still have the network type parameter which is specified as global or regional. Maybe you can use this to set the parameters accordingly?

@demiangomez I've given this some more thought, and did something else in #104, which explicitly checks for when the number of clusters going into over_cluster is low (close to or equal to the neighborhood parameter), and adjusts the neighborhood parameter value downwards so it never exceeds the number of clusters that go into that function... so I don't think we'll need a case switch for regional vs. global networks now.
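
The gist of the #104 check is to clamp the requested neighborhood before expansion; a sketch (not the merged code):

import numpy as np

# never request more neighboring clusters than actually exist
n_clusters = len(np.unique(qmean.labels_))
neighborhood = min(neighborhood, n_clusters - 1)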

eckendrick commented 1 month ago

@espg my settings for neighborhood and overlap_points are neighborhood=5, overlap_points=2

espg commented 3 weeks ago

@demiangomez the large networks reported are not replicable in isolation when using the parameters that I recommended on Sept 10th and 11th-- as a reminder, this is what @eckendrick reported:

I tried changing min_clust_size from 2 to 1, and kept opt_clust_size at 18

grep min_clust_size pyNetwork.py
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,

but my networks are still large. I'm currently testing for our Antarctic processing, so there are ~130 stations each day. Here are the sizes of the subnets for 2022 day 180

g2regain.OSU00/monitor.log:2024-09-11 21:23:52 -> About to execute GAMIT - system count G: 44
g2regain.OSU01/monitor.log:2024-09-11 21:22:51 -> About to execute GAMIT - system count G: 29
g2regain.OSU02/monitor.log:2024-09-11 21:22:32 -> About to execute GAMIT - system count G: 35
g2regain.OSU03/monitor.log:2024-09-11 21:21:18 -> About to execute GAMIT - system count G: 27
g2regain.OSU04/monitor.log:2024-09-11 21:19:29 -> About to execute GAMIT - system count G: 22
g2regain.OSU05/monitor.log:2024-09-11 21:21:15 -> About to execute GAMIT - system count G: 40
g2regain.OSU06/monitor.log:2024-09-11 21:19:06 -> About to execute GAMIT - system count G: 20
g2regain.OSU07/monitor.log:2024-09-11 21:20:18 -> About to execute GAMIT - system count G: 22
g2regain.OSU08/monitor.log:2024-09-11 21:20:24 -> About to execute GAMIT - system count G: 27
g2regain.OSU09/monitor.log:2024-09-11 21:20:07 -> About to execute GAMIT - system count G: 29
g2regain.OSU10/monitor.log:2024-09-11 21:21:43 -> About to execute GAMIT - system count G: 39
g2regain.OSU11/monitor.log:2024-09-11 21:19:23 -> About to execute GAMIT - system count G: 18
g2regain.OSU12/monitor.log:2024-09-11 21:19:51 -> About to execute GAMIT - system count G: 22
g2regain.OSU13/monitor.log:2024-09-11 21:20:25 -> About to execute GAMIT - system count G: 32
g2regain.OSU14/monitor.log:2024-09-11 21:20:41 -> About to execute GAMIT - system count G: 35
g2regain.OSU15/monitor.log:2024-09-11 21:20:22 -> About to execute GAMIT - system count G: 36
g2regain.OSU16/monitor.log:2024-09-11 21:21:55 -> About to execute GAMIT - system count G: 29

Here's the code running the exact same data input on the make_clusters function from pyNetwork:

from NetClusters import BisectingQMeans, over_cluster, select_central_point
from pyStation import StationCollection
import pandas as pd
import numpy as np

# same input data
df = pd.read_csv('/Users/espg/Downloads/gamit_subnets_180_2022.csv')
x, y, z = df['X'].values, df['Y'].values, df['Z'].values
points = np.stack((x, y, z))  # shape (3, n_stations)
stations = list(zip(df['NetworkCode'], df['StationCode']))

# Same 'make_clusters' function, with minimal lines modified for output handling 
def make_clusters(points, stations):
    # Run initial clustering using bisecting 'q-means'
    # Function call matching @eckendrick parameters 
    qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
                            init='random', n_init=50, algorithm='lloyd',
                            max_iter=8000, random_state=42, n_clusters=2)
    qmean.fit(points)
    # snap centroids to closest station coordinate
    central_points = select_central_point(qmean.labels_, points,
                                          qmean.cluster_centers_)
    # expand the initial clusters to overlap stations with neighbors
    OC = over_cluster(qmean.labels_, points, metric='euclidean',
                      neighborhood=5, overlap_points=2)
    # calculate all 'tie' stations
    ties = np.where(np.sum(OC, axis=0) > 1)[0]
    # monotonic labels, compatible with previous data structure / api
    cluster_labels = []
    station_labels = []
    cluster_ties = []

    # init'ed outside of the loop for efficiency...
    stat_labs = np.array(stations) #stations.labels_array()
    for row, cluster in enumerate(OC):
        # Create 'station collections' for compat
        my_stations = [] #StationCollection()
        my_cluster_ties = [] #StationCollection()
        # strip out station id's per cluster...
        for station in stat_labs[cluster]:
            # rebuild as a 'station collection list'
            # my_stations.append(stations[str(station)])
            my_stations.append(station) # modified (above line) for testing 
        # append to a regular list for integer indexing at line ~400
        station_labels.append(my_stations)
        cluster_labels.append(np.ones((1, np.sum(cluster)),
                                       dtype=np.int_).squeeze()*row)
        # strip out station id's for tie points....
        for statn in stat_labs[ties[np.isin(ties, np.where(cluster)[0])]]:
            # rebuild as a 'station collection list'
            #my_cluster_ties.append(stations[str(statn)])
            my_cluster_ties.append(statn) # modified (above line) for testing
        # append to a regular list for integer indexing at line ~400
        cluster_ties.append(my_cluster_ties)

    # put everything in a dictionary
    clusters = {'centroids': points[central_points],
                'labels': cluster_labels,
                'stations': station_labels}

    return clusters, cluster_ties

a, b = make_clusters(points.T, stations)

These are the output clusters (after BisectingQmeans, over_cluster, and associated parsing):

>>> a['stations']

[[array(['igs', 'badg'], dtype='<U4'),
  array(['igs', 'cas1'], dtype='<U4'),
  array(['igs', 'coco'], dtype='<U4'),
  array(['igs', 'daej'], dtype='<U4'),
  array(['igs', 'darw'], dtype='<U4'),
  array(['igs', 'dumg'], dtype='<U4'),
  array(['igs', 'guam'], dtype='<U4'),
  array(['igs', 'hob2'], dtype='<U4'),
  array(['igs', 'hrao'], dtype='<U4'),
  array(['igs', 'iisc'], dtype='<U4'),
  array(['igs', 'kiru'], dtype='<U4'),
  array(['igs', 'mal2'], dtype='<U4'),
  array(['igs', 'mcil'], dtype='<U4'),
  array(['igs', 'mobs'], dtype='<U4'),
  array(['igs', 'nklg'], dtype='<U4'),
  array(['igs', 'pohn'], dtype='<U4'),
  array(['igs', 'pol2'], dtype='<U4'),
  array(['igs', 'reun'], dtype='<U4')],
 [array(['igs', 'chpi'], dtype='<U4'),
  array(['igs', 'ckis'], dtype='<U4'),
  array(['igs', 'cro1'], dtype='<U4'),
  array(['igs', 'drao'], dtype='<U4'),
  array(['igs', 'fair'], dtype='<U4'),
  array(['igs', 'falk'], dtype='<U4'),
  array(['igs', 'gode'], dtype='<U4'),
  array(['igs', 'holm'], dtype='<U4'),
  array(['igs', 'hrao'], dtype='<U4'),
  array(['igs', 'kiru'], dtype='<U4'),
  array(['igs', 'kokb'], dtype='<U4'),
  array(['igs', 'kour'], dtype='<U4'),
  array(['igs', 'mal2'], dtype='<U4'),
  array(['igs', 'mas1'], dtype='<U4'),
  array(['igs', 'mate'], dtype='<U4'),
  array(['igs', 'mkea'], dtype='<U4'),
  array(['igs', 'nklg'], dtype='<U4'),
  array(['igs', 'pie1'], dtype='<U4'),
  array(['igs', 'pol2'], dtype='<U4'),
  array(['igs', 'sthl'], dtype='<U4'),
  array(['igs', 'stjo'], dtype='<U4'),
  array(['igs', 'thti'], dtype='<U4'),
  array(['igs', 'vndp'], dtype='<U4')],
 [array(['igs', 'chpi'], dtype='<U4'),
  array(['igs', 'dav1'], dtype='<U4'),
  array(['igs', 'hrao'], dtype='<U4'),
  array(['igs', 'jctw'], dtype='<U4'),
  array(['igs', 'kerg'], dtype='<U4'),
  array(['igs', 'krgg'], dtype='<U4'),
  array(['igs', 'mal2'], dtype='<U4'),
  array(['igs', 'maw1'], dtype='<U4'),
  array(['igs', 'nklg'], dtype='<U4'),
  array(['igs', 'reun'], dtype='<U4'),
  array(['igs', 'simo'], dtype='<U4'),
  array(['igs', 'sthl'], dtype='<U4'),
  array(['igs', 'suth'], dtype='<U4'),
  array(['igs', 'syog'], dtype='<U4'),
  array(['igs', 'vesl'], dtype='<U4'),
  array(['tig', 'cztg'], dtype='<U4'),
  array(['tig', 'hnus'], dtype='<U4')],
 [array(['igs', 'bren'], dtype='<U4'),
  array(['igs', 'falk'], dtype='<U4'),
  array(['igs', 'fos1'], dtype='<U4'),
  array(['igs', 'frei'], dtype='<U4'),
  array(['igs', 'kepa'], dtype='<U4'),
  array(['igs', 'lply'], dtype='<U4'),
  array(['igs', 'ohi3'], dtype='<U4'),
  array(['igs', 'parc'], dtype='<U4'),
  array(['igs', 'rgdg'], dtype='<U4'),
  array(['igs', 'rio2'], dtype='<U4'),
  array(['igs', 'roth'], dtype='<U4'),
  array(['igs', 'sgp5'], dtype='<U4'),
  array(['igs', 'thur'], dtype='<U4'),
  array(['igs', 'vesl'], dtype='<U4')],
 [array(['ans', 'belg'], dtype='<U4'),
  array(['ant', 'smr5'], dtype='<U4'),
  array(['igs', 'bren'], dtype='<U4'),
  array(['igs', 'bsa1'], dtype='<U4'),
  array(['igs', 'fos1'], dtype='<U4'),
  array(['igs', 'hown'], dtype='<U4'),
  array(['igs', 'lply'], dtype='<U4'),
  array(['igs', 'rgdg'], dtype='<U4'),
  array(['igs', 'roth'], dtype='<U4'),
  array(['igs', 'sgp3'], dtype='<U4'),
  array(['igs', 'sgp4'], dtype='<U4'),
  array(['igs', 'sgp5'], dtype='<U4'),
  array(['igs', 'trve'], dtype='<U4'),
  array(['igs', 'wiln'], dtype='<U4'),
  array(['igs', 'wlch'], dtype='<U4')],
 [array(['ans', 'belg'], dtype='<U4'),
  array(['ant', 'sprz'], dtype='<U4'),
  array(['arg', 'mbio'], dtype='<U4'),
  array(['igs', 'bren'], dtype='<U4'),
  array(['igs', 'bsa1'], dtype='<U4'),
  array(['igs', 'capf'], dtype='<U4'),
  array(['igs', 'dupt'], dtype='<U4'),
  array(['igs', 'fos1'], dtype='<U4'),
  array(['igs', 'frei'], dtype='<U4'),
  array(['igs', 'hown'], dtype='<U4'),
  array(['igs', 'hugo'], dtype='<U4'),
  array(['igs', 'lply'], dtype='<U4'),
  array(['igs', 'ohi2'], dtype='<U4'),
  array(['igs', 'ohi3'], dtype='<U4'),
  array(['igs', 'pal2'], dtype='<U4'),
  array(['igs', 'palm'], dtype='<U4'),
  array(['igs', 'rgdg'], dtype='<U4'),
  array(['igs', 'rio2'], dtype='<U4'),
  array(['igs', 'robn'], dtype='<U4'),
  array(['igs', 'sgp1'], dtype='<U4'),
  array(['igs', 'sgp3'], dtype='<U4'),
  array(['igs', 'sgp4'], dtype='<U4'),
  array(['igs', 'sgp5'], dtype='<U4'),
  array(['igs', 'spgt'], dtype='<U4')],
 [array(['ant', 'sprz'], dtype='<U4'),
  array(['igs', 'chpi'], dtype='<U4'),
  array(['igs', 'cro1'], dtype='<U4'),
  array(['igs', 'falk'], dtype='<U4'),
  array(['igs', 'frei'], dtype='<U4'),
  array(['igs', 'kepa'], dtype='<U4'),
  array(['igs', 'kour'], dtype='<U4'),
  array(['igs', 'sgp5'], dtype='<U4'),
  array(['igs', 'sthl'], dtype='<U4')],
 [array(['igs', 'bean'], dtype='<U4'),
  array(['igs', 'bren'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'fos1'], dtype='<U4'),
  array(['igs', 'gmez'], dtype='<U4'),
  array(['igs', 'haag'], dtype='<U4'),
  array(['igs', 'hown'], dtype='<U4'),
  array(['igs', 'hton'], dtype='<U4'),
  array(['igs', 'jnsn'], dtype='<U4'),
  array(['igs', 'lntk'], dtype='<U4'),
  array(['igs', 'mtjn'], dtype='<U4'),
  array(['igs', 'pirt'], dtype='<U4'),
  array(['igs', 'sgp3'], dtype='<U4'),
  array(['igs', 'sgp4'], dtype='<U4'),
  array(['igs', 'trve'], dtype='<U4'),
  array(['igs', 'wiln'], dtype='<U4'),
  array(['igs', 'wlch'], dtype='<U4')],
 [array(['igs', 'berp'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'gldk'], dtype='<U4'),
  array(['igs', 'haag'], dtype='<U4'),
  array(['igs', 'hown'], dtype='<U4'),
  array(['igs', 'lntk'], dtype='<U4'),
  array(['igs', 'lply'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'mmrp'], dtype='<U4'),
  array(['igs', 'mrtp'], dtype='<U4'),
  array(['igs', 'mtak'], dtype='<U4'),
  array(['igs', 'mtjn'], dtype='<U4'),
  array(['igs', 'pirt'], dtype='<U4'),
  array(['igs', 'sgp4'], dtype='<U4'),
  array(['igs', 'sltr'], dtype='<U4'),
  array(['igs', 'thur'], dtype='<U4'),
  array(['igs', 'tomo'], dtype='<U4'),
  array(['igs', 'trve'], dtype='<U4'),
  array(['igs', 'wlch'], dtype='<U4'),
  array(['igs', 'wlrd'], dtype='<U4')],
 [array(['ans', 'belg'], dtype='<U4'),
  array(['igs', 'amu2'], dtype='<U4'),
  array(['igs', 'bean'], dtype='<U4'),
  array(['igs', 'benn'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'crdi'], dtype='<U4'),
  array(['igs', 'haag'], dtype='<U4'),
  array(['igs', 'hown'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'mtjn'], dtype='<U4'),
  array(['igs', 'pece'], dtype='<U4'),
  array(['igs', 'pirt'], dtype='<U4'),
  array(['igs', 'rmbo'], dtype='<U4'),
  array(['igs', 'stew'], dtype='<U4'),
  array(['igs', 'vesl'], dtype='<U4'),
  array(['igs', 'whtm'], dtype='<U4'),
  array(['igs', 'wiln'], dtype='<U4'),
  array(['igs', 'wlch'], dtype='<U4'),
  array(['igs', 'wlct'], dtype='<U4'),
  array(['igs', 'wlrd'], dtype='<U4')],
 [array(['ans', 'belg'], dtype='<U4'),
  array(['ant', 'zhon'], dtype='<U4'),
  array(['arg', 'mbio'], dtype='<U4'),
  array(['igs', 'bren'], dtype='<U4'),
  array(['igs', 'crdi'], dtype='<U4'),
  array(['igs', 'hton'], dtype='<U4'),
  array(['igs', 'maw1'], dtype='<U4'),
  array(['igs', 'syog'], dtype='<U4'),
  array(['igs', 'trve'], dtype='<U4'),
  array(['igs', 'vesl'], dtype='<U4'),
  array(['igs', 'wlch'], dtype='<U4')],
 [array(['ant', 'tnb1'], dtype='<U4'),
  array(['ant', 'tnb2'], dtype='<U4'),
  array(['igs', 'benn'], dtype='<U4'),
  array(['igs', 'cas1'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'dumg'], dtype='<U4'),
  array(['igs', 'mac1'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'vl01'], dtype='<U4'),
  array(['igs', 'vl12'], dtype='<U4'),
  array(['igs', 'vl30'], dtype='<U4')],
 [array(['ant', 'tnb1'], dtype='<U4'),
  array(['ant', 'tnb2'], dtype='<U4'),
  array(['igs', 'abbz'], dtype='<U4'),
  array(['igs', 'amu2'], dtype='<U4'),
  array(['igs', 'arht'], dtype='<U4'),
  array(['igs', 'benn'], dtype='<U4'),
  array(['igs', 'buri'], dtype='<U4'),
  array(['igs', 'cas1'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'con2'], dtype='<U4'),
  array(['igs', 'cote'], dtype='<U4'),
  array(['igs', 'flm5'], dtype='<U4'),
  array(['igs', 'ftp4'], dtype='<U4'),
  array(['igs', 'hog2'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'mcm4'], dtype='<U4'),
  array(['igs', 'mcmd'], dtype='<U4'),
  array(['igs', 'min0'], dtype='<U4'),
  array(['igs', 'mtak'], dtype='<U4'),
  array(['igs', 'phig'], dtype='<U4'),
  array(['igs', 'rob4'], dtype='<U4'),
  array(['igs', 'sctb'], dtype='<U4'),
  array(['igs', 'tomo'], dtype='<U4'),
  array(['igs', 'vl01'], dtype='<U4'),
  array(['igs', 'vl12'], dtype='<U4')],
 [array(['igs', 'benn'], dtype='<U4'),
  array(['igs', 'clrk'], dtype='<U4'),
  array(['igs', 'haag'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'min0'], dtype='<U4'),
  array(['igs', 'mtak'], dtype='<U4'),
  array(['igs', 'sctb'], dtype='<U4'),
  array(['igs', 'tomo'], dtype='<U4'),
  array(['igs', 'vl01'], dtype='<U4'),
  array(['igs', 'whtm'], dtype='<U4')],
 [array(['ant', 'zhon'], dtype='<U4'),
  array(['igs', 'amu2'], dtype='<U4'),
  array(['igs', 'buri'], dtype='<U4'),
  array(['igs', 'cas1'], dtype='<U4'),
  array(['igs', 'dav1'], dtype='<U4'),
  array(['igs', 'dumg'], dtype='<U4'),
  array(['igs', 'flm5'], dtype='<U4'),
  array(['igs', 'kerg'], dtype='<U4'),
  array(['igs', 'krgg'], dtype='<U4'),
  array(['igs', 'maw1'], dtype='<U4'),
  array(['igs', 'syog'], dtype='<U4'),
  array(['igs', 'vl30'], dtype='<U4')],
 [array(['ant', 'tnb1'], dtype='<U4'),
  array(['ant', 'tnb2'], dtype='<U4'),
  array(['igs', 'auck'], dtype='<U4'),
  array(['igs', 'cas1'], dtype='<U4'),
  array(['igs', 'ckis'], dtype='<U4'),
  array(['igs', 'darw'], dtype='<U4'),
  array(['igs', 'dumg'], dtype='<U4'),
  array(['igs', 'dund'], dtype='<U4'),
  array(['igs', 'hob2'], dtype='<U4'),
  array(['igs', 'mac1'], dtype='<U4'),
  array(['igs', 'mcar'], dtype='<U4'),
  array(['igs', 'mobs'], dtype='<U4'),
  array(['igs', 'mqzg'], dtype='<U4'),
  array(['igs', 'thti'], dtype='<U4'),
  array(['igs', 'vl30'], dtype='<U4')]]

These are the counts for membership of the 16 produced station clusters:

>>> for thing in a['stations']:
...    print(len(thing))

18
23
17
14
15
24
9
17
20
20
11
11
25
10
12
15
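
As a quick sanity check, those counts respect the combined limit discussed above, (opt_clust_size - 1) + neighborhood * overlap_points = 17 + 10 = 27:

>>> sizes = [len(cluster) for cluster in a['stations']]
>>> max(sizes)
25
>>> max(sizes) <= 27
True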

What does this mean?

There is no bug in make_clusters, nor was there one present in September:

  1. Algorithmic performance of BisectingQMeans is as expected when setting min_clust_size=1 and opt_clust_size=18:
    • Max algorithmic cluster possible is size of 17
    • Max cluster actually produced from input data was size of 16
  2. Algorithmic performance of over_cluster is as expected when setting (default) values of neighborhood=5 & overlap_points=2:
    • Max algorithmic cluster expansion possible is neighborhood * overlap_points = 10
    • Largest actual cluster expansion is 9 (i.e., max cluster of 25, minus max input cluster of 16)
  3. Other make_clusters calls seem to be working, as only the minimal lines have been changed from master:
diff test.py test2.py
1c1
< def make_clusters(points, stations):
---
> def make_clusters(self, points, stations, net_limit=NET_LIMIT):
3c3
<     qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
---
>     qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,
21c21
<     stat_labs = np.array(stations)
---
>     stat_labs = stations.labels_array()
24,25c24,25
<         my_stations = []
<         my_cluster_ties = []
---
>         my_stations = StationCollection()
>         my_cluster_ties = StationCollection()
29c29
<             my_stations.append(station)
---
>             my_stations.append(stations[str(station)])
37c37
<             my_cluster_ties.append(statn)
---
>             my_cluster_ties.append(stations[str(statn)])

So what's happening, why are there large subnetworks for 2022 day 180?

My guess is that these lines from pyNetwork are getting executed:

        # find out if this project-day has been processed before
        db_subnets = cnn.query_float('SELECT * FROM gamit_subnets '
                                     'WHERE "Project" = \'%s\' AND "Year" = %i AND '
                                     '"DOY" = %i ORDER BY "subnet"'
                                     % (self.name, date.year, date.doy), as_dict=True)

        stn_active = stations.get_active_stations(date)
        chk_active = check_stations.get_active_stations(date)

        if len(db_subnets) > 0:
            tqdm.write(' >> %s %s %s -> Processing already exists' % (datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                          self.name, date.yyyyddd()))

            # sub-network already exist, put information in lists
            dba_stn = [stn for net in db_subnets for stn in net['stations']]
            # DDG: deprecated, aliases are now fixed and kept constant
            # dba_alias = [alias for net in db_subnets for alias in net['alias']]

            # DDG: deprecated, aliases are now fixed and kept constant
            # make the necessary changes to the stations aliases (so they
            # match those in the database)
            # stations.replace_alias(dba_stn, dba_alias)

            # build the sub-networks using the information in the database
            clusters, backbone, ties = self.recover_subnets(db_subnets,
                                                            stn_active)

Why do I think this is happening? Because there's already an entry in the database for subnetworks on day 180 of 2022:

SELECT * FROM public.gamit_subnets where "DOY"='180' and "Year"='2022'

[screenshot: query result showing the 17 existing subnet rows for DOY 180, Year 2022]

Note that both the output from @eckendrick 's initial report and the public.gamit_subnets query list 17 clusters, while running make_clusters produces 16 total clusters for subnetworks.

What is the quick and lazy fix to get production runs up again?

Drop the rows from both public.gamit_subnets and public.gamit_soln where "DOY"='180' and "Year"='2022'; a sketch of that cleanup is below. However, we're likely to run into the same issue again on previous runs, so...
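
Concretely, the cleanup amounts to something like the following (mirroring the query above; double-check the filters before running this against the production database):

DELETE FROM public.gamit_subnets WHERE "DOY" = 180 AND "Year" = 2022;
DELETE FROM public.gamit_soln WHERE "DOY" = 180 AND "Year" = 2022;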

What are the actual fixes to this issue?

  1. Increase unit test coverage and testing of the pyNetwork module.
    • This will likely require modifying the module to make it more modular; for instance, make_clusters would be easier to test as a standalone function, instead of the class method it is currently integrated as.
    • The functions that the other logic branches call (backbone_delauney, recover_subnets, and add_missing_station) aren't tested, and probably need unit test coverage.
    • The entire __init__ logic of pyNetwork isn't tested, and likely needs unit test coverage.
  2. There needs to be an override method or keyword for whatever is triggering recover_subnets, so that we don't get tripped up by this again (see the sketch after this list). If we don't add something in the pgamit codebase for this, the other option is to manually delete the postgres tables and data entries for the days that will be reprocessed, which sounds labor intensive and error prone.
  3. We need a process and template for filing bugs, so that we're working together as a team to address these issues. In general, it is incumbent on the person finding the bug to file the bug report with enough detail and a minimally reproducible example for replication. Part of the reason that open source projects adopt this workflow is to reduce the cumulative workload across all contributors... and also because when users are tracking down the information needed to file a bug, they may realize the source of the behavior, such as subnetwork entries already being present for the DOY and run in question.
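
A minimal sketch of that override, guarding the recover_subnets branch quoted earlier (the force_recluster keyword name is hypothetical):

        # 'force_recluster' is a hypothetical keyword; when True, ignore any
        # existing gamit_subnets rows and rebuild the subnetworks from scratch
        if len(db_subnets) > 0 and not force_recluster:
            clusters, backbone, ties = self.recover_subnets(db_subnets,
                                                            stn_active)
        else:
            # fall through to the make_clusters / backbone logic
            ...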

Renaming this issue to reflect the above testing.

espg commented 3 weeks ago

As a reminder, this is the current testing coverage for the pyNetwork module:

Name                                 Stmts   Miss  Cover   Missing
------------------------------------------------------------------
pyNetwork.py                          163    133    18%   53, 58, 63, 66,  
                                                          74-195, 200-252, 
                                                          256-307, 313-349,
                                                          356-392, 397-418

espg commented 3 weeks ago

This is resolved; the cluster module is tested, and the behavior is documented.