Closed espg closed 3 weeks ago
I've had a chance to do some additional testing, and `BisectingQMeans` is behaving as expected and discussed above. Here's a gamit run from 2022, day 002 that was giving @eckendrick trouble, using `qmean = BisectingQMeans(min_clust_size=2, opt_clust_size=18)`; I'm able to replicate the large station cluster with these parameters:
```python
mpl.pyplot.hist(clust.labels_, bins=clust.n_clusters)
```
When I implement the solution that I suggested, setting `qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18)`, the problem resolves, and we fall back to the algorithmic guarantee above (i.e., a hard limit on cluster size set by `opt_clust_size=n` when `min_clust_size=1`):
There should still be a unit test added to enforce this behavior in the future-- but that unit test will pass with the current code base when it is added.
As mentioned in the first post above:
That's interesting; the testing I did earlier wasn't able to exceed `opt_clust_size` when `min_clust_size` was set to 1... this makes me think it might actually be a bug in the `over_cluster` function. I didn't check passing the QMeans output to `over_cluster`, but it does get passed to it in our pipeline, and that function might be expanding more than expected for some reason.
...this appears to be what is happening. Here's the same output from the second figure after it is run through `over_cluster`:
```python
>>> OC = over_cluster(clust.labels_, stations.T, metric='euclidean', neighborhood=5, overlap_points=2)
>>> np.sum(OC, axis=1)
array([12, 23, 10, 24, 22, 12, 22, 20, 16, 19, 22, 10, 21, 18, 23, 20, 12,
       20, 15, 20, 17, 19, 16, 23, 21, 15, 23, 23, 23, 16, 19, 17, 19, 17,
       25, 21, 14, 19, 21, 20, 17, 16, 18, 21, 18, 17, 22, 22, 19, 18, 23,
       22, 18, 20, 24, 15, 22, 16, 24, 19,  8, 20, 24, 21, 15, 20, 22, 19,
       25, 22, 14, 21, 21, 17, 16, 20, 16, 21, 20, 19, 19, 21, 15, 13, 23,
       14, 13, 18, 16, 19, 18, 15, 20, 14, 16, 22, 20, 17])
```
There are expected algorithmic guarantees out of `over_cluster`: given that the `neighborhood` parameter sets how many adjacent clusters to include, and that `overlap_points` sets the max number of points per neighborhood to join, the output from `over_cluster` is capped at adding at most `neighborhood * overlap_points` points to the `BisectingQMeans` clusters. For the above code, those parameters are 5 and 2 respectively, which means at most 10 additional points (stations) added to a guaranteed max of 17 (empirically a max of 16 for this run). Our current code respects these limits; again, we need a unit test here, but when it is added, it will pass with the current codebase.
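That cap can be checked mechanically. Here's a minimal sketch (the helper name and the toy matrix are mine, not part of the codebase), assuming `OC` is the boolean cluster-membership matrix returned by `over_cluster`, as in the snippet above:

```python
import numpy as np

def within_overlap_bound(OC, base_max, neighborhood, overlap_points):
    """True if no expanded cluster exceeds the pre-expansion maximum
    plus the neighborhood * overlap_points allowance."""
    sizes = np.sum(OC, axis=1)  # stations per expanded cluster
    return bool(sizes.max() <= base_max + neighborhood * overlap_points)

# toy membership matrix: 3 expanded clusters over 6 stations
OC = np.array([[1, 1, 0, 0, 0, 1],
               [0, 0, 1, 1, 0, 0],
               [0, 1, 0, 0, 1, 1]], dtype=bool)
print(within_overlap_bound(OC, base_max=2, neighborhood=1, overlap_points=1))  # -> True
```

A unit test along these lines would pin the guarantee down against regressions.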
So what's happening-- why are we getting large networks? One possibility is that there's an odd bug related to the termination condition of `over_cluster`. Right now the default parameter for `neighborhood` is 5, but @eckendrick is running the code on some sub regions, so what happens when there are fewer than 5 neighborhoods? The short answer is that I'm not sure, and we haven't tested for this case, so it might be our problem... or it might behave fine, and we might not actually have a problem at all, but just need to modify some of the parameters.
@eckendrick, what is your current input into `over_cluster`? If `neighborhood` and `overlap_points` are set to 5 and 2 respectively, the max addition is 10 stations... however, if they're both set to 5, then the max addition is 25. Can you check what those are set to?
As I mentioned last week--
Changing the over_cluster parameters might help right now, and I'll need to test that function too when I debug in more detail next week.
I don't know what the exact behavior is for `over_cluster` when `neighborhood` is >= the total number of clusters out of `BisectingQMeans`, and will need to do some testing. However, I do know the behavior when the total number of clusters is > `neighborhood`, and that is that the code works. One thing that we can do while I write the unit tests and debug some of the edge case behavior is to avoid the edge cases. You can do this by changing the `over_cluster` parameters, specifically setting `neighborhood` equal to 3. Given our cutoff between when we run `make_clusters` vs when we don't, I expect that we'll always have at least 3 clusters since we're clustering a minimum of 45 points. Note that we could in fact guarantee this behavior by setting `min_clust_size=1` and `opt_clust_size=15`, as a 45 station submission to `BisectingQMeans` will definitionally require at least 3 clusters to be made.
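The "at least 3 clusters" claim is just ceiling division: with `opt_clust_size` acting as a hard cap (i.e., `min_clust_size=1`), n points cannot fit into fewer than ceil(n / opt_clust_size) clusters. A quick sketch (the helper name is mine):

```python
import math

def min_cluster_count(n_points, opt_clust_size):
    # opt_clust_size is a hard cap per cluster when min_clust_size=1,
    # so at least this many clusters are required
    return math.ceil(n_points / opt_clust_size)

print(min_cluster_count(45, 15))  # -> 3
print(min_cluster_count(46, 15))  # -> 4
```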
We still have the network type parameter which is specified as global or regional. Maybe you can use this to set the parameters accordingly?
That sounds reasonable; I'll look at adding a conditional parameter switch for that inside of `pyNetwork` when I code up the unit tests.
Ok, thanks. I forget the name of the parameter, but it is in the CFG for the GAMIT run. Eric can provide you with one.
I need @eckendrick to clarify where and how his runs are diverging from expected behavior. From the unit tests, everything looks fine and within algorithmic expectations. Put another way, this output:
```
grep min_clust_size pyNetwork.py
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
```
but my networks are still large. I'm currently testing for our Antarctic processing,
so there are ~130 stations each day. Here are the sizes of the subnets for 2022 day 180
```
g2regain.OSU00/monitor.log:2024-09-11 21:23:52 -> About to execute GAMIT - system count G: 44
g2regain.OSU01/monitor.log:2024-09-11 21:22:51 -> About to execute GAMIT - system count G: 29
g2regain.OSU02/monitor.log:2024-09-11 21:22:32 -> About to execute GAMIT - system count G: 35
g2regain.OSU03/monitor.log:2024-09-11 21:21:18 -> About to execute GAMIT - system count G: 27
g2regain.OSU04/monitor.log:2024-09-11 21:19:29 -> About to execute GAMIT - system count G: 22
g2regain.OSU05/monitor.log:2024-09-11 21:21:15 -> About to execute GAMIT - system count G: 40
g2regain.OSU06/monitor.log:2024-09-11 21:19:06 -> About to execute GAMIT - system count G: 20
g2regain.OSU07/monitor.log:2024-09-11 21:20:18 -> About to execute GAMIT - system count G: 22
g2regain.OSU08/monitor.log:2024-09-11 21:20:24 -> About to execute GAMIT - system count G: 27
g2regain.OSU09/monitor.log:2024-09-11 21:20:07 -> About to execute GAMIT - system count G: 29
g2regain.OSU10/monitor.log:2024-09-11 21:21:43 -> About to execute GAMIT - system count G: 39
g2regain.OSU11/monitor.log:2024-09-11 21:19:23 -> About to execute GAMIT - system count G: 18
g2regain.OSU12/monitor.log:2024-09-11 21:19:51 -> About to execute GAMIT - system count G: 22
g2regain.OSU13/monitor.log:2024-09-11 21:20:25 -> About to execute GAMIT - system count G: 32
g2regain.OSU14/monitor.log:2024-09-11 21:20:41 -> About to execute GAMIT - system count G: 35
g2regain.OSU15/monitor.log:2024-09-11 21:20:22 -> About to execute GAMIT - system count G: 36
g2regain.OSU16/monitor.log:2024-09-11 21:21:55 -> About to execute GAMIT - system count G: 29
```
...does not look problematic to me. There are no station clusters that are above 50. If we want smaller station clusters, the max size is bounded by this equation:
**Max cluster size <= `opt_clust_size` + (`neighborhood` * `overlap_points`), when `min_clust_size=1`**
with `opt_clust_size` & `min_clust_size=1` being set in `BisectingQMeans`, and `neighborhood` & `overlap_points` being set in `over_cluster`.
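Plugging the parameter sets discussed in this thread into that bound (a sketch of the arithmetic only; the function name is mine):

```python
def max_cluster_size(opt_clust_size, neighborhood, overlap_points):
    # upper bound on final cluster membership, when min_clust_size=1
    # makes opt_clust_size a hard cap before over_cluster expansion
    return opt_clust_size + neighborhood * overlap_points

print(max_cluster_size(18, 5, 2))  # -> 28
print(max_cluster_size(18, 5, 5))  # -> 43
```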
I am still working on enhancements to the underlying `BisectingQMeans` algorithm, and will look at @demiangomez's suggestion to change the default parameter settings that go into `over_cluster` depending on whether we are running a regional vs a global network. That said, this should not be a blocking issue for processing with ParallelGAMIT at the moment, given that all the algorithm parameters can be set to get the station cluster sizes that you want.
Hi Shane, I misunderstood your comment about the algorithm refusing to generate any clusters that have a membership larger than whatever `opt_clust_size` is set to. Since I had `opt_clust_size` set to 18, I was expecting networks 01 and larger to have a maximum size of 18. Eric

> a temporary fix that will work immediately (and that I've tested) is to set the parameter `min_clust_size` equal to one (1). Doing this will make the `opt_clust_size` parameter a hard parameter, i.e., the algorithm will refuse to generate any clusters that have a membership larger than whatever `opt_clust_size` is set to.
@eckendrick, can you tell me what your `over_cluster` parameter values for `neighborhood` and `overlap_points` were set to for those runs above?

If they were set to `neighborhood=5` and `overlap_points=2`, then we still have a bug, since 18 + (5*2) = 28 and you have clusters in the 30s and 40s. On the other hand, if the values are `neighborhood=5` and `overlap_points=5`, then we might not have a bug (assuming that the OSU00 net doesn't get passed to `make_clusters` if it's a single cluster under the 45 station member network backbone threshold).
We still have the network type parameter which is specified as global or regional. Maybe you can use this to set the parameters accordingly?
@demiangomez I've given this some more thought, and did something else in #104 which explicitly checks for when the number of clusters going into `over_cluster` is low (close to / equal to the `neighborhood` parameter), and adjusts the `neighborhood` parameter value downwards so it never exceeds the number of clusters that go into that function... so I don't think we'll need to put a case switch for regional vs global networks now.
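The adjustment described above might look roughly like this (a hypothetical sketch with names of my choosing; the actual change lives in #104):

```python
import numpy as np

def clamp_neighborhood(labels, neighborhood):
    """Shrink the requested neighborhood so it never exceeds the number
    of neighboring clusters actually available (hypothetical helper)."""
    n_clusters = len(np.unique(labels))
    # a cluster can borrow overlap points from at most n_clusters - 1 neighbors
    return min(neighborhood, max(n_clusters - 1, 1))

print(clamp_neighborhood(np.array([0, 0, 1, 1, 2]), neighborhood=5))  # -> 2
print(clamp_neighborhood(np.arange(10), neighborhood=5))              # -> 5
```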
@espg my settings for neighborhood and overlap_points are `neighborhood=5, overlap_points=2`
@demiangomez the large networks reported are not replicable in isolation when using the parameters that I recommended on Sept 10th and 11th-- as a reminder, this is what @eckendrick reported:
I tried changing min_clust_size from 2 to 1, and kept opt_clust_size at 18
```
grep min_clust_size pyNetwork.py
qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
```
but my networks are still large. I'm currently testing for our Antarctic processing, so there are ~130 stations each day. Here are the sizes of the subnets for 2022 day 180
```
g2regain.OSU00/monitor.log:2024-09-11 21:23:52 -> About to execute GAMIT - system count G: 44
g2regain.OSU01/monitor.log:2024-09-11 21:22:51 -> About to execute GAMIT - system count G: 29
g2regain.OSU02/monitor.log:2024-09-11 21:22:32 -> About to execute GAMIT - system count G: 35
g2regain.OSU03/monitor.log:2024-09-11 21:21:18 -> About to execute GAMIT - system count G: 27
g2regain.OSU04/monitor.log:2024-09-11 21:19:29 -> About to execute GAMIT - system count G: 22
g2regain.OSU05/monitor.log:2024-09-11 21:21:15 -> About to execute GAMIT - system count G: 40
g2regain.OSU06/monitor.log:2024-09-11 21:19:06 -> About to execute GAMIT - system count G: 20
g2regain.OSU07/monitor.log:2024-09-11 21:20:18 -> About to execute GAMIT - system count G: 22
g2regain.OSU08/monitor.log:2024-09-11 21:20:24 -> About to execute GAMIT - system count G: 27
g2regain.OSU09/monitor.log:2024-09-11 21:20:07 -> About to execute GAMIT - system count G: 29
g2regain.OSU10/monitor.log:2024-09-11 21:21:43 -> About to execute GAMIT - system count G: 39
g2regain.OSU11/monitor.log:2024-09-11 21:19:23 -> About to execute GAMIT - system count G: 18
g2regain.OSU12/monitor.log:2024-09-11 21:19:51 -> About to execute GAMIT - system count G: 22
g2regain.OSU13/monitor.log:2024-09-11 21:20:25 -> About to execute GAMIT - system count G: 32
g2regain.OSU14/monitor.log:2024-09-11 21:20:41 -> About to execute GAMIT - system count G: 35
g2regain.OSU15/monitor.log:2024-09-11 21:20:22 -> About to execute GAMIT - system count G: 36
g2regain.OSU16/monitor.log:2024-09-11 21:21:55 -> About to execute GAMIT - system count G: 29
```
Here's the code running the exact same data input through the `make_clusters` function from `pyNetwork`:
```python
from NetClusters import BisectingQMeans, over_cluster, select_central_point
from pyStation import StationCollection
import pandas as pd
import numpy as np

# same input data
df = pd.read_csv('/Users/espg/Downloads/gamit_subnets_180_2022.csv')
xyz = df[['X', 'Y', 'Z']].values
points = np.stack((xyz[:, 0], xyz[:, 1], xyz[:, 2]))  # shape (3, n_stations)
stations = list(zip(df['NetworkCode'], df['StationCode']))

# Same 'make_clusters' function, with minimal lines modified for output handling
def make_clusters(points, stations):
    # Run initial clustering using bisecting 'q-means'
    # Function call matching @eckendrick parameters
    qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
                            init='random', n_init=50, algorithm='lloyd',
                            max_iter=8000, random_state=42, n_clusters=2)
    qmean.fit(points)
    # snap centroids to closest station coordinate
    central_points = select_central_point(qmean.labels_, points,
                                          qmean.cluster_centers_)
    # expand the initial clusters to overlap stations with neighbors
    OC = over_cluster(qmean.labels_, points, metric='euclidean',
                      neighborhood=5, overlap_points=2)
    # calculate all 'tie' stations
    ties = np.where(np.sum(OC, axis=0) > 1)[0]
    # monotonic labels, compatible with previous data structure / api
    cluster_labels = []
    station_labels = []
    cluster_ties = []
    # init'ed outside of the loop for efficiency...
    stat_labs = np.array(stations)  # stations.labels_array()
    for row, cluster in enumerate(OC):
        # Create 'station collections' for compat
        my_stations = []  # StationCollection()
        my_cluster_ties = []  # StationCollection()
        # strip out station id's per cluster...
        for station in stat_labs[cluster]:
            # rebuild as a 'station collection list'
            # my_stations.append(stations[str(station)])
            my_stations.append(station)  # modified (above line) for testing
        # append to a regular list for integer indexing at line ~400
        station_labels.append(my_stations)
        cluster_labels.append(np.ones((1, np.sum(cluster)),
                                      dtype=np.int_).squeeze()*row)
        # strip out station id's for tie points....
        for statn in stat_labs[ties[np.isin(ties, np.where(cluster)[0])]]:
            # rebuild as a 'station collection list'
            # my_cluster_ties.append(stations[str(statn)])
            my_cluster_ties.append(statn)  # modified (above line) for testing
        # append to a regular list for integer indexing at line ~400
        cluster_ties.append(my_cluster_ties)
    # put everything in a dictionary
    clusters = {'centroids': points[central_points],
                'labels': cluster_labels,
                'stations': station_labels}
    return clusters, cluster_ties

a, b = make_clusters(points.T, stations)
```
These are the output clusters (after `BisectingQMeans`, `over_cluster`, and associated parsing):
>>> a['stations']
[[array(['igs', 'badg'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'coco'], dtype='<U4'),
array(['igs', 'daej'], dtype='<U4'),
array(['igs', 'darw'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'guam'], dtype='<U4'),
array(['igs', 'hob2'], dtype='<U4'),
array(['igs', 'hrao'], dtype='<U4'),
array(['igs', 'iisc'], dtype='<U4'),
array(['igs', 'kiru'], dtype='<U4'),
array(['igs', 'mal2'], dtype='<U4'),
array(['igs', 'mcil'], dtype='<U4'),
array(['igs', 'mobs'], dtype='<U4'),
array(['igs', 'nklg'], dtype='<U4'),
array(['igs', 'pohn'], dtype='<U4'),
array(['igs', 'pol2'], dtype='<U4'),
array(['igs', 'reun'], dtype='<U4')],
[array(['igs', 'chpi'], dtype='<U4'),
array(['igs', 'ckis'], dtype='<U4'),
array(['igs', 'cro1'], dtype='<U4'),
array(['igs', 'drao'], dtype='<U4'),
array(['igs', 'fair'], dtype='<U4'),
array(['igs', 'falk'], dtype='<U4'),
array(['igs', 'gode'], dtype='<U4'),
array(['igs', 'holm'], dtype='<U4'),
array(['igs', 'hrao'], dtype='<U4'),
array(['igs', 'kiru'], dtype='<U4'),
array(['igs', 'kokb'], dtype='<U4'),
array(['igs', 'kour'], dtype='<U4'),
array(['igs', 'mal2'], dtype='<U4'),
array(['igs', 'mas1'], dtype='<U4'),
array(['igs', 'mate'], dtype='<U4'),
array(['igs', 'mkea'], dtype='<U4'),
array(['igs', 'nklg'], dtype='<U4'),
array(['igs', 'pie1'], dtype='<U4'),
array(['igs', 'pol2'], dtype='<U4'),
array(['igs', 'sthl'], dtype='<U4'),
array(['igs', 'stjo'], dtype='<U4'),
array(['igs', 'thti'], dtype='<U4'),
array(['igs', 'vndp'], dtype='<U4')],
[array(['igs', 'chpi'], dtype='<U4'),
array(['igs', 'dav1'], dtype='<U4'),
array(['igs', 'hrao'], dtype='<U4'),
array(['igs', 'jctw'], dtype='<U4'),
array(['igs', 'kerg'], dtype='<U4'),
array(['igs', 'krgg'], dtype='<U4'),
array(['igs', 'mal2'], dtype='<U4'),
array(['igs', 'maw1'], dtype='<U4'),
array(['igs', 'nklg'], dtype='<U4'),
array(['igs', 'reun'], dtype='<U4'),
array(['igs', 'simo'], dtype='<U4'),
array(['igs', 'sthl'], dtype='<U4'),
array(['igs', 'suth'], dtype='<U4'),
array(['igs', 'syog'], dtype='<U4'),
array(['igs', 'vesl'], dtype='<U4'),
array(['tig', 'cztg'], dtype='<U4'),
array(['tig', 'hnus'], dtype='<U4')],
[array(['igs', 'bren'], dtype='<U4'),
array(['igs', 'falk'], dtype='<U4'),
array(['igs', 'fos1'], dtype='<U4'),
array(['igs', 'frei'], dtype='<U4'),
array(['igs', 'kepa'], dtype='<U4'),
array(['igs', 'lply'], dtype='<U4'),
array(['igs', 'ohi3'], dtype='<U4'),
array(['igs', 'parc'], dtype='<U4'),
array(['igs', 'rgdg'], dtype='<U4'),
array(['igs', 'rio2'], dtype='<U4'),
array(['igs', 'roth'], dtype='<U4'),
array(['igs', 'sgp5'], dtype='<U4'),
array(['igs', 'thur'], dtype='<U4'),
array(['igs', 'vesl'], dtype='<U4')],
[array(['ans', 'belg'], dtype='<U4'),
array(['ant', 'smr5'], dtype='<U4'),
array(['igs', 'bren'], dtype='<U4'),
array(['igs', 'bsa1'], dtype='<U4'),
array(['igs', 'fos1'], dtype='<U4'),
array(['igs', 'hown'], dtype='<U4'),
array(['igs', 'lply'], dtype='<U4'),
array(['igs', 'rgdg'], dtype='<U4'),
array(['igs', 'roth'], dtype='<U4'),
array(['igs', 'sgp3'], dtype='<U4'),
array(['igs', 'sgp4'], dtype='<U4'),
array(['igs', 'sgp5'], dtype='<U4'),
array(['igs', 'trve'], dtype='<U4'),
array(['igs', 'wiln'], dtype='<U4'),
array(['igs', 'wlch'], dtype='<U4')],
[array(['ans', 'belg'], dtype='<U4'),
array(['ant', 'sprz'], dtype='<U4'),
array(['arg', 'mbio'], dtype='<U4'),
array(['igs', 'bren'], dtype='<U4'),
array(['igs', 'bsa1'], dtype='<U4'),
array(['igs', 'capf'], dtype='<U4'),
array(['igs', 'dupt'], dtype='<U4'),
array(['igs', 'fos1'], dtype='<U4'),
array(['igs', 'frei'], dtype='<U4'),
array(['igs', 'hown'], dtype='<U4'),
array(['igs', 'hugo'], dtype='<U4'),
array(['igs', 'lply'], dtype='<U4'),
array(['igs', 'ohi2'], dtype='<U4'),
array(['igs', 'ohi3'], dtype='<U4'),
array(['igs', 'pal2'], dtype='<U4'),
array(['igs', 'palm'], dtype='<U4'),
array(['igs', 'rgdg'], dtype='<U4'),
array(['igs', 'rio2'], dtype='<U4'),
array(['igs', 'robn'], dtype='<U4'),
array(['igs', 'sgp1'], dtype='<U4'),
array(['igs', 'sgp3'], dtype='<U4'),
array(['igs', 'sgp4'], dtype='<U4'),
array(['igs', 'sgp5'], dtype='<U4'),
array(['igs', 'spgt'], dtype='<U4')],
[array(['ant', 'sprz'], dtype='<U4'),
array(['igs', 'chpi'], dtype='<U4'),
array(['igs', 'cro1'], dtype='<U4'),
array(['igs', 'falk'], dtype='<U4'),
array(['igs', 'frei'], dtype='<U4'),
array(['igs', 'kepa'], dtype='<U4'),
array(['igs', 'kour'], dtype='<U4'),
array(['igs', 'sgp5'], dtype='<U4'),
array(['igs', 'sthl'], dtype='<U4')],
[array(['igs', 'bean'], dtype='<U4'),
array(['igs', 'bren'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'fos1'], dtype='<U4'),
array(['igs', 'gmez'], dtype='<U4'),
array(['igs', 'haag'], dtype='<U4'),
array(['igs', 'hown'], dtype='<U4'),
array(['igs', 'hton'], dtype='<U4'),
array(['igs', 'jnsn'], dtype='<U4'),
array(['igs', 'lntk'], dtype='<U4'),
array(['igs', 'mtjn'], dtype='<U4'),
array(['igs', 'pirt'], dtype='<U4'),
array(['igs', 'sgp3'], dtype='<U4'),
array(['igs', 'sgp4'], dtype='<U4'),
array(['igs', 'trve'], dtype='<U4'),
array(['igs', 'wiln'], dtype='<U4'),
array(['igs', 'wlch'], dtype='<U4')],
[array(['igs', 'berp'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'gldk'], dtype='<U4'),
array(['igs', 'haag'], dtype='<U4'),
array(['igs', 'hown'], dtype='<U4'),
array(['igs', 'lntk'], dtype='<U4'),
array(['igs', 'lply'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'mmrp'], dtype='<U4'),
array(['igs', 'mrtp'], dtype='<U4'),
array(['igs', 'mtak'], dtype='<U4'),
array(['igs', 'mtjn'], dtype='<U4'),
array(['igs', 'pirt'], dtype='<U4'),
array(['igs', 'sgp4'], dtype='<U4'),
array(['igs', 'sltr'], dtype='<U4'),
array(['igs', 'thur'], dtype='<U4'),
array(['igs', 'tomo'], dtype='<U4'),
array(['igs', 'trve'], dtype='<U4'),
array(['igs', 'wlch'], dtype='<U4'),
array(['igs', 'wlrd'], dtype='<U4')],
[array(['ans', 'belg'], dtype='<U4'),
array(['igs', 'amu2'], dtype='<U4'),
array(['igs', 'bean'], dtype='<U4'),
array(['igs', 'benn'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'crdi'], dtype='<U4'),
array(['igs', 'haag'], dtype='<U4'),
array(['igs', 'hown'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'mtjn'], dtype='<U4'),
array(['igs', 'pece'], dtype='<U4'),
array(['igs', 'pirt'], dtype='<U4'),
array(['igs', 'rmbo'], dtype='<U4'),
array(['igs', 'stew'], dtype='<U4'),
array(['igs', 'vesl'], dtype='<U4'),
array(['igs', 'whtm'], dtype='<U4'),
array(['igs', 'wiln'], dtype='<U4'),
array(['igs', 'wlch'], dtype='<U4'),
array(['igs', 'wlct'], dtype='<U4'),
array(['igs', 'wlrd'], dtype='<U4')],
[array(['ans', 'belg'], dtype='<U4'),
array(['ant', 'zhon'], dtype='<U4'),
array(['arg', 'mbio'], dtype='<U4'),
array(['igs', 'bren'], dtype='<U4'),
array(['igs', 'crdi'], dtype='<U4'),
array(['igs', 'hton'], dtype='<U4'),
array(['igs', 'maw1'], dtype='<U4'),
array(['igs', 'syog'], dtype='<U4'),
array(['igs', 'trve'], dtype='<U4'),
array(['igs', 'vesl'], dtype='<U4'),
array(['igs', 'wlch'], dtype='<U4')],
[array(['ant', 'tnb1'], dtype='<U4'),
array(['ant', 'tnb2'], dtype='<U4'),
array(['igs', 'benn'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'mac1'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'vl01'], dtype='<U4'),
array(['igs', 'vl12'], dtype='<U4'),
array(['igs', 'vl30'], dtype='<U4')],
[array(['ant', 'tnb1'], dtype='<U4'),
array(['ant', 'tnb2'], dtype='<U4'),
array(['igs', 'abbz'], dtype='<U4'),
array(['igs', 'amu2'], dtype='<U4'),
array(['igs', 'arht'], dtype='<U4'),
array(['igs', 'benn'], dtype='<U4'),
array(['igs', 'buri'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'con2'], dtype='<U4'),
array(['igs', 'cote'], dtype='<U4'),
array(['igs', 'flm5'], dtype='<U4'),
array(['igs', 'ftp4'], dtype='<U4'),
array(['igs', 'hog2'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'mcm4'], dtype='<U4'),
array(['igs', 'mcmd'], dtype='<U4'),
array(['igs', 'min0'], dtype='<U4'),
array(['igs', 'mtak'], dtype='<U4'),
array(['igs', 'phig'], dtype='<U4'),
array(['igs', 'rob4'], dtype='<U4'),
array(['igs', 'sctb'], dtype='<U4'),
array(['igs', 'tomo'], dtype='<U4'),
array(['igs', 'vl01'], dtype='<U4'),
array(['igs', 'vl12'], dtype='<U4')],
[array(['igs', 'benn'], dtype='<U4'),
array(['igs', 'clrk'], dtype='<U4'),
array(['igs', 'haag'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'min0'], dtype='<U4'),
array(['igs', 'mtak'], dtype='<U4'),
array(['igs', 'sctb'], dtype='<U4'),
array(['igs', 'tomo'], dtype='<U4'),
array(['igs', 'vl01'], dtype='<U4'),
array(['igs', 'whtm'], dtype='<U4')],
[array(['ant', 'zhon'], dtype='<U4'),
array(['igs', 'amu2'], dtype='<U4'),
array(['igs', 'buri'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'dav1'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'flm5'], dtype='<U4'),
array(['igs', 'kerg'], dtype='<U4'),
array(['igs', 'krgg'], dtype='<U4'),
array(['igs', 'maw1'], dtype='<U4'),
array(['igs', 'syog'], dtype='<U4'),
array(['igs', 'vl30'], dtype='<U4')],
[array(['ant', 'tnb1'], dtype='<U4'),
array(['ant', 'tnb2'], dtype='<U4'),
array(['igs', 'auck'], dtype='<U4'),
array(['igs', 'cas1'], dtype='<U4'),
array(['igs', 'ckis'], dtype='<U4'),
array(['igs', 'darw'], dtype='<U4'),
array(['igs', 'dumg'], dtype='<U4'),
array(['igs', 'dund'], dtype='<U4'),
array(['igs', 'hob2'], dtype='<U4'),
array(['igs', 'mac1'], dtype='<U4'),
array(['igs', 'mcar'], dtype='<U4'),
array(['igs', 'mobs'], dtype='<U4'),
array(['igs', 'mqzg'], dtype='<U4'),
array(['igs', 'thti'], dtype='<U4'),
array(['igs', 'vl30'], dtype='<U4')]]
These are the counts for membership of the 16 produced station clusters:
>>> for thing in a['stations']:
... print(len(thing))
18
23
17
14
15
24
9
17
20
20
11
11
25
10
12
15
To summarize:

- The large station cluster is not produced by this `make_clusters` run, nor was there one present in September.
- `BisectingQMeans` is as expected when setting `min_clust_size=1` and `opt_clust_size=18`.
- `over_cluster` is as expected when setting the (default) values of `neighborhood=5` & `overlap_points=2`: at most `neighborhood * overlap_points` = 10 stations are added per cluster.
- The `make_clusters` calls seem to be working, as only the minimal lines have been changed from master (`diff test.py test2.py`):

```diff
1c1
< def make_clusters(points, stations):
---
> def make_clusters(self, points, stations, net_limit=NET_LIMIT):
3c3
< qmean = BisectingQMeans(min_clust_size=1, opt_clust_size=18,
---
> qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,
21c21
< stat_labs = np.array(stations)
---
> stat_labs = stations.labels_array()
24,25c24,25
< my_stations = []
< my_cluster_ties = []
---
> my_stations = StationCollection()
> my_cluster_ties = StationCollection()
29c29
< my_stations.append(station)
---
> my_stations.append(stations[str(station)])
37c37
< my_cluster_ties.append(statn)
---
> my_cluster_ties.append(stations[str(statn)])
```
My guess is that these lines from `pyNetwork` are getting executed:
```python
# find out if this project-day has been processed before
db_subnets = cnn.query_float('SELECT * FROM gamit_subnets '
                             'WHERE "Project" = \'%s\' AND "Year" = %i AND '
                             '"DOY" = %i ORDER BY "subnet"'
                             % (self.name, date.year, date.doy), as_dict=True)

stn_active = stations.get_active_stations(date)
chk_active = check_stations.get_active_stations(date)

if len(db_subnets) > 0:
    tqdm.write(' >> %s %s %s -> Processing already exists'
               % (datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                  self.name, date.yyyyddd()))
    # sub-network already exists, put information in lists
    dba_stn = [stn for net in db_subnets for stn in net['stations']]
    # DDG: deprecated, aliases are now fixed and kept constant
    # dba_alias = [alias for net in db_subnets for alias in net['alias']]
    # make the necessary changes to the stations aliases (so they
    # match those in the database)
    # stations.replace_alias(dba_stn, dba_alias)
    # build the sub-networks using the information in the database
    clusters, backbone, ties = self.recover_subnets(db_subnets,
                                                    stn_active)
```
Why do I think this is happening? Because there's already an entry in the database for subnetworks on day 180 of 2022:
```sql
SELECT * FROM public.gamit_subnets where "DOY"='180' and "Year"='2022'
```
Note that both the output from @eckendrick's initial report and the `public.gamit_subnets` table list 17 clusters, while running `make_clusters` produces 16 total clusters for subnetworks.
The immediate fix is to drop data from both `public.gamit_subnets` and `public.gamit_soln` where `"DOY"='180'` and `"Year"='2022'`. However, we're likely to run into the same issue again on previous runs, so we should add unit test coverage to the `pyNetwork` module:

- `make_clusters` would be easier to call tests on as a standalone function, instead of the class method that it is currently integrated as.
- The other methods (`backbone_delauney`, `recover_subnets`, and `add_missing_station`) aren't tested, and probably need unit test coverage.
- The `__init__` logic of `pyNetwork` isn't tested, and likely needs unit test coverage.
- We should add a check around `recover_subnets` so that we don't get tripped up by this again. If we don't add something in the pgamit codebase for this, the other option is to manually delete the postgres tables and data entries for the days that will be reprocessed, which sounds labor intensive and error prone.

Renaming this issue to reflect the above testing.
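One shape that `recover_subnets` guard could take (purely illustrative; `force_recluster` and the return values are hypothetical, not the pgamit API):

```python
def choose_subnet_path(db_subnets, force_recluster=False):
    """Decide between reusing stored subnetworks and re-clustering.
    Hypothetical sketch of the guard discussed above."""
    if db_subnets and not force_recluster:
        return 'recover_subnets'  # reuse what gamit_subnets already holds
    return 'make_clusters'        # re-run the clustering pipeline

print(choose_subnet_path([{'subnet': 0}]))                        # -> recover_subnets
print(choose_subnet_path([{'subnet': 0}], force_recluster=True))  # -> make_clusters
print(choose_subnet_path([]))                                     # -> make_clusters
```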
as a reminder, this is the current testing coverage for the `pyNetwork` module:

```
Name           Stmts   Miss  Cover   Missing
------------------------------------------------------------------
pyNetwork.py     163    133    18%   53, 58, 63, 66,
                                     74-195, 200-252,
                                     256-307, 313-349,
                                     356-392, 397-418
```
This is resolved; the `cluster` module is tested, and its behavior is documented.
@demiangomez @eckendrick moving this discussion off of email and onto github so we can track it.

This is the original bug report from @eckendrick:

My response on fixing this using `min_clust_size=1`:

@eckendrick's follow-up testing:

...and my thoughts from last week: