IndexError in `plot_global_network`

espg commented 3 weeks ago

Describe the bug

Posting here to document the minimal reproducible example.

Data file:

Steps/Code to Reproduce

import pandas as pd
import numpy as np
from NetPlots import plot_global_network
from cluster import over_cluster, BisectingQMeans, select_central_point

df = pd.read_csv('/Users/espg/Downloads/data-1729709491370_4_shane.csv')
xyz = df[['auto_x','auto_y','auto_z']].values
points = np.stack((x,y,z))
stations = list(zip(df['NetworkCode'], df['StationCode']))

def make_clusters(points, stations):
    # Run initial clustering using bisecting 'q-means'
    qmean = BisectingQMeans(random_state=42)
    qmean.fit(points)
    # snap centroids to closest station coordinate
    central_points = select_central_point(qmean.labels_, points,
                                          qmean.cluster_centers_)
    # expand the initial clusters to overlap stations with neighbors
    OC = over_cluster(qmean.labels_, points, metric='euclidean',
                      neighborhood=5, overlap_points=2)
    # calculate all 'tie' stations
    ties = np.where(np.sum(OC, axis=0) > 1)[0]
    # monotonic labels, compatible with previous data structure / api
    cluster_labels = []
    station_labels = []
    cluster_ties = []

    # init'ed outside of the loop for efficiency...
    stat_labs = np.array(stations) #stations.labels_array()
    for row, cluster in enumerate(OC):
        # Create 'station collections' for compat
        my_stations = [] #StationCollection()
        my_cluster_ties = [] #StationCollection()
        # strip out station id's per cluster...
        for station in stat_labs[cluster]:
            my_stations.append(station) # modified (above line) for testing 
        # append to a regular list for integer indexing at line ~400
        station_labels.append(my_stations)
        cluster_labels.append(np.ones((1, np.sum(cluster)),
                                       dtype=np.int_).squeeze()*row)
        # strip out station id's for tie points....
        for statn in stat_labs[ties[np.isin(ties, np.where(cluster)[0])]]:
            # rebuild as a 'station collection list'
            #my_cluster_ties.append(stations[str(statn)])
            my_cluster_ties.append(statn) # modified (above line) for testing
        # append to a regular list for integer indexing at line ~400
        cluster_ties.append(my_cluster_ties)

    # put everything in a dictionary
    clusters = {'centroids': points[central_points],
                'labels': cluster_labels,
                'stations': station_labels}

    return central_points, OC, qmean.labels_

a, b, c = make_clusters(points.T, stations)

plot_global_network(a, b, c, points.T, None)

Expected Results

Plots, with no errors...

Actual Results

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[46], line 1
----> 1 plot_global_network(a, b, c, points.T, None)

File [~/software/Parallel.GAMIT/pgamit/NetPlots.py:90](http://localhost:8888/lab/tree/~/software/Parallel.GAMIT/pgamit/NetPlots.py#line=89), in plot_global_network(central_points, OC, labels, points, output_path, lat_lon)
     88 points = points.tolist()
     89 # remove centroid point so it's not repeated
---> 90 points.pop(remove[0])
     91 # add central point to beginning so it's the central connection point
     92 points.insert(0, central_points[label])

IndexError: index 0 is out of bounds for axis 0 with size 0

Versions

949d81efe5102447bd6d52ee1201fba4d0aecc2e

espg commented 3 weeks ago

apparently happening in label/cluster 74 (of 90):

for label in np.unique(c):
    print(label)
    #nodes.append(nx.Graph())
    # Select Cluster
    points_ = np.where(b[label])[0]
    # Flag centroid point
    remove = np.where(points_ == a[label])[0]
    points_ = points_.tolist()
    # remove centroid point so it's not repeated
    points_.pop(remove[0])
    # add central point to beginning so it's the central connection point
    points_.insert(0, a[label])
    #nx.add_star(nodes[label], points)

73
74
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[68], line 10
      8 points_ = points_.tolist()
      9 # remove centroid point so it's not repeated
---> 10 points_.pop(remove[0])
     11 # add central point to beginning so it's the central connection point
     12 points_.insert(0, a[label])

IndexError: index 0 is out of bounds for axis 0 with size 0

espg commented 3 weeks ago

very strange behavior, likely occurring from very strange geometry... here's where the error happens for cluster 74:

# retrieve point indices for cluster 74
points_ = np.where(b[74])[0]
points_

array([  5,   7, 147, 155, 206, 221, 229, 301, 469, 542, 585, 586, 718])

# determine which of the above point indices is the central point
# (so it can be moved to the beginning of a list and used as the center of the plot)
remove = np.where(points_ == a[74])[0]
remove

array([], dtype=int64)

The above is not supposed to be empty, which is why we get the pop error from an empty list. We can see the point we're looking for:

a[74]

... and 721 isn't in the array with values array([ 5, 7, 147, 155, 206, 221, 229, 301, 469, 542, 585, 586, 718])

espg commented 3 weeks ago

Here's the plot of what's going on with these stations:

download-2

The purple squares are stations from the array([ 5, 7, 147, 155, 206, 221, 229, 301, 469, 542, 585, 586, 718])
The red square is the centroid that's reported from BisectingKMeans ; note that I didn't say BisectingQMeans. Our "Qmeans" routine inherits from the "Kmeans" sklearn base-class, and centroid calculation is handled within the base-class.
The blue square is what we get from our select_central_point routine inside of pgamit. The reason that we need this routine is because the red square above isn't actually a station, it's the centroid of the cluster-- we use select_central_point to snap that centroid to the closest station so that we reference a real GPS station as the central point in the plots, and not just something floating in space.

Is the bug inside of `select_central_point` then?

No, not really. The code for that function is quite terse, so there isn't much room for things to go wrong:

https://github.com/demiangomez/Parallel.GAMIT/blob/949d81efe5102447bd6d52ee1201fba4d0aecc2e/pgamit/cluster.py#L59-L62

The red square doesn't look like a centroid to me...

Me neither. Keep in mind that the purple squares show the cluster after it's been expanded, so the mass of the centroid from Kmeans/Qmeans was initially likely to be much further south. (The station in India is from expansion off of the station north of Madagascar... similar for the stations on the Arabian peninsula and in southern Europe-- all those stations are from cluster expansion which occurs after centroid assignment in Kmeans/Qmeans). There's also multiple stations in South Africa that are close together, which was pulling the weighting down towards them when centroid assignment did happen.

Is the blue square really the closest station to the red centroid? Some of the purple stations to the south look closer...

It's the closest station using euclidean distance and ECEF coordinates. One of the purple stations to south might be closer if we use lat/lon coordinates and calculate great circle distance (which we can do). No guarantee that it won't happen again though.

What can we do to fix it?

We have no control over the centroid selection from kmeans, and overwriting it is almost certainly more work than it's worth.

The lazy option is to just wrap this in a try except block, and then pick a random station within the cluster membership to act as central point. The central point is only used in the plots, not the numerics, and this is an edge case-- we don't expect it to happen except rarely, and it will fix the error and let us get on with processing.

A more (human) time intensive fix would be to do the same try/except logic, but recalculate the central point using the cluster coordinates in lat/lon space whenever we hit the exception.

demiangomez commented 3 weeks ago

The centroid of the cluster has one important use for adding stations to the processing after the network has been created. Sometimes stations become available after we've done the processing. To not reprocess the entire day, we add the station to the cluster with the closest centroid to the station we want to add and reprocess just that subnet. Thus, maybe it would be better if you recompute the centroid after expanding the subnet so that better reflects the geometry.

espg commented 3 weeks ago

Pushed updates to the PR will overwrite the central_points with an appropriate centroid that corresponds to a central station within the cluster if over_cluster selects a 'blue' rather than 'purple' station above.

There's likely some redundant code in add_missing_station that can get pruned at some point, but #133 will fix our error and work with the current network module code.

espg commented 3 weeks ago

closing this for now-- can reopen if further testing shows additional errors inside of plot_global_network

demiangomez / Parallel.GAMIT