Port functions from 'overcluster' library directly into ParallelGAMIT [Merge]

espg commented 2 months ago

This PR merges/ports the BSD licensed functions that were adapted from sklearn directly into PG; these functions were previously hosted in the 'overcluster' library, but are directly added to PG in this PR now that both licenses are BSD.

Note that adding these functions directly into PG does not remove sklearn as a requirement; we still are using the sklearn neighbors module to expand the cluster neighborhoods, and the modified bisectingQMeans function still calls a few validation functions from sklearn.utils as well as importing base classes from sklearn.kmeans* to inherit from.

As written at the time of this PR, this code can be merged 'as is' and should run without error. It is labeled [WIP] for 'Work In Progress' pending some non-critical issues could/should be fixed. These are:

[x] Documentation updates. The docstrings for bisectingQMeans are still reflective of the bisectingKMeans algorithm from which it was changed, rather than the new class/function. The overcluster function doc string is also out of date and needs to be cleaned up.
[x] Modularization / changing + adding parameters for bisectingQMeans. The bisectingKMeans takes parameter n_clusters as the 'K' in 'KMeans', which is the termination condition for the algorithm. For QMeans, this parameter is treated as the number of clusters to start bisecting from, so it is a starting condition and specifically not a termination condition. Furthermore, the termination conditions are currently hardcoded ; they should be available as settable parameters though. If this item is 'checked', then that means that a commit has added the termination parameters as callable / settable parameters (with updated docstrings).
[x] The base bisectingKMeans class allows for a 'bisecting_strategy' set to either inertia or largest cluster... but the modified `bisectingQMeans only has one strategy, and the unused strategy code and documentation needs to be removed. (never called as is, but is still confusing and unnecessary)

espg commented 2 months ago

@demiangomez this is ready for a review/merge. Note that this PR also changes the clustering behavior to be deterministic by fixing the random seed, so that re-running a PG run will now produce identical clustering each time (for reproducibility).

espg commented 2 months ago

@demiangomez I've added in the network plotting function call to this PR -- so you will probably want to merge both this and PR #90 at the same time... otherwise, if this is merged and #90 isn't, there will be an error about the missing plot function...

demiangomez commented 2 months ago

Ok, I will. I'm delaying this merge because with Eric we noticed a big performance issue in the cluster. We want to discard any geometry issues and Eric is now running a test using the previous code. Right now we are seeing that jobs are taking an order of magnitude longer than they used to. This is most likely due to an issue in the cluster, not with the code, but we want to make sure before we talk to ASCTech.

demiangomez commented 2 months ago

Can you communicate this to Eric please?

On Thu, Aug 22, 2024, 16:23 Shane @.***> wrote:

@.**** commented on this pull request.

In parallel_gamit/pyNetwork.py https://github.com/demiangomez/Parallel.GAMIT/pull/87#discussion_r1727765638 :

@@ -193,8 +194,9 @@ def init(self, cnn, archive, GamitConfig, stations, date,
 def make_clusters(self, points, stations, net_limit=NET_LIMIT):
     # Run initial clustering using bisecting 'q-means'
qmean = BisectingQMeans(n_clusters=2, init='random', n_init=50,

algorithm='lloyd', max_iter=8000)

qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,

@demiangomez https://github.com/demiangomez answering you question on slack-- the quickest thing to try is to change the min_clust_size parameter from "4" to "2". Right now qmeans is refusing to execute any cluster splits that result in a child cluster with membership of less than

Setting this to 2 will mean that the only cluster size that isn't allowed is a single lone station. Because this is pre overcluster and the cluster membership will grow to overlap neighboring subnetworks, I'd expect some population of network size around >5 and >10 bins, where the current parameters setting has a smallest cluster size of 22. You could also drop opt_clust_size if you want to move the mode of the distribution down from 34...

— Reply to this email directly, view it on GitHub https://github.com/demiangomez/Parallel.GAMIT/pull/87#pullrequestreview-2255584130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AECLQ2S4TKULMIB7KSRTZYLZSZCEFAVCNFSM6AAAAABMRNOQYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJVGU4DIMJTGA . You are receiving this because you were mentioned.Message ID: @.***>

eckendrick commented 2 months ago

Hi Shane, The version of PG I am using is the master I installed from GitHub on 7August. Do I need to install your PR87 before those suggested changes to pyNetwork.py would work ? Eric

On Thu, Aug 22, 2024 at 4:33 PM Demian Gomez @.***> wrote:

Can you communicate this to Eric please?

On Thu, Aug 22, 2024, 16:23 Shane @.***> wrote:

@.**** commented on this pull request.

In parallel_gamit/pyNetwork.py https://github.com/demiangomez/Parallel.GAMIT/pull/87#discussion_r1727765638 :

@@ -193,8 +194,9 @@ def init(self, cnn, archive, GamitConfig, stations, date,

def make_clusters(self, points, stations, net_limit=NET_LIMIT):

Run initial clustering using bisecting 'q-means'

qmean = BisectingQMeans(n_clusters=2, init='random', n_init=50,

algorithm='lloyd', max_iter=8000)

qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,

@demiangomez https://github.com/demiangomez answering you question on slack-- the quickest thing to try is to change the min_clust_size parameter from "4" to "2". Right now qmeans is refusing to execute any cluster splits that result in a child cluster with membership of less than

Setting this to 2 will mean that the only cluster size that isn't allowed is a single lone station. Because this is pre overcluster and the cluster membership will grow to overlap neighboring subnetworks, I'd expect some population of network size around >5 and >10 bins, where the current parameters setting has a smallest cluster size of 22. You could also drop opt_clust_size if you want to move the mode of the distribution down from 34...

— Reply to this email directly, view it on GitHub https://github.com/demiangomez/Parallel.GAMIT/pull/87#pullrequestreview-2255584130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AECLQ2S4TKULMIB7KSRTZYLZSZCEFAVCNFSM6AAAAABMRNOQYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJVGU4DIMJTGA . You are receiving this because you were mentioned.Message ID: @.***>

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

espg commented 2 months ago

@eckendrick do you mean Parallel.GAMIT when you say 'PG'? Yes, you'd need to update to this ( PR #87 ) code to try changing the parameters, they aren't user settable prior to these updates.

eckendrick commented 2 months ago

Demian and Shane, I installed PR87 so I could test making cluster sizes smaller, but as Shane mentioned, when I try to run this I get the error about the missing plot function. What should my next step be so that I can test smaller cluster sizes ?

espg commented 2 months ago

@eckendrick you can install the previous commit (i.e., d2e616452b6c920ffdb3857cb9d3d04f6a7df0c3 ) using either git or pip for this PR branch which doesn't include the plot code-- see here for more details on how this is done in pip: https://stackoverflow.com/questions/13685920/install-specific-git-commit-with-pip

Another option is to just comment out lines 36 and 240-244 from what you have already downloaded.

espg commented 2 months ago

@demiangomez the path output should be fixed for this now (see here for the new output path definition )

I think that with this PR's #87 , #89 , and #90 are all ready to be merged and run on the unity cluster

demiangomez / Parallel.GAMIT

Port functions from 'overcluster' library directly into ParallelGAMIT [Merge] #87

@.**** commented on this pull request.

@.**** commented on this pull request.

Run initial clustering using bisecting 'q-means'