Closed espg closed 1 month ago
@demiangomez this is ready for a review/merge. Note that this PR also changes the clustering behavior to be deterministic by fixing the random seed, so that re-running a PG run will now produce identical clustering each time (for reproducibility).
@demiangomez I've added in the network plotting function call to this PR -- so you will probably want to merge both this and PR #90 at the same time... otherwise, if this is merged and #90 isn't, there will be an error about the missing plot function...
Ok, I will. I'm delaying this merge because with Eric we noticed a big performance issue in the cluster. We want to discard any geometry issues and Eric is now running a test using the previous code. Right now we are seeing that jobs are taking an order of magnitude longer than they used to. This is most likely due to an issue in the cluster, not with the code, but we want to make sure before we talk to ASCTech.
Can you communicate this to Eric please?
On Thu, Aug 22, 2024, 16:23 Shane @.***> wrote:
@.**** commented on this pull request.
In parallel_gamit/pyNetwork.py https://github.com/demiangomez/Parallel.GAMIT/pull/87#discussion_r1727765638 :
@@ -193,8 +194,9 @@ def init(self, cnn, archive, GamitConfig, stations, date,
def make_clusters(self, points, stations, net_limit=NET_LIMIT): # Run initial clustering using bisecting 'q-means'
- qmean = BisectingQMeans(n_clusters=2, init='random', n_init=50,
- algorithm='lloyd', max_iter=8000)
- qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,
@demiangomez https://github.com/demiangomez answering you question on slack-- the quickest thing to try is to change the min_clust_size parameter from "4" to "2". Right now qmeans is refusing to execute any cluster splits that result in a child cluster with membership of less than
- Setting this to 2 will mean that the only cluster size that isn't allowed is a single lone station. Because this is pre overcluster and the cluster membership will grow to overlap neighboring subnetworks, I'd expect some population of network size around >5 and >10 bins, where the current parameters setting has a smallest cluster size of 22. You could also drop opt_clust_size if you want to move the mode of the distribution down from 34...
— Reply to this email directly, view it on GitHub https://github.com/demiangomez/Parallel.GAMIT/pull/87#pullrequestreview-2255584130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AECLQ2S4TKULMIB7KSRTZYLZSZCEFAVCNFSM6AAAAABMRNOQYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJVGU4DIMJTGA . You are receiving this because you were mentioned.Message ID: @.***>
Hi Shane, The version of PG I am using is the master I installed from GitHub on 7August. Do I need to install your PR87 before those suggested changes to pyNetwork.py would work ? Eric
On Thu, Aug 22, 2024 at 4:33 PM Demian Gomez @.***> wrote:
Can you communicate this to Eric please?
On Thu, Aug 22, 2024, 16:23 Shane @.***> wrote:
@.**** commented on this pull request.
In parallel_gamit/pyNetwork.py https://github.com/demiangomez/Parallel.GAMIT/pull/87#discussion_r1727765638 :
@@ -193,8 +194,9 @@ def init(self, cnn, archive, GamitConfig, stations, date,
def make_clusters(self, points, stations, net_limit=NET_LIMIT):
Run initial clustering using bisecting 'q-means'
- qmean = BisectingQMeans(n_clusters=2, init='random', n_init=50,
- algorithm='lloyd', max_iter=8000)
- qmean = BisectingQMeans(min_clust_size=4, opt_clust_size=20,
@demiangomez https://github.com/demiangomez answering you question on slack-- the quickest thing to try is to change the min_clust_size parameter from "4" to "2". Right now qmeans is refusing to execute any cluster splits that result in a child cluster with membership of less than
- Setting this to 2 will mean that the only cluster size that isn't allowed is a single lone station. Because this is pre overcluster and the cluster membership will grow to overlap neighboring subnetworks, I'd expect some population of network size around >5 and >10 bins, where the current parameters setting has a smallest cluster size of 22. You could also drop opt_clust_size if you want to move the mode of the distribution down from 34...
— Reply to this email directly, view it on GitHub https://github.com/demiangomez/Parallel.GAMIT/pull/87#pullrequestreview-2255584130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AECLQ2S4TKULMIB7KSRTZYLZSZCEFAVCNFSM6AAAAABMRNOQYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJVGU4DIMJTGA . You are receiving this because you were mentioned.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>
@eckendrick do you mean Parallel.GAMIT when you say 'PG'? Yes, you'd need to update to this ( PR #87 ) code to try changing the parameters, they aren't user settable prior to these updates.
Demian and Shane, I installed PR87 so I could test making cluster sizes smaller, but as Shane mentioned, when I try to run this I get the error about the missing plot function. What should my next step be so that I can test smaller cluster sizes ?
@eckendrick you can install the previous commit (i.e., d2e616452b6c920ffdb3857cb9d3d04f6a7df0c3 ) using either git or pip for this PR branch which doesn't include the plot code-- see here for more details on how this is done in pip: https://stackoverflow.com/questions/13685920/install-specific-git-commit-with-pip
Another option is to just comment out lines 36 and 240-244 from what you have already downloaded.
This PR merges/ports the BSD licensed functions that were adapted from sklearn directly into PG; these functions were previously hosted in the 'overcluster' library, but are directly added to PG in this PR now that both licenses are BSD.
Note that adding these functions directly into PG does not remove sklearn as a requirement; we still are using the sklearn
neighbors
module to expand the cluster neighborhoods, and the modified bisectingQMeans function still calls a few validation functions fromsklearn.utils
as well as importing base classes fromsklearn.kmeans*
to inherit from.As written at the time of this PR, this code can be merged 'as is' and should run without error. It is labeled [WIP] for 'Work In Progress' pending some non-critical issues could/should be fixed. These are:bisectingQMeans
are still reflective of thebisectingKMeans
algorithm from which it was changed, rather than the new class/function. Theovercluster
function doc string is also out of date and needs to be cleaned up.bisectingQMeans
. ThebisectingKMeans
takes parametern_clusters
as the 'K' in 'KMeans', which is the termination condition for the algorithm. ForQMeans
, this parameter is treated as the number of clusters to start bisecting from, so it is a starting condition and specifically not a termination condition. Furthermore, the termination conditions are currently hardcoded ; they should be available as settable parameters though. If this item is 'checked', then that means that a commit has added the termination parameters as callable / settable parameters (with updated docstrings).bisectingKMeans
class allows for a 'bisecting_strategy' set to either inertia or largest cluster... but the modified `bisectingQMeans only has one strategy, and the unused strategy code and documentation needs to be removed. (never called as is, but is still confusing and unnecessary)