demiangomez / Parallel.GAMIT

Python wrapper to parallelize GAMIT executions
BSD 3-Clause "New" or "Revised" License
37 stars 20 forks source link

Strange behavior and interaction in GamitSession and PyNetwork between clusters, tie_points, and backbone #116

Open espg opened 4 days ago

espg commented 4 days ago

This issue is closely tied this discussion, so please read the linked content before continuing.

Examining the data from this query:

SELECT * FROM public.gamit_subnets where "DOY"='180' and "Year"='2022'

Shows interesting behavior:

df = pd.read_csv('/Users/espg/Downloads/gamit_subnets_180_2022.csv')
df.iloc[1].stations

Which outputs the following (note the color highlight)

$${\color{blue}igs.badg,igs.cas1,igs.coco,igs.daej,igs.darw,igs.dumg,igs.guam,}$$ $${\color{blue}igs.hob2,igs.hrao,igs.iisc,igs.kiru,igs.mal2,igs.mcil,igs.mobs,igs.nklg,igs.pohn,igs.pol2,igs.reun,}$$ $${\color{red}igs.cas1,igs.darw,igs.dumg,igs.hob2,igs.hrao,igs.kiru,igs.mal2,igs.mobs,igs.nklg,igs.pol2,igs.reun}$$

All of the red entries above are duplicates of stations already listed in the blue highlighting.

For public.gamit_subnets on DOY of 2022, there are 17 listed clusters in the data table, with the first cluster (labeled subnet 0) being the backbone network. That leaves 16 clusters, which correspond to the 16 clusters that make_clusters produces. Since index zero in the postgres data table corresponds to the backbone, the indexing is off by 1; i.e., df.iloc[1].stations compares to a[0] and b[0] from a, b = make_clusters(points.T, stations), with "a" and "b" being the clusters dictionary and cluster_ties list respectively.

This is the zero-th entry for cluster stations from the clusters dictionary-- note that it's identical to the blue highlighted text from public.gamit_subnets table for DOY 180 in 2022:

>>> a['stations'][0]
[array(['igs', 'badg'], dtype='<U4'),
 array(['igs', 'cas1'], dtype='<U4'),
 array(['igs', 'coco'], dtype='<U4'),
 array(['igs', 'daej'], dtype='<U4'),
 array(['igs', 'darw'], dtype='<U4'),
 array(['igs', 'dumg'], dtype='<U4'),
 array(['igs', 'guam'], dtype='<U4'),
 array(['igs', 'hob2'], dtype='<U4'),
 array(['igs', 'hrao'], dtype='<U4'),
 array(['igs', 'iisc'], dtype='<U4'),
 array(['igs', 'kiru'], dtype='<U4'),
 array(['igs', 'mal2'], dtype='<U4'),
 array(['igs', 'mcil'], dtype='<U4'),
 array(['igs', 'mobs'], dtype='<U4'),
 array(['igs', 'nklg'], dtype='<U4'),
 array(['igs', 'pohn'], dtype='<U4'),
 array(['igs', 'pol2'], dtype='<U4'),
 array(['igs', 'reun'], dtype='<U4')]

Now, this is the output from the cluster_ties list, which is identical to the red highlighted text from public.gamit_subnets table for DOY 180 in 2022:

>>> b[0]
 [array(['igs', 'cas1'], dtype='<U4'),
 array(['igs', 'darw'], dtype='<U4'),
 array(['igs', 'dumg'], dtype='<U4'),
 array(['igs', 'hob2'], dtype='<U4'),
 array(['igs', 'hrao'], dtype='<U4'),
 array(['igs', 'kiru'], dtype='<U4'),
 array(['igs', 'mal2'], dtype='<U4'),
 array(['igs', 'mobs'], dtype='<U4'),
 array(['igs', 'nklg'], dtype='<U4'),
 array(['igs', 'pol2'], dtype='<U4'),
 array(['igs', 'reun'], dtype='<U4')]

Looking at two additional entries from public.gamit_subnets and the clusters dictionary & cluster_ties list confirms the pattern.

Questions

  1. Was this the case with earlier runs that @eckendrick was doing, such as public.gamit_soln 2022 days 001-008?
    • If not, this might be a bug with these lines that check for tie points repeats on load from the database
    • If the tie and stations are getting added together inside of GamitSession, we can fix the issue with the code from the previous bullet or similar
  2. What is default and preferred behavior for handling stations, and should subnetwork stations include the tie stations?
    • Reading this comment, it looks like currently GamitSession wants these two data objects (tie points and station clusters) not to overlap.
    • Regardless of what the current default behavior is, we should intentionally determine what makes sense for the behavior to be, and if we want to change it.
    • Having the clusters include the tie stations (or not) will impact other downstream code, such as how subnetwork plots are currently handled.
    • Having the clusters include the tie stations (or not) will also impact the 'check' that's run when determining how large the subnetworks are (should it be the 'base' size of the clusters, or the 'expanded' size that includes the tie points)
    • @demiangomez my intuition is that it will make more sense to change the behavior in GamitSession than what is setup in pyNetwork
  3. Regardless of what the default behavior is or where the tie stations and subnetworks are being double merged, we should be testing for repeats:
    • With unit tests that tell us (and fail submitted PRs) if the control logic needlessly duplicates entries
    • With runtime checks that can detect, fix and remove duplicate stations before time consuming numerics
espg commented 4 days ago

@pdsmith90 you asked a couple of weeks ago if unit tests were hard to write... this might be a good first issue to get experience with them if you want some practice writing and setting them up.

We need several tests, but the most simple unit test that we need from above is something that takes a list of stations and checks to see if there are any duplicate entries within that list. We'd likely end up integrating that unit test with other related unit tests in the same test file-- for example, a test that first combines tie stations and subnetworks, and then checks a.) if all the entries from each data structure are present, and b.) if any of those entries are duplicates.

To see how the tests are setup in general, have a look here at how my clustering tests are setup. The link above points to my Parallel.Gamit fork, since those unit tests and the automated framework to run them aren't in our master branch yet. Once @demiangomez merges #109 , any unit tests that you write will automatically run on any push you make to an open pull request into master (or any other branch), which saves you from having to setup the testing framework locally when you're just getting started with these.

demiangomez commented 4 days ago

@espg : the behavior is to include ALL station in the stations array in the database (including ties). Method recover_subnets appends all stations except those in the ties list. The ties are later on appended to the processing list by GamitSession and, if the processing is not "ready", then a new record is inserted in the database

see https://github.com/demiangomez/Parallel.GAMIT/blob/25a82931640f34b80f0d96f8a2fc8358716de16c/pgamit/pyNetwork.py#L338-L340

and https://github.com/demiangomez/Parallel.GAMIT/blob/25a82931640f34b80f0d96f8a2fc8358716de16c/pgamit/pyGamitSession.py#L147-L151

Also, I checked with Eric and days 120-239 were run with -purge, which deletes everything from the database. So this is not an issue from before. There is something in the code that is duplicating stations.

espg commented 4 days ago

There is something in the code that is duplicating stations.

That something in the code that's duplicating happens right here (from 90 onwards), in pyGamitSession:

https://github.com/demiangomez/Parallel.GAMIT/blob/25a82931640f34b80f0d96f8a2fc8358716de16c/pgamit/pyGamitSession.py#L79-L100

...and is documented as happening on the last quoted line here:

https://github.com/demiangomez/Parallel.GAMIT/blob/25a82931640f34b80f0d96f8a2fc8358716de16c/pgamit/pyGamitSession.py#L32-L35

@demiangomez the question is should the stations and station ties be merged into the station_instances list inside of GamitSession? The initial cut on clustering (https://github.com/demiangomez/Parallel.GAMIT/commit/fbb575090e16871a5382a2d124f70309eaea5ac8) didn't include station ties at all, since they were included already in the station list. The reason that the station ties list were added back in was for compatibility in the kml output which marks the tie points with different glyphs. The reason that we don't run the tie point extraction code inside of GamitSession is because we need the full dataset of input coords to do the cluster overlap, and we only pass the station subset to the GamitSession's that are run in parallel.

Right now we have a situation where we:

  1. Initially have a station list that includes ties
  2. Have 6 explicit lines of code in pyNetwork to pull out tie stations for kml compatibility
  3. Have new plotting code already in master for plotting the clustering networks
  4. Have additional code in GamitSession that has the express purpose of replicating what we already have at number 1 of this list

If we decide to address this inside of pyNetwork instead of inside GamitSession, this is what our code flow will look like:

  1. Generate station list with subnetwork and ties
  2. Determine ties as a separate list
  3. Remove ties from station list
  4. Pass station and ties list to GamitSession
  5. Merge station and ties list (again) inside of GamitSession

There are probably better options. We could, as a few possibilities, do some, all, or none of:

  1. Pass the existing station list with the ties to GamitSession and remove the tie + station merging code
  2. Write the stations and ties directly to that database from pyNetwork if we're worried about keeping track of ties
  3. Remove the ties variable altogether since being a tie station has no bearing on the how the actual numerics work for the gamit run
demiangomez commented 4 days ago

Line 90 onwards of pyGamitSession is not duplicating stations that go into the database. It is combining the instances from stations and ties for the GAMIT run. If stations are duplicated when inserted to the database, then it is because the station list contains the ties + stations (created in pyNetwork), as far as I can see.

@demiangomez the question is should the stations and station ties be merged into the station_instances list inside of GamitSession?

As I mentioned before, YES

The initial cut on clustering (fbb5750) didn't include station ties at all, since they were included already in the station list. The reason that the station ties list were added back in was for compatibility in the kml output which marks the tie points with different glyphs.

The are other reasons why we need the ties identified, kml being one, but it is not the most important reason. The point here is that we need to make sure that the station set passed to pyGamitSession does not include the ties. Conclusion: we need pyNetwork to create a station list that does not have the ties and a ties list that contains the ties only. These are then added together by pyGamitSession but they can be separated because a tie list exists. Please change this ASAP so that we can test everything.

Thanks