Closed jchelly closed 1 month ago
This is not ready to merge. I get a failed assert() on the small Colibre test,
The problem was that in SubhaloSnapshot_t::UpdateMostBoundPosition we're looking for particles by ID but we don't know their type so we don't know if they should be found. I've set Type=TypeMax in this case to avoid triggering the check. This didn't show up previously because I was using the orphan_tracers branch so the tracers could never disappear.
Now that the orphan_tracers branch is merged we always expect tracers to exist, so when the particle exchanger is called from UpdateMostBoundPosition we can assert that all particles should be found.
On the small Colibre test this now gives almost identical output to master if I run on one thread and disable random sampling. If I look at the final output the number of subhalos is identical and all of their properties are identical except for the Rank (many differences), Depth (one subhalo differs) and SnapshotIndexOfLastIsolation (one subhalo differs).
I think maybe I've inadvertently changed the ordering of subhalos that have the same mass within their host halo.
If I modify CompareMass_t to use the TrackId as a tie breaker when the masses are equal, then this branch and master produce output that is identical at snapshot 15 except for rounding error level differences in SpecificSelfPotentialEnergy for five halos.
This now sorts subhalos within a halo by Vmax where they have the same bound mass. The vmax calculation has to be moved so that it happens before the sorting operation.
Using Vmax instead of TrackId doesn't seem to be enough to make the ranks match between this and master, possibly due to orphans with Vmax=0. I've implemented Matthieu's suggestion of using the most bound particle ID where Vmax is equal.
In this branch not all duplicate particles have been removed, so using the MostBoundParticleId isn't enough either! Some pairs of subhalos in the Colibre test have Mbound=0, Vmax=0, and equal most bound particle IDs!
To test this I'm comparing output between two versions of the code:
This should test just the effect of distributing the snapshot particles differently. I didn't use master because it still has a buffer overflow until remove_duplicate_particles_in_mergers is merged.
I left a pair of Colibre test runs going last night and they produced identical results apart from a bit of rounding error in SpecificSelfPotentialEnergy. So it appears that partitioning the particles differently between ranks has no effect on the output as long as we specify a definite ranking order within halos.
I'll take a look at the rest of your comments now.
I'll check my test still works with these changes and then I think this is ready to merge.
Excellent, thanks! :D
The test output matches and it completes with asserts enabled, so I'l merge this now.
Currently HBT arranges snapshot particles for easy lookup by ID by putting different ranges of IDs on different MPI ranks. The algorithm used to determine the assignment of ID ranges to ranks seems to become very slow on hydro runs where the IDs are not a contiguous sequence.
This pull request modifies the code to distribute particles according to the hash of their ID. Assigning particles to ranks is then just done by
rank = hash(id) % comm_size
, which is simple and fast but we rely on the hash values being evenly distributed to ensure reasonable load balancing.