SWIFTSIM / HBTplus

HBTplus halo finder adapted for the FLAMINGO and COLIBRE simulations
0 stars 0 forks source link

Distribute particles between MPI ranks using hash of particle ID #27

Closed jchelly closed 1 month ago

jchelly commented 2 months ago

Currently HBT arranges snapshot particles for easy lookup by ID by putting different ranges of IDs on different MPI ranks. The algorithm used to determine the assignment of ID ranges to ranks seems to become very slow on hydro runs where the IDs are not a contiguous sequence.

This pull request modifies the code to distribute particles according to the hash of their ID. Assigning particles to ranks is then just done by rank = hash(id) % comm_size, which is simple and fast but we rely on the hash values being evenly distributed to ensure reasonable load balancing.

jchelly commented 2 months ago

This is not ready to merge. I get a failed assert() on the small Colibre test,

jchelly commented 2 months ago

The problem was that in SubhaloSnapshot_t::UpdateMostBoundPosition we're looking for particles by ID but we don't know their type so we don't know if they should be found. I've set Type=TypeMax in this case to avoid triggering the check. This didn't show up previously because I was using the orphan_tracers branch so the tracers could never disappear.

jchelly commented 2 months ago

Now that the orphan_tracers branch is merged we always expect tracers to exist, so when the particle exchanger is called from UpdateMostBoundPosition we can assert that all particles should be found.

jchelly commented 2 months ago

On the small Colibre test this now gives almost identical output to master if I run on one thread and disable random sampling. If I look at the final output the number of subhalos is identical and all of their properties are identical except for the Rank (many differences), Depth (one subhalo differs) and SnapshotIndexOfLastIsolation (one subhalo differs).

I think maybe I've inadvertently changed the ordering of subhalos that have the same mass within their host halo.

jchelly commented 2 months ago

If I modify CompareMass_t to use the TrackId as a tie breaker when the masses are equal, then this branch and master produce output that is identical at snapshot 15 except for rounding error level differences in SpecificSelfPotentialEnergy for five halos.

jchelly commented 1 month ago

This now sorts subhalos within a halo by Vmax where they have the same bound mass. The vmax calculation has to be moved so that it happens before the sorting operation.

jchelly commented 1 month ago

Using Vmax instead of TrackId doesn't seem to be enough to make the ranks match between this and master, possibly due to orphans with Vmax=0. I've implemented Matthieu's suggestion of using the most bound particle ID where Vmax is equal.

jchelly commented 1 month ago

In this branch not all duplicate particles have been removed, so using the MostBoundParticleId isn't enough either! Some pairs of subhalos in the Colibre test have Mbound=0, Vmax=0, and equal most bound particle IDs!

jchelly commented 1 month ago

To test this I'm comparing output between two versions of the code:

This should test just the effect of distributing the snapshot particles differently. I didn't use master because it still has a buffer overflow until remove_duplicate_particles_in_mergers is merged.

jchelly commented 1 month ago

I left a pair of Colibre test runs going last night and they produced identical results apart from a bit of rounding error in SpecificSelfPotentialEnergy. So it appears that partitioning the particles differently between ranks has no effect on the output as long as we specify a definite ranking order within halos.

I'll take a look at the rest of your comments now.

jchelly commented 1 month ago

I'll check my test still works with these changes and then I think this is ready to merge.

VictorForouhar commented 1 month ago

Excellent, thanks! :D

jchelly commented 1 month ago

The test output matches and it completes with asserts enabled, so I'l merge this now.