Keep tracer during masking

robjmcgibbon commented 3 months ago

This change checks whether all particles have been removed during the masking step. If so then a tracer is added back in. I am not converting the subhalo to an orphan at this point, but since it will have only one tracer it will be converted to an orphan in subhalo_unbind anyway.

When masking is called the TracerIndex of a subhalo may be incorrect. I think this is due to gas particles which formed stars being removed by KickNullParticles during particle exchange. I therefore have to locate the first tracer as part of this change. If Victor's changes guarantee that TracerIndex is correct, then I can remove this step from my changes.

I have been able to run this on Flamingo L1000N0900/HYDRO_FIDUCIAL with the CHECK_TRACER_INDEX compile option enabled.

I'll need to remove some comments and run the formatter before merging.

jchelly commented 3 months ago

For central subhalos the particle list has been replaced with the FoF particles in FeedCentrals() before masking happens. I think that scrambles the ordering by binding energy. The code currently tries to keep the old most bound particle at the beginning of the array. Do we need to modify FeedCentrals() to put the selected tracer at the beginning and update the tracer index if possible?

VictorForouhar commented 3 months ago

Fixed that. What should we do if the previous tracer is no longer located in the FOF? Re-add or ignore? If the latter, we could be selecting any particle, as done prior to this fix. I am leaning towards re-adding it (but what if it is in a different FOF?)...

jchelly commented 3 months ago

I'm not sure. Could we have some simple fallback for picking a tracer in that case? E.g. pick a particle close to the centre of mass, and make it a tracer type if possible?

jchelly commented 3 months ago

Also, do we need to worry about setting a tracer in the new subhalos which get put in empty FoF groups in FeedCentrals()? I guess probably not because they can't have subhalos and so don't get any particles masked out, and the masking is the only thing that touches the tracer index before we Unbind() and reset the tracer.

jchelly commented 3 months ago

We should at least have TracerIndex default to an invalid value so that debug builds will definitely crash if we GetTracer() without having SetTracer().

VictorForouhar commented 3 months ago

I guess the most 'simple' fallback would be to test the next collisionless tracer we had in the subhalo.

jchelly commented 3 months ago

Yes, I was forgetting that we still have the ordered list of particles from the old subhalo at this point, so we can find the ID of the next tracer in the old list and check if it exists in the new list.

VictorForouhar commented 3 months ago

Rob was arguing against that, in the sense of what would happen if we don't find any of the 10 tracers we look for in the FOF.

jchelly commented 3 months ago

The subhalo got assigned to the FoF by AssignHosts checking which FoF the tracers belong to. Doesn't that guarantee that there are tracers in common between the subhalo and the FoF when we get to FeedCentrals()?

jchelly commented 3 months ago

Running this on L1000N0900/HYDRO_FIDUCIAL, this happens:

saving 1001996 subhalos to ./output//023
Conversion factor from SWIFT length units to 0.681 Mpc/h = 1
Conversion factor from SWIFT mass units to 6.81e+09 Msun/h = 1
Conversion factor from SWIFT velocity units to 1 km/s = 1
Null group ID is 2147483647
Number of ranks per node reading simultaneously is 128
1457994480 particles loaded at Snapshot 24(24)
1025534 groups loaded at snapshot 24(24)
terminate called after throwing an instance of 'std::runtime_error'
  what():  No tracer particle from previous snapshot found in FOF

So it looks like we are somehow getting centrals where none of the previous subhalo's tracers are present.

robjmcgibbon commented 3 months ago

Did that run include Victor's recent change?

jchelly commented 3 months ago

I ran it with the latest commit in this branch - e45798c . Is there another change I need?

robjmcgibbon commented 3 months ago

Sorry for not being clear. I meant Victor's FOF changes, but actually that shouldn't make a difference

jchelly commented 3 months ago

That's a good point. I should have merged this into the latest orphan_tracers branch to test it. I just cloned the repository and checked out this branch.

jchelly commented 3 months ago

I have the code stopped in ddt at the point where it throws the exception (subhalo_tracking.cpp line 555). At this point Host.Particles contains one element (a gas particle) and central.Particles has 42 elements. I think that means that the code has tried to assign the FoF particles to an orphan which was on a gas particle, so there were no tracers available.

VictorForouhar commented 3 months ago

I thought @robjmcgibbon had run a test on FLAMINGO and found no orphans with incorrect particle types... Did you also check for orphans with no tracer?

robjmcgibbon commented 3 months ago

The files /snap8/scratch/dp004/dc-mcgi1/hbt_KeepTracerDuringMasking/keep_tracer_during_masking_clean/ don't have any tracerless orphans. Tomorrow I'll have a look at the changes since then to see what might be different.

jchelly commented 3 months ago

Two differences I can see between the parameters of Rob's run and mine: I have MinNumTracerPartOfSub=20 instead of 10, and I'm reading the snapshots from the /snapshots directory instead of /fof_snapshots.

jchelly commented 3 months ago

Maybe the problem is I'm running on an input catalogue where a FoF group can have no tracers? Not sure if that's the case but I'll try again using the fof_snapshot directory.

VictorForouhar commented 3 months ago

We should definitely use the updated FOFs. However, I think Rob and I were ~ convinced that that would not directly solve the problem. To have an orphan, you'd need to form an object with a minimum number of tracers (20 in your case). This means we should have a collisionless tracer if we found it to be in the FOF group in question.

jchelly commented 3 months ago

That makes sense. We would only get a crash if the code somehow converts FoFs which have never been bound into orphans.

Something that's not clear to me: how does the code know not to make an orphan from a new FoF which is immediately determined to be unbound (e.g. because of having no tracers)? FeedCentrals() sets Nbound to the number of particles in the FoF group so Nbound>1 doesn't necessarily mean it used to be a bound subhalo.

jchelly commented 3 months ago

RegisterNewTracks() looks like it should dispose of any spurious orphans.

VictorForouhar commented 3 months ago

That is what I was going to say; it should not be a problem on that side.

VictorForouhar commented 3 months ago

Have you run it with the new fof catalogues?

jchelly commented 3 months ago

It's running now. I submitted the job last night but with too short a time limit.

VictorForouhar commented 3 months ago

Has it at least reached the troublesome snapshot? Do you have the path to the catalogue version where the code crashed?

jchelly commented 3 months ago

No, it's at snapshot 12 and it needs to get to 24 before we know if it crashes. The run is here: /cosma8/data/dp004/jch/HBT-tests/keep_tracer_during_masking/L1000N0900/HYDRO_FIDUCIAL

jchelly commented 3 months ago

It has now passed the point where it crashed before and it's still running.

jchelly commented 3 months ago

But then it did crash on snapshot 26 with the same error.

VictorForouhar commented 3 months ago

Oh, no! What about using the same version (f9cfaa38457943c61653aa2dba9fa9c4b49c33ef) Rob used to validate his changes?

robjmcgibbon commented 3 months ago

I'm running a job with the original version that worked, and another with an up-to-date version with a potential fix

robjmcgibbon commented 3 months ago

When saving a tracer to add back after masking I was using the first particle in the list of particles. This was done because in FeedCentrals we move the most bound tracer to the first position in the list. However, if there are satellites that have all their tracers masked out by their subsubhalos, then we shouldn't be adding back the first particle from the list, since there is no guarantee it is a tracer.

I am now using GetTracerIndex when picking the particle to save. This means I have to skip masking of newly created centrals, since their TracerIndex is equal to -1, and so calling GetTracerIndex would throw an error.

Output of the run that finished is at /snap8/scratch/dp004/dc-mcgi1/hbt_KeepTracerDuringMasking/UseTracerIndex

robjmcgibbon commented 3 months ago

In FeedCentrals I was using GetTracerIndex as the starting point to find the most bound tracer. However, I don't think TracerIndex is guaranteed to be correct at this point in the code, since we call KickNullParticles shortly before FeedCentrals. This also means that I think we need to recalculate the TracerIndex for satellites in the masking step.

I have started a test run with a fix for this problem. It will be output to /snap8/scratch/dp004/dc-mcgi1/hbt_KeepTracerDuringMasking/FindTracer

jchelly commented 3 months ago

The KickNullParticles problem should have triggered the check in GetTracer(). Is it just rare enough that we haven't hit it in test runs?

jchelly commented 3 months ago

Looking at the code again, I added a call to CountParticleTypes() to recompute TracerIndex in KickNullParticles so that's probably preventing the crash.

jchelly commented 3 months ago

Hopefully the other fix (not assuming the tracer is the first particle) will prevent the crash I was seeing. It does sound like it could have produced orphans on the wrong particle type.

robjmcgibbon commented 3 months ago

Ok makes sense. I'll revert 24e92cc then if it's not needed

jchelly commented 3 months ago

The crash I had seems to be fixed. My L1000N0900/HYDRO_FIDUCIAL run is up to snapshot 43 now. This is with commit 24e92cc197bddb26593eebf1618aea7bd70b844a .

SWIFTSIM / HBTplus

Keep tracer during masking #20