celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/

Extend debug utilities for stuck/errored tracks #1451

Closed sethrj closed 1 month ago

sethrj commented 1 month ago

While debugging a hang in ATLAS, we tried the debug_print method documented in our manual to see track information, but the symbol wasn't available due to use of LTO and/or -Wl,--exclude-libs,ALL and/or static libs. This PR adds a secret environment-based call, placed in the most sensible location (KernelContextException), to ensure that the symbol stays in the library.
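As a rough illustration of the idea (not the code in this PR), the sketch below shows how a runtime-guarded call gives the toolchain a reason to keep a debug helper that would otherwise have no callers. `ToyTrack`, `report_failure`, and the `TOY_DEBUG_PRINT` variable are all hypothetical names invented for this example. Whether such a reference is enough to preserve the symbol still depends on the compiler, LTO settings, and symbol visibility.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for a track view; not a Celeritas type.
struct ToyTrack
{
    int track_id;
    double energy_mev;
};

// Debug helper intended to be called by hand from a debugger. With no caller
// anywhere in the program, LTO or a static link may discard it.
void debug_print(ToyTrack const& t)
{
    assert(t.track_id >= 0);
    std::clog << "track " << t.track_id << ": " << t.energy_mev << " MeV\n";
}

// Stand-in for a code path that is always linked in (the PR uses
// KernelContextException). The environment variable name is made up.
// Returns whether the debug output was requested.
bool report_failure(ToyTrack const& t)
{
    char const* flag = std::getenv("TOY_DEBUG_PRINT");
    bool const enabled = flag != nullptr && std::string(flag) == "1";
    if (enabled)
    {
        // This call is the reference that keeps debug_print reachable.
        debug_print(t);
    }
    return enabled;
}
```

Because the guard is evaluated at run time, the compiler cannot prove the call is dead, and normal runs pay only for one `getenv` lookup on an error path.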

This also adds a kill_active method to the Stepper that can be called in a debugger or in between steps. It marks all tracks as "errored" and will apply a tracking cut at the next step, which (on CPU) also prints diagnostic information about the track.
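The two-phase behavior described above (flag now, cut at the next step) can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyStepper`, `ToySlot`, and `TrackStatus` are hypothetical and do not mirror the real Stepper interface or its return types.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class TrackStatus
{
    inactive,
    alive,
    errored
};

// Hypothetical track slot; the real track state is far richer.
struct ToySlot
{
    TrackStatus status;
    double energy_mev;
};

class ToyStepper
{
  public:
    explicit ToyStepper(std::vector<ToySlot> slots) : slots_(std::move(slots))
    {
    }

    // Flag every live track as errored. Nothing is removed yet, so this is
    // safe to call between steps or from a debugger.
    std::size_t kill_active()
    {
        std::size_t count = 0;
        for (ToySlot& s : slots_)
        {
            if (s.status == TrackStatus::alive)
            {
                s.status = TrackStatus::errored;
                ++count;
            }
        }
        return count;
    }

    // One step: errored tracks receive a "tracking cut", meaning their
    // energy is tallied as lost and the slot is freed.
    double step()
    {
        double lost_mev = 0;
        for (ToySlot& s : slots_)
        {
            if (s.status == TrackStatus::errored)
            {
                lost_mev += s.energy_mev;
                s.energy_mev = 0;
                s.status = TrackStatus::inactive;
            }
        }
        return lost_mev;
    }

    std::size_t num_alive() const
    {
        std::size_t count = 0;
        for (ToySlot const& s : slots_)
        {
            count += (s.status == TrackStatus::alive) ? 1 : 0;
        }
        return count;
    }

  private:
    std::vector<ToySlot> slots_;
};
```

Deferring the cut to the next step keeps the kill request itself trivial, and the normal error-handling path then does the bookkeeping and diagnostic printing.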

A new SetupOptions::geometry_output_file can be used to export the in-memory geometry alongside the physics and primaries to aid in reproducing errors.

Finally, this greatly enhances the diagnostic output from errored/cut tracks, which now print a JSON dump of the track with names substituted via core params metadata. This is accomplished by hiding a host-only "observer pointer" to CoreParams inside the core params data. Because that pointer is only usable on the host and only needed inside kernels, people are unlikely to abuse it.

Killing track {"geo":{"dir":[1.0,0.0,0.0],"is_on_boundary":true,"is_outside":false,"pos":[[-5.0,0.0,0.0],"cm"],"volume_id":"inner@0x0"},"particle":{"energy":[[0.0,"MeV"],"MeV"],"particle_id":"gamma"},"sim":{"event_id":0,"num_steps":1,"parent_id":-1,"post_step_action":"tracking-cut","status":"errored","step_length":[17.0,"cm"],"time":[0.25,"s"],"track_id":1},"thread_id":7,"track_slot_id":7}: lost 100 MeV
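The name substitution seen in the dump above (for example "gamma" rather than a numeric particle ID) can be sketched as follows. This is a simplified stand-in under stated assumptions: `ToyParams`, `ToyParamsData`, `ToyTrackState`, `label_or_id`, and `to_json` are hypothetical, the JSON is built by hand rather than with a JSON library, and labels are assumed not to need escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical host-side metadata owner (stand-in for CoreParams).
struct ToyParams
{
    std::vector<std::string> particle_names;
    std::vector<std::string> action_names;
};

// Hypothetical data bundle handed to kernels. The pointer is non-owning
// ("observer") and only meaningful on the host; it may be null.
struct ToyParamsData
{
    ToyParams const* host_params;
};

struct ToyTrackState
{
    int particle_id;
    int post_step_action;
    int track_id;
};

// Emit a quoted label when metadata is available, else the raw numeric ID.
std::string label_or_id(std::vector<std::string> const* names, int id)
{
    if (names && id >= 0 && static_cast<std::size_t>(id) < names->size())
    {
        return '"' + (*names)[id] + '"';
    }
    return std::to_string(id);
}

std::string to_json(ToyParamsData const& data, ToyTrackState const& t)
{
    ToyParams const* p = data.host_params;
    std::ostringstream os;
    os << "{\"particle_id\":"
       << label_or_id(p ? &p->particle_names : nullptr, t.particle_id)
       << ",\"post_step_action\":"
       << label_or_id(p ? &p->action_names : nullptr, t.post_step_action)
       << ",\"track_id\":" << t.track_id << "}";
    return os.str();
}
```

Falling back to numeric IDs when the pointer is null keeps the diagnostic usable in contexts where no host metadata is attached.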

1144

github-actions[bot] commented 1 month ago

Test summary

3 300 files, 5 097 suites, 3m 30s
1 536 tests: 1 508 passed, 28 skipped, 0 failed
17 012 runs: 16 949 passed, 63 skipped, 0 failed

Results for commit 9fafc143.


drbenmorgan commented 1 month ago

This looks o.k. to me modulo the tests passing!