celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Debug assertion failure: FieldDriver.hh:264: `succeeded` #619

Closed amandalund closed 1 year ago

amandalund commented 1 year ago

Failure:

    "failure": {
     "stderr": [
      "terminate called after throwing an instance of 'celeritas::RuntimeError'",
      "  what():  /home/alund/celeritas_project/celeritas/src/corecel/data/DeviceAllocation.cc:61:",
      "celeritas: CUDA error: cudaFree(ptr) failed:",
      "[PID=44534, TID=-1][ 7/15]> /soft/compilers/gcc/11.1.0/x86_64-suse-linux/lib64/libgcc_s.so.1(_Unwind_RaiseException+0x311) [0x7f9071f9a4d1]"
     ],
     "stdout": [
      "/home/alund/celeritas_project/celeritas/src/celeritas/field/FieldDriver.hh:264:",
      "celeritas: internal assertion failed: succeeded"
     ]
    }

Regression problem:

cms2018+field+msc-vecgeom-gpu

JSON input
    {
     "_geometry": "vecgeom",
     "_instance": 1,
     "_name": [
      "cms2018+field+msc",
      "vecgeom",
      "gpu"
     ],
     "_num_events": 7,
     "_num_primaries": 9100,
     "_outdir": "cms2018+field+msc-vecgeom-gpu",
     "_timeout": 600.0,
     "brem_combined": false,
     "enable_diagnostics": false,
     "geant_options": {
      "coulomb_scattering": false,
      "eloss_fluctuation": false,
      "em_bins_per_decade": 56,
      "lpm": true,
      "msc": "urban",
      "physics": "em_basic",
      "rayleigh_scattering": true
     },
     "geometry_filename": "/home/alund/celeritas_project/regression/input/cms2018.gdml",
     "hepmc3_filename": "/home/alund/celeritas_project/regression/input/simple-cms-13TeV.hepmc3",
     "initializer_capacity": 67108864,
     "mag_field": [
      0.0,
      0.0,
      1.0
     ],
     "max_events": 7,
     "max_num_tracks": 1048576,
     "max_steps": 2048,
     "physics_filename": "/home/alund/celeritas_project/regression/input/cms2018.gdml",
     "secondary_stack_factor": 3.0,
     "seed": 20220905,
     "sync": true,
     "use_device": true
    }

Diagnosis:

max_nsteps = 100 in the field driver is not always sufficient: occasionally 2-3x more iterations are required to converge (e.g. 311, 124, 109, 108, 312, 133, 307, 148, 259, 159, 234, 174, 139, 116, 112, ...).
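
The failure mode can be pictured with a toy loop. This is only a sketch and not the Celeritas field driver code: the hypothetical `advance` function takes fixed-size substeps (the real driver chooses substeps from an error estimate) and gives up after `max_nsteps` iterations. The 3e-4 cm substep length is an assumed, illustrative value.

```python
import math


def advance(step, substep, max_nsteps=100):
    """Cover `step` in chunks of at most `substep`, capped at `max_nsteps`.

    Toy model only: returns (succeeded, iterations, distance_covered).
    """
    covered = 0.0
    nit = 0
    while covered < step and nit < max_nsteps:
        covered += min(substep, step - covered)
        nit += 1
    # Mirrors the spirit of the `succeeded` assertion: did we cover the step?
    succeeded = math.isclose(covered, step)
    return succeeded, nit, covered


# Assumed values: a 0.093 cm step and 3e-4 cm substeps (illustrative only).
ok_default, nit_default, _ = advance(0.093, 3e-4, max_nsteps=100)
ok_raised, nit_raised, _ = advance(0.093, 3e-4, max_nsteps=1000)
print(ok_default, nit_default)  # the cap is hit before the step is covered
print(ok_raised, nit_raised)    # with a higher cap the loop finishes
```

With the default cap the toy loop stops after 100 iterations with most of the step still uncovered, which is the situation the assertion flags.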

Note:

After increasing max_nsteps, the problem crashes with a Warp Out-of-range Address error.

Backtrace
#0  0x00000000052d5080 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#1  0x00000000052d5240 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#2  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#3  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#4  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#5  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#6  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#7  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#8  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#9  0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#10 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#11 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#12 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#13 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#14 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#15 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#16 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#17 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#18 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#19 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#20 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#21 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#22 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#23 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#24 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
om::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#26 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#27 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#28 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#29 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#30 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#31 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#32 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#33 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#34 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#35 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#36 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#37 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#38 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#39 0x00000000052bff40 in vecgeom::cuda::CommonSpecializedVolImplHelper, -1, -1>::PlacedDistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#40 0x00000000052a9e00 in vecgeom::cuda::PlacedVolumeImplHelper, vecgeom::cuda::VPlacedVolume>::DistanceToOut(vecgeom::cuda::Vector3D const&, vecgeom::cuda::Vector3D const&, double) const ()
#41 0x00000000045bd890 in celeritas::FieldPropagator > > >::operator()(double) ()
#42 0x00000000045e5750 in __nv_static_50__29768779_28_AlongStepUniformMscAction_cu_c0b7bc1d__ZN9celeritas61_GLOBAL__N__29768779_28_AlongStepUniformMscAction_cu_c0b7bc1d29along_step_uniform_msc_kernelENS_7CoreRefILNS_8MemSpaceE1EEENS_12UrbanMscDataILNS_9OwnershipE2ELS2_1EEENS_18UniformFieldParamsE<<<(4096,1,1),(256,1,1)>>> ()

mrguilima commented 1 year ago

What is the track position and direction at this crash?

whokion commented 1 year ago

All of the failing cases, where the number of iterations (nit) exceeds the default FieldDriverOptions::max_nsteps (100), occur when the energy of the input particle is very small, O(eV), and the step length is relatively large, O(mm). Within the given relative tolerance, one good step (one that can advance the track) is then limited to a small delta step, O(0.01 mm), so many sub-steps are required to integrate along the given step length. See:

    nit   step (cm)   energy (MeV)
    311   0.0930969   3.35963e-07
    124   0.0315987   3.35963e-07
    109   0.0273674   3.35963e-07
    108   0.0270646   3.35963e-07
    312   0.0937436   3.35963e-07
    133   0.0342916   3.35963e-07
    307   0.0938099   3.48505e-07
    148   0.0393166   3.48505e-07
    259   0.0944304   5.31415e-07
    159   0.0528794   5.31415e-07
    234   0.0946951   6.81869e-07
    174   0.0951561   1.35744e-06
    139   0.0953275   2.26955e-06
    116   0.0953957   3.41271e-06
    112   0.0954019   3.63209e-06
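
A rough gyroradius estimate helps put these counts in context. The sketch below assumes the track is an electron and that `mag_field` is given in tesla; neither is stated in this thread. It uses the standard relation r = p / (0.2998 B), with p in GeV/c, B in T, and r in m, for a unit charge moving perpendicular to the field. The helper name `gyroradius_cm` is hypothetical.

```python
import math

ELECTRON_MASS_MEV = 0.51099895  # electron rest energy (MeV)
C_FACTOR = 0.299792458          # GeV/c per (T * m) for a unit charge


def gyroradius_cm(kinetic_mev, b_tesla, mass_mev=ELECTRON_MASS_MEV):
    """Radius of curvature (cm) for a unit charge moving perpendicular to B."""
    # Relativistic momentum from kinetic energy: p^2 = T^2 + 2 T m
    p_mev = math.sqrt(kinetic_mev**2 + 2.0 * kinetic_mev * mass_mev)
    radius_m = (p_mev * 1e-3) / (C_FACTOR * b_tesla)
    return radius_m * 100.0


# (nit, step in cm, energy in MeV): two rows copied from the table above
rows = [(311, 0.0930969, 3.35963e-07), (116, 0.0953957, 3.41271e-06)]
for nit, step, energy in rows:
    r = gyroradius_cm(energy, b_tesla=1.0)  # assumes a 1 T field
    print(f"nit={nit} radius={r:.2e} cm "
          f"step/radius={step / r:.0f} step/nit={step / nit:.2e} cm")
```

Under these assumptions the radius comes out on the order of 1e-4 cm, far smaller than the roughly 0.09 cm steps, and comparable to the average distance covered per iteration (step/nit).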

So the question is why these charged particles with very small energies survived the previous step at all: a particle with energy below eloss_calc_limit (1 keV) deposits all of its energy locally and should be stopped.

amandalund commented 1 year ago

Right now that cutoff is applied before the energy loss, so it's possible to end up with charged particles with very small post-step energies.
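
A toy version of that ordering, written as a sketch rather than the actual Celeritas along-step code: the hypothetical `cut_before_loss` tests the cutoff against the pre-step energy and then subtracts the loss, while `cut_after_loss` (one possible alternative, not a confirmed fix) tests the post-step energy. The 1 keV limit is the eloss_calc_limit value quoted above; the energy and mean loss are made-up numbers.

```python
ELOSS_CALC_LIMIT = 1e-3  # MeV (1 keV, as quoted in this thread)


def cut_before_loss(energy, mean_loss, limit=ELOSS_CALC_LIMIT):
    """Return (post_step_energy, deposited); cutoff tested on pre-step energy."""
    if energy < limit:
        return 0.0, energy  # stop the track and deposit everything locally
    loss = min(mean_loss, energy)
    return energy - loss, loss


def cut_after_loss(energy, mean_loss, limit=ELOSS_CALC_LIMIT):
    """Return (post_step_energy, deposited); cutoff tested on post-step energy."""
    loss = min(mean_loss, energy)
    remaining = energy - loss
    if remaining < limit:
        return 0.0, energy  # remaining energy is deposited locally as well
    return remaining, loss


# Made-up example: the mean loss is just below the pre-step energy (MeV).
energy, mean_loss = 2.0e-3, 1.9999e-3
print(cut_before_loss(energy, mean_loss))  # leaves a sub-eV track alive
print(cut_after_loss(energy, mean_loss))   # stops the track instead
```

In the first variant the track passes the check at 2 keV and leaves the step with a tiny but nonzero energy, which is the kind of track that later needs hundreds of field sub-steps.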

mrguilima commented 1 year ago

This energy cutoff will probably fix issue #616 as well.

whokion commented 1 year ago

So how do we treat the case where the calculated mean energy loss is very close to the particle energy? This matters not only for avoiding very small particle energies (to stay self-consistent with eloss_calc_limit, which serves as a rough approximate limit on the fidelity of the eloss calculation), but also for applying the eloss fluctuation in a meaningful way.

whokion commented 1 year ago

Propose to close this issue with a summary of investigation and potential solutions or optimizations:

sethrj commented 1 year ago

@whokion Excellent discussion points. Perhaps to close this out: were you able to replicate the assertion failure in a unit test? If so, let's do two little tests. First, crank nsteps up arbitrarily high and test the post-step position and momentum. Then remove the assertion, reduce nsteps to a smaller value, and compare the positional error. Alternatively, we could make this a test in the propagator where we do a bunch of propagations until some path length is reached, and compare the accumulated position/momentum error there.

I think removing the assertion is the expedient solution, but later we'll definitely want some sort of tracking cut to prevent this from occurring in the first place.
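
The kind of comparison proposed above can be sketched outside Celeritas with a uniform field along +z, where the exact trajectory is a circle. This is a minimal illustration under stated assumptions, not the Celeritas propagator or its unit-test harness: it uses fixed RK4 substeps in place of the adaptive driver, and the 0.01 cm gyroradius and 0.09 cm step are made-up values. All function names are hypothetical.

```python
import math


def deriv(state, kappa):
    """Right-hand side for motion perpendicular to a uniform field along +z.

    state = (x, y, ux, uy); kappa = 1 / gyroradius (sign set by the charge).
    """
    _, _, ux, uy = state
    return (ux, uy, kappa * uy, -kappa * ux)


def rk4_propagate(step, kappa, nsteps):
    """Integrate over `step` with `nsteps` equal RK4 substeps from the origin."""
    state = (0.0, 0.0, 1.0, 0.0)
    h = step / nsteps
    for _ in range(nsteps):
        k1 = deriv(state, kappa)
        k2 = deriv(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), kappa)
        k3 = deriv(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), kappa)
        k4 = deriv(tuple(s + h * k for s, k in zip(state, k3)), kappa)
        state = tuple(
            s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state


def exact_position(step, kappa):
    """Analytic circle for the same initial condition."""
    return (math.sin(kappa * step) / kappa,
            (math.cos(kappa * step) - 1.0) / kappa)


def position_error(step, kappa, nsteps):
    """Distance between the RK4 end point and the analytic end point."""
    x, y, _, _ = rk4_propagate(step, kappa, nsteps)
    ex, ey = exact_position(step, kappa)
    return math.hypot(x - ex, y - ey)


# Assumed values: 0.01 cm gyroradius and a 0.09 cm step (illustrative only).
kappa, step = 1.0 / 0.01, 0.09
for nsteps in (10, 100, 1000):
    print(nsteps, position_error(step, kappa, nsteps))
```

The run with many substeps plays the role of the reference, and the runs with fewer substeps show how the positional error grows as the substep budget shrinks.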