Open krasznaa opened 6 months ago
No idea. Is this right value to put?
Maximum branches per step: 4294967295
(The name should be maximum branches per seed.) I would set this to 10 and try again.
Note this is $2^{32}- 1$, so most likely an unsigned integer underflow.
Yes, I worry that the CKF is counting branches forever with the maximum branches per seed set to that maximum value.
No. That's really just the default value in our code.
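For illustration (a standalone sketch, not the actual traccc configuration code; the member name below is made up): the printed number is simply the largest value an unsigned 32-bit integer can hold, i.e. an "effectively unlimited" default rather than an overflow.

#include <cstdio>
#include <limits>

int main() {
    // 4294967295 == std::numeric_limits<unsigned int>::max(); a configuration
    // default set to this value prints exactly the number quoted above.
    // max_num_branches_per_seed is an illustrative name, not the real field.
    constexpr unsigned int max_num_branches_per_seed =
        std::numeric_limits<unsigned int>::max();
    static_assert(max_num_branches_per_seed == 4294967295u);
    std::printf("Maximum branches per seed: %u\n", max_num_branches_per_seed);
    return 0;
}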
This was a good suggestion, but apparently that's not where the code gets stuck. :frowning:
[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10
Running Full Tracking Chain Using CUDA
>>> Detector Options <<<
Detector file : geometries/odd/odd-detray_geometry_detray.json
Material file :
Surface grid file : geometries/odd/odd-detray_surface_grids_detray.json
Use detray::detector: yes
Digitization file : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
Input data format : csv
Input directory : odd/geant4_ttbar_mu200/
Number of input events : 1
Number of input events to skip: 2
>>> Clusterization Options <<<
Target cells per partition: 1024
>>> Track Seeding Options <<<
None
>>> Track Finding Options <<<
Track candidates range : 3:100
Minimum step length for the next surface: 0.5 [mm]
Maximum step counts for the next surface: 100
Maximum Chi2 : 30
Maximum branches per step: 10
Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
Constraint step size : 3.40282e+38 [mm]
Overstep tolerance : -100 [um]
Minimum mask tolerance: 1e-05 [mm]
Maximum mask tolerance: 1 [mm]
Search window : 0 x 0
Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
Run performance checks: no
>>> Accelerator Options <<<
Compare with CPU results: no
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
Then it might be the KF where the propagation gets stuck. But you really need to let us know where it gets stuck. That is very easily done by putting a "Hello" printout in front of each algorithm.
:stuck_out_tongue: Let's not get quite that basic just yet.
It is indeed the fitting. I can at least attach a normal GDB to the process, even if cuda-gdb is a bit more tricky.
(gdb) attach 192332
Attaching to process 192332
[New LWP 192333]
[New LWP 192337]
[New LWP 192338]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000077d8f1e67ba9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x000077d8f1e67ba9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1 0x000077d8f1d80864 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2 0x000077d8f1e4927a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3 0x000077d8f20b80a9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x000077d8f1eee57d in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5 0x000077d8f3814105 in ?? () from /home/krasznaa/software/nvidia/cuda-12.4.1/x86_64/lib64/libcudart.so.12
#6 0x000077d8f3875018 in cudaStreamSynchronize () from /home/krasznaa/software/nvidia/cuda-12.4.1/x86_64/lib64/libcudart.so.12
#7 0x000077d8f4819d1d in traccc::cuda::stream::synchronize (this=<optimized out>)
at /data/ssd-1tb/projects/traccc/traccc/device/cuda/src/utils/stream.cpp:62
#8 0x000077d8f482e5cb in _ZNK6traccc4cuda17fitting_algorithmINS_13kalman_fitterIN6detray10rk_stepperIN6covfie10field_viewINS5_7backend8constantINS5_6vector8vector_dIfLm3EEESB_EEEENS3_5cmathIfEENS3_16constrained_stepINS3_6darrayEEENS3_17stepper_rk_policyENS3_8stepping14void_inspectorESH_EENS3_9navigatorIKNS3_8detectorINS3_16default_metadataENS3_15container_typesIN6vecmem13device_vectorENS3_5tupleESH_NSR_20jagged_device_vectorENS3_4dmapEEEEENS3_10navigation14void_inspectorENS3_14intersection2DINS3_18surface_descriptorINS3_6detail11typed_indexINSP_8mask_idsEjjLj4026531840ELj268435455EEENS14_INSP_12material_idsEjjLj4026531840ELj268435455EEEjtEESF_EEEEEEEclERKNS13_18dmulti_view_helperILb1EJNSR_4data11vector_viewINS3_17volume_descriptorINSP_11geo_objectsENS14_INSP_9accel_idsEjjLj4026531840ELj268435455EEES18_EEEENS1G_INS3_11source_linkIS19_EEEENS1G_IN7algebra5cmath10transform3INS1R_6matrix5actorImNS1Q_6matrix10array_typeENS1V_11matrix_typeEfNS1T_11determinant5actorImS1X_fJNS1Y_17partial_pivot_ludImS1X_fNS1R_14element_getterImS1W_fEEJEEENS1Y_10hard_codedImS1X_fS22_JLm2ELm4EEEEEEENS1T_7inverse5actorImS1X_fJNS27_17partial_pivot_ludImS1X_fS22_JEEENS27_10hard_codedImS1X_fS22_JLm2ELm4EEEEEEES22_NS1R_12block_getterImS1W_fEEEEEEEENS1E_ILb1EJNS1G_INS3_4maskINS3_11rectangle2DEtSF_SH_EEEENS1G_INS2J_INS3_11trapezoid2DEtSF_SH_EEEENS1G_INS2J_INS3_9annulus2DEtSF_SH_EEEENS1G_INS2J_INS3_10cylinder2DEtSF_SH_EEEENS1G_INS2J_INS3_21concentric_cylinder2DEtSF_SH_EEEENS1G_INS2J_INS3_6ring2DEtSF_SH_EEEENS1G_INS2J_INS3_4lineILb0EEEtSF_SH_EEEENS1G_INS2J_INS32_ILb1EEEtSF_SH_EEEEEEENS1E_ILb1EJNS1E_ILb1EJNS1G_IjEENS1G_INS3_4bins6singleINS3_13material_slabIfEEEEEENS1G_ISt5arrayIjLm2EEEENS1G_IfEEEEES3L_S3L_S3L_NS1G_IS3E_EENS1G_INS3_12material_rodIfEEEENS1G_INS3_8materialIfSt5ratioILl1ELl1EEEEEES3L_S3L_EEENS1E_ILb1EJNS1E_ILb1EJS3A_NS1G_IS19_EEEEENS1E_ILb1EJS3A_NS1E_ILb1EJNS1G_INS3B_13dynamic_arrayIS19_E4dataEEES3W_EEES3J_S3K_EEES43_S43_S43_EEENS1E_ILb1EJNS1G_INS3C_IjEEEENS1E_ILb1EJS3J_S3K_EEEEEEEEERKSD_RKNS1F_18jagged_vector_viewIS1A_EERKNS_14container_viewIKNS3_22bound_track_parametersISF_EEKNS_11measurementEEE (this=0x7fff49da5790, det_view=..., field_view=..., navigation_buffer=...,
track_candidates_view=...) at /data/ssd-1tb/projects/traccc/traccc/device/cuda/src/fitting/fitting_algorithm.cu:97
#9 0x0000000000451445 in seq_run (detector_opts=..., input_opts=..., clusterization_opts=..., seeding_opts=...,
finding_opts=..., propagation_opts=..., performance_opts=..., accelerator_opts=...)
at /data/ssd-1tb/projects/traccc/traccc/examples/run/cuda/seq_example_cuda.cpp:382
#10 0x000000000045516e in main (argc=<optimized out>, argv=<optimized out>)
at /data/ssd-1tb/projects/traccc/traccc/examples/run/cuda/seq_example_cuda.cpp:560
(gdb)
Does this only happen for CUDA, or also for CPU?
Fair question. But the host version does run through.
[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10
Running Full Tracking Chain on the Host
>>> Detector Options <<<
Detector file : geometries/odd/odd-detray_geometry_detray.json
Material file :
Surface grid file : geometries/odd/odd-detray_surface_grids_detray.json
Use detray::detector: yes
Digitization file : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
Input data format : csv
Input directory : odd/geant4_ttbar_mu200/
Number of input events : 1
Number of input events to skip: 2
>>> Clusterization Options <<<
Target cells per partition: 1024
>>> Track Seeding Options <<<
None
>>> Track Finding Options <<<
Track candidates range : 3:100
Minimum step length for the next surface: 0.5 [mm]
Maximum step counts for the next surface: 100
Maximum Chi2 : 30
Maximum branches per step: 10
Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
Constraint step size : 3.40282e+38 [mm]
Overstep tolerance : -100 [um]
Minimum mask tolerance: 1e-05 [mm]
Maximum mask tolerance: 1 [mm]
Search window : 0 x 0
Runge-Kutta tolerance : 0.0001
>>> Track Ambiguity Resolution Options <<<
Run ambiguity resolution : yes
>>> Performance Measurement Options <<<
Run performance checks: no
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
==> Statistics ...
- read 334026 cells from 16709 modules
- created 92925 measurements.
- created 92925 space points.
- created 15892 seeds
- found 21332 tracks
- fitted 21332 tracks
- resolved 4089 tracks
==> Elapsed times...
Read cells 560 ms
Clusterization 14 ms
Spacepoint formation 4 ms
Seeding 283 ms
Track params estimation 2 ms
Track finding 2144 ms
Track fitting 499 ms
Track ambiguity resolution 5065 ms
Wall time 8575 ms
[bash][Legolas]:traccc >
The KF is supposed to reproduce mathematically identical tracks to the ones found by the CKF, at least that is what I intended. Now I fear that this could be a bug in the CKF that builds patterns wrongly, and the KF suffers from those wrongly built patterns.
Oh, maybe the KF is buggy now, not following the changes from #556 and #529 accordingly.
Also, please make this specific data available somewhere so I can test and debug it myself.
Let's start with the important part first. I've put the ODD ttbar files here yesterday: https://cernbox.cern.ch/s/aLswvi2pNcBX9wr Just make sure that you have ~100 GB free space if you want to download it. :frowning: (~24 GB for the TGZ, and ~65 GB for the uncompressed files.) I could upload just the one problematic event as well if you'd like. :thinking:
At the same time: The plot thickens. After building the code with #568 locally fixed, in Debug mode, the bloody thing runs through. :open_mouth:
[bash][Legolas]:traccc > ./out/build/legolas/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10
Running Full Tracking Chain Using CUDA
>>> Detector Options <<<
Detector file : geometries/odd/odd-detray_geometry_detray.json
Material file :
Surface grid file : geometries/odd/odd-detray_surface_grids_detray.json
Use detray::detector: yes
Digitization file : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
Input data format : csv
Input directory : odd/geant4_ttbar_mu200/
Number of input events : 1
Number of input events to skip: 2
>>> Clusterization Options <<<
Target cells per partition: 1024
>>> Track Seeding Options <<<
None
>>> Track Finding Options <<<
Track candidates range : 3:100
Minimum step length for the next surface: 0.5 [mm]
Maximum step counts for the next surface: 100
Maximum Chi2 : 30
Maximum branches per step: 10
Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
Constraint step size : 3.40282e+38 [mm]
Overstep tolerance : -100 [um]
Minimum mask tolerance: 1e-05 [mm]
Maximum mask tolerance: 1 [mm]
Search window : 0 x 0
Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
Run performance checks: no
>>> Accelerator Options <<<
Compare with CPU results: no
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
==> Statistics ...
- read 334026 cells from 16709 modules
- created (cpu) 0 measurements
- created (cuda) 92925 measurements
- created (cpu) 0 spacepoints
- created (cuda) 92925 spacepoints
- created (cpu) 0 seeds
- created (cuda) 15892 seeds
- found (cpu) 0 tracks
- found (cuda) 21309 tracks
- fitted (cpu) 0 tracks
- fitted (cuda) 21309 tracks
==>Elapsed times...
File reading (cpu) 1413 ms
Clusterization (cuda) 94 ms
Spacepoint formation (cuda) 2 ms
Seeding (cuda) 146 ms
Track params (cuda) 5 ms
Track finding (cuda) 32652 ms
Track fitting (cuda) 11253 ms
Wall time 45587 ms
[bash][Legolas]:traccc >
I'll do some further testing next, but it very much seems to be some sort of a race condition, which the slower-running debug binary manages to avoid. :thinking:
Unfortunately I'm none the wiser, now that I looked at the code for a bit.
I mean, the propagation code of course can get into an endless loop relatively easily.
So I even tried to "activate" the aborter that we have in traccc's fitting code, and added one assertion in the only place where a memory error could relatively easily happen.
diff --git a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
index 898e6ef8..565a27ce 100644
--- a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
+++ b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
@@ -106,7 +106,7 @@ class kalman_fitter {
}
/// Individual actor states
- typename aborter::state m_aborter_state{};
+ typename aborter::state m_aborter_state{detray::unit<scalar_type>::m};
typename transporter::state m_transporter_state{};
typename interactor::state m_interactor_state{};
typename fit_actor::state m_fit_actor_state;
@@ -229,6 +229,7 @@ class kalman_fitter {
auto& track_states = fitter_state.m_fit_actor_state.m_track_states;
// Fit parameter = smoothed track parameter at the first surface
+ assert(!track_states.empty());
fit_res.fit_params = track_states[0].smoothed();
for (const auto& trk_state : track_states) {
But they made no difference for the code. It still goes into an endless loop in optimized mode, and finishes in debug mode. :confused:
Though at least I learned about nvtop along the way. :stuck_out_tongue:
Have you tried slapping a print statement on this loop to see if it loops forever? Of course if it really is a race condition the additional timing might avoid it, but could be worth a try.
Aborting with a limited path length won't help when the track is oscillating around the same surface. (I tried the same thing in the CKF and it was a no-no.)
I suggest killing any track that takes more than a certain number of steps: https://github.com/acts-project/traccc/blob/main/core/include/traccc/finding/ckf_aborter.hpp. You can put this condition into kalman_actor,
or make a dedicated aborter yourself (e.g. a kf_aborter), along the lines of the sketch below.
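A minimal sketch of what such a dedicated aborter could look like, assuming the detray actor interface used by the existing aborters (a propagator state exposing _navigation and _heartbeat, and a navigation abort() call); the name kf_aborter and the max_steps value are illustrative only:

struct kf_aborter : detray::actor {

    struct state {
        // Give up on a track after this many propagation steps (illustrative value).
        unsigned int max_steps = 100000u;
        unsigned int n_steps = 0u;
    };

    template <typename propagator_state_t>
    DETRAY_HOST_DEVICE void operator()(state& abrt_state,
                                       propagator_state_t& propagation) const {

        // Called once per propagation step.
        ++abrt_state.n_steps;

        // Kill this track instead of letting it loop forever.
        if (abrt_state.n_steps > abrt_state.max_steps) {
            propagation._heartbeat &= propagation._navigation.abort();
        }
    }
};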
But as I said, the infinite loop should not happen at all, at least for the traccc KF running after the CKF.
In case it's oscillating around a surface, you could also try to either reduce the overstepping tolerance or increase the minimum step size (is that still checked correctly?). If the stepper cannot increase the step size enough after stepping onto a surface with a mini-step, we might indeed end up oscillating around that surface...
It should not happen, but pretty clearly it does. :thinking:
Adding
diff --git a/core/include/detray/propagator/propagator.hpp b/core/include/detray/propagator/propagator.hpp
index e361f4e6..3b47ec4c 100644
--- a/core/include/detray/propagator/propagator.hpp
+++ b/core/include/detray/propagator/propagator.hpp
@@ -151,8 +151,11 @@ struct propagator {
m_navigator.update(propagation, m_cfg.navigation);
// Run while there is a heartbeat
+ int i = 0;
while (propagation._heartbeat) {
+ printf("starting iteration %i\n", i++);
+
// Take the step
propagation._heartbeat &=
m_stepper.step(propagation, m_cfg.stepping);
to the code, I ended up killing my test job at:
...
starting iteration 671441
starting iteration 671442
starting iteration 671443
starting iteration 671444
starting iteration 671445
starting iteration 671446
^C
[bash][Legolas]:traccc >
It's not actually a "race condition" that I suspect here. :thinking: This code is not doing any cooperation between threads that I could see. What I suspect is a floating-point issue: that the CKF for whatever reason propagates one particle just differently enough using "fast math" that the KF doesn't reproduce it.
So yeah, I'm very much getting the sense that an aborter, based on the number of iterations, is the way to go here...
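As a standalone illustration (not traccc code) of how "fast math" can make two code paths disagree on "the same" arithmetic: under contraction (e.g. --use_fast_math with nvcc, or -ffp-contract=fast), a*b + c may be evaluated as a single fused multiply-add, which rounds once instead of twice.

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    const double eps = std::numeric_limits<double>::epsilon();
    const double a = 1.0 + eps;
    const double b = 1.0 - eps;
    const double c = -1.0;

    // Two roundings: the product 1 - eps*eps rounds to exactly 1.0, so the sum is 0.
    const double separate = a * b + c;
    // One rounding: the fused form keeps the -eps*eps term (about -4.9e-32).
    const double fused = std::fma(a, b, c);

    // Depending on the contraction settings, the compiler may turn the first
    // expression into the second one as well, which is how two evaluations of
    // "the same" arithmetic can end up on different sides of a selection cut.
    std::printf("separate = %.17g\nfused    = %.17g\n", separate, fused);
    return 0;
}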
I am well aware that it clearly happens. The context is the following: the solution should be to fix the bug in the CKF or the KF, instead of adding a hacky workaround to the source code.
I'm very happy to leave you to it. :thinking: Would you like some additional help with getting the cells that produce this behaviour? (I could put the files one-by-one onto EOS for instance.)
What I suspect is a floating-point issue: that the CKF for whatever reason propagates one particle just differently enough using "fast math" that the KF doesn't reproduce it.
Can the GPU or CUDA generate different results from the exact same sequence of the same calculations?
Would you like some additional help with getting the cells that produce this behaviour?
I would appreciate it.
Though adding an "aborter" is still something that we should consider, since this is likely not going to be the last bug in our code. Making the code print a warning and then continue would be much preferable to getting into endless loops on weird events.
For context: We are just trying to process O(100) events here. In the foreseeable future I hope that we'll be able to go up to processing millions. And at that point we'll need reasonable output about errors, and not just endlessly looping jobs. Since that's exactly what we had with the NSW reconstruction as well... :frowning:
And making the code print a warning and then continue,
Sounds legit.
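Extending the aborter sketch from above (still illustrative, not the actual implementation), the trigger could then report itself instead of failing silently:

// Inside the aborter's operator(), once the step limit is exceeded:
if (abrt_state.n_steps > abrt_state.max_steps) {
    printf("WARNING: track exceeded %u propagation steps, aborting this fit\n",
           abrt_state.max_steps);
    propagation._heartbeat &= propagation._navigation.abort();
}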
You can find the one problematic event here: https://cernbox.cern.ch/s/jQ3TYzcLX0cAgQz
A "standard setup" of the main branch (with the latest data file downloaded for the ODD geometry files that I've been using), with this one event, should reproduce this endless loop, using the sort of commands that I've put a lot of into this issue.
Though I have to admit, I didn't even try it on a different NVIDIA GPU yet, just the one in my desktop. :thinking: So it's not out of the question that a little different GPU may not even reproduce the issue.
Processing every ODD ttbar event that I made in #561, I have one that makes the reconstruction run forever. :confused:
The application is just stuck on that file, with both my CPU and GPU reporting to be busy. :thinking:
I don't see this behaviour on any of the other events that I simulated. So I can imagine two things; in the end, both of them are the same. :thinking: Since even on "bad events" we can't afford to go into an endless loop with our code.
Note that I haven't yet found which algorithm/kernel is doing it. Unfortunately, attaching cuda-gdb to a running process is a lot more difficult than I first thought. :frowning: So I thought I'd open the issue with just this little information for now.