acts-project / traccc

Demonstrator tracking chain on accelerators
Mozilla Public License 2.0
29 stars 46 forks source link

(Asynchronous) Error in CUDA Clusterization, main branch (2024.05.04.) #565

Open krasznaa opened 4 months ago

krasznaa commented 4 months ago

The ODD ttbar simulation just keeps on giving...

While I can process the low-ish $\mu$ samples successfully, at and above $\mu$ = 100 I run into:

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu100/ --input-events=2

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu100/
  Number of input events        : 2
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 9843 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu100/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 11494 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu100/event000000001-cells.csv
terminate called after throwing an instance of 'std::runtime_error'
  what():  /data/ssd-1tb/projects/traccc/traccc/out/build/sycl/_deps/vecmem-src/cuda/src/memory/device_memory_resource.cpp:60 Failed to execute: cudaFree(p) (an illegal memory access was encountered)
Aborted (core dumped)
[bash][Legolas]:traccc >

There seem to be multiple things going wrong actually... :thinking:

  1. There is something wrong with the partitioning logic. :frowning: Since in a debug build, I run into this assertion out of the gate:
WARNING: @traccc::io::csv::read_cells: 9843 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu100/event000000000-cells.csv

Assertion failed.

Thread 1 "traccc_seq_exam" received signal CUDA_EXCEPTION_12, Warp Assert.
[Switching focus to CUDA kernel 0, grid 1, block (105,0,0), thread (0,0,0), device 0, sm 7, warp 5, lane 0]
0x00007fffd7274ea0 in __assert_fail ()
(cuda-gdb) bt
#0  0x00007fffd7274ea0 in __assert_fail ()
#1  0x00007fffd7265db0 in traccc::device::ccl_kernel<traccc::cuda::barrier> (threadId=0, blckDim=128, blockId=105, cells_view=..., modules_view=..., max_cells_per_partition=1536, target_cells_per_partition=1024, partition_start=<error reading variable: Unknown storage specifier (read) 0x10000>, partition_end=<error reading variable: Unknown storage specifier (read) 0x10000>, 
    outi=<error reading variable: Unknown storage specifier (read) 0x10000>, f_view=..., gf_view=..., barrier=..., measurements_view=..., cell_links=...) at /data/ssd-1tb/projects/traccc/traccc/device/common/include/traccc/clusterization/device/impl/ccl_kernel.ipp:218
#2  0x00007fffd7262870 in traccc::cuda::kernels::ccl_kernel<<<(187,1,1),(128,1,1)>>> (cells_view=..., modules_view=..., max_cells_per_partition=1536, target_cells_per_partition=1024, measurements_view=..., cell_links=...) at /data/ssd-1tb/projects/traccc/traccc/device/cuda/src/clusterization/clusterization_algorithm.cu:42
(cuda-gdb)

So https://github.com/acts-project/traccc/blob/main/device/common/include/traccc/clusterization/device/impl/ccl_kernel.ipp#L153-L197 makes some mistake with setting up the partition_start and partition_end values. :thinking:

  1. While looking at the algorithm, I also found a very unsettling thing...

https://github.com/acts-project/traccc/blob/main/device/cuda/src/clusterization/clusterization_algorithm.cu#L80-L100

The silly cell_links buffer is allowed to be destroyed before the (asynchronous) kernel would stop running. :frowning:

The latter one is fully on me, I was the one who introduced this behaviour in #545, mea culpa. But the first one I could do some help / insights with.

Pinging @beomki-yeo for info. :wink:

krasznaa commented 4 months ago

In a dumb way I can make clusterization succeed with:

diff --git a/device/common/include/traccc/clusterization/device/ccl_kernel.hpp b/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
index a61ba7a4..473dc36d 100644
--- a/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
+++ b/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
@@ -29,7 +29,7 @@ namespace details {
 using index_t = unsigned short;

 static constexpr int TARGET_CELLS_PER_THREAD = 8;
-static constexpr int MAX_CELLS_PER_THREAD = 12;
+static constexpr int MAX_CELLS_PER_THREAD = 16;

 /// Helper struct for calculating some of the input parameters of @c ccl_kernel
 struct ccl_kernel_helper {
@@ -46,7 +46,7 @@ struct ccl_kernel_helper {
         max_cells_per_partition =
             (target_cells_per_partition * MAX_CELLS_PER_THREAD +
              TARGET_CELLS_PER_THREAD - 1) /
-            TARGET_CELLS_PER_THREAD;
+            TARGET_CELLS_PER_THREAD * 2;
         threads_per_partition =
             (target_cells_per_partition + TARGET_CELLS_PER_THREAD - 1) /
             TARGET_CELLS_PER_THREAD;

Though I'm really doing this in a very naive / dumb way at the moment. Do we expect that really busy pixel modules would result in too many cells getting assigned to threads / threadblocks with the current logic of the algorithm? :thinking:

Unfortunately once this is solved, the debug build then fails on:

Detector check: OK
WARNING: @traccc::io::csv::read_cells: 9843 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu100/event000000000-cells.csv

Assertion failed.

Thread 1 "traccc_seq_exam" received signal CUDA_EXCEPTION_12, Warp Assert.
[Switching focus to CUDA kernel 5, grid 45, block (57,0,0), thread (25,0,0), device 0, sm 47, warp 1, lane 25]
0x00007fff90f695a0 in __assert_fail ()
(cuda-gdb) bt
#0  0x00007fff90f695a0 in __assert_fail ()
#1  0x00007ffeebd33a00 in bool detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>::step<detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::state>(detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::state&, detray::stepping::config<float> const&) ()
#2  0x00007fff2f7d9020 in bool detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::propagate_sync<detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::state, std::tuple<detray::pathlimit_aborter::state&, detray::parameter_transporter<detray::cmath<float> >::state&, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >::state&, detray::pointwise_material_interactor<detray::cmath<float> >::state&, traccc::ckf_aborter::state&> >(detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::state&, std::tuple<detray::pathlimit_aborter::state&, detray::parameter_transporter<detray::cmath<float> >::state&, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >::state&, detray::pointwise_material_interactor<detray::cmath<float> >::state&, traccc::ckf_aborter::state&>&&) ()
#3  0x00007fff2f7d4bd0 in void traccc::cuda::kernels::propagate_to_next_surface<detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >, covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, traccc::finding_config<float> >(traccc::finding_config<float>, detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::detector_type::view_type, covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, vecmem::data::jagged_vector_view<detray::propagator<detray::rk_stepper<covfie::field_view<covfie::backend::constant<covfie::vector::vector_d<float, 3ul>, covfie::vector::vector_d<float, 3ul> > >, detray::cmath<float>, detray::constrained_step<detray::darray>, detray::stepper_rk_policy, detray::stepping::void_inspector, detray::darray>, detray::navigator<detray::detector<detray::default_metadata, detray::container_types<vecmem::device_vector, detray::tuple, detray::darray, vecmem::jagged_device_vector, detray::dmap> > const, detray::navigation::void_inspector, detray::intersection2D<detray::surface_descriptor<detray::detail::typed_index<detray::default_metadata::mask_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, detray::detail::typed_index<detray::default_metadata::material_ids, unsigned int, unsigned int, 4026531840u, 268435455u>, unsigned int, unsigned short>, detray::cmath<float> > >, detray::actor_chain<std::tuple, detray::pathlimit_aborter, detray::parameter_transporter<detray::cmath<float> >, traccc::interaction_register<detray::pointwise_material_interactor<detray::cmath<float> > >, detray::pointwise_material_interactor<detray::cmath<float> >, traccc::ckf_aborter> >::intersection_type>, vecmem::data::vector_view<detray::bound_track_parameters<detray::cmath<float> > const>, vecmem::data::vector_view<traccc::candidate_link const>, unsigned int, unsigned int const&, vecmem::data::vector_view<detray::bound_track_parameters<detray::cmath<float> > >, vecmem::data::vector_view<unsigned int>, vecmem::data::vector_view<thrust::pair<int, unsigned int> >, vecmem::data::vector_view<unsigned int>, unsigned int&)<<<(111,1,1),(64,1,1)>>> ()
(cuda-gdb)

But I'll open a separate ticket about that one, once I understood a bit better what is going on there.

krasznaa commented 4 months ago

So... :thinking: By doing

diff --git a/device/common/include/traccc/clusterization/device/ccl_kernel.hpp b/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
index a61ba7a4..2720bab9 100644
--- a/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
+++ b/device/common/include/traccc/clusterization/device/ccl_kernel.hpp
@@ -29,7 +29,7 @@ namespace details {
 using index_t = unsigned short;

 static constexpr int TARGET_CELLS_PER_THREAD = 8;
-static constexpr int MAX_CELLS_PER_THREAD = 12;
+static constexpr int MAX_CELLS_PER_THREAD = 32;

 /// Helper struct for calculating some of the input parameters of @c ccl_kernel
 struct ccl_kernel_helper {

, I can make CUDA process the $\mu$ = 300 ttbar events as well. As long as I don't enable assertions...

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu300/ --input-events=10 --input-skip=0

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu300/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 25530 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 31945 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 26973 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 27856 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 24369 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 26512 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 27034 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 25225 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 25553 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 22854 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu300/event000000009-cells.csv
==> Statistics ... 
- read    5572599 cells from 179427 modules
- created (cpu)  0 measurements     
- created (cuda)  1517631 measurements     
- created (cpu)  0 spacepoints     
- created (cuda) 1517631 spacepoints     
- created  (cpu) 0 seeds
- created (cuda) 372048 seeds
- found (cpu)    0 tracks
- found (cuda)   693397 tracks
- fitted (cpu)   0 tracks
- fitted (cuda)  693397 tracks
==>Elapsed times...
           File reading  (cpu)  9997 ms
         Clusterization (cuda)  76 ms
   Spacepoint formation (cuda)  3 ms
                Seeding (cuda)  109 ms
           Track params (cuda)  2 ms
          Track finding (cuda)  2046 ms
          Track fitting (cuda)  551 ms
                     Wall time  13038 ms
[bash][Legolas]:traccc >

It's not great that we have such a fragile hardcoded maximum in our code, but for today this will have to do...

stephenswat commented 4 months ago

Yes, if there are too many cells before a viable partition point, then the code is expected to crash. The envisioned way out here is to do some on-device memory allocation (or to use some global memory scratch space) to process the excess cells, but this is not currently implemented as the probability of failure is astronomically small (this for a maximum partition size of 2048):

image

I doubt that's what's going on here, though. Unless the occupancy here is much higher than 1%.