Chamodya-ka closed this issue 2 years ago
Thanks for the report!
Could you put the modified code into a branch in your fork of traccc? It would make it easier for us to debug this issue. (Plus it would allow us to see how exactly you've set up the vecmem::binary_page_memory_resource object.)
- `seq_example_cuda` — changes on lines 74-85 and 105
- `seq_example` — changes on lines 54-60 and 75
Aight, so: good news, bad news, and more good news. The good news is that the memory resource seems to be working as intended. The bad news is that host allocation is so fast that adding an extra allocation layer actually makes it slower. Please see the following graph:
All the buddy resources are bunched up, but you can see they are all significantly faster than the plain resources, except for the host resource. In good conditions, a host allocation can take as little as 10 nanoseconds, and we can't really beat that. The good news is that we don't need to: for host allocations, just use the host resource directly, without any caching allocator.
Please note that this is a pretty badly designed benchmark on my part, because the allocation-deallocation pattern is very friendly. Still, I think it explains the discrepancy we're seeing. Also, the allocation time is amortised over a varying number of iterations, which leads to very unreliable results.
All in all, while there is room to improve the performance of the binary page memory resource, it should be considered a tool for handling expensive device allocations more than anything else. :smile:
Thanks a lot for the explanation. It makes sense to me now why the CPU algorithms are slower with caching allocators. However, I still cannot figure out why file IO for the CUDA algorithms is slower when using the binary page memory resource compared to the managed memory resource.
Just as a little update on this, I am working on an extensive rewrite that will make the buddy allocator a lot more responsive for this particular workload.
Thanks for addressing the problem, but I think there is an issue with the recent binary page memory resource.
The changes to `traccc_seq_example_cuda` are as follows, and are available here:
```cpp
#include <vecmem/memory/binary_page_memory_resource.hpp>
...
vecmem::binary_page_memory_resource bp_mr(mng_mr);
traccc::cuda::seeding_algorithm sa_cuda(bp_mr);
traccc::cuda::track_params_estimation tp_cuda(bp_mr);
traccc::cuda::clusterization_algorithm ca_cuda(bp_mr);
...
traccc::cell_container_types::host cells_per_event =
    traccc::read_cells_from_event(event, i_cfg.cell_directory,
                                  common_opts.input_data_format,
                                  surface_transforms, digi_cfg, bp_mr);
```
The program crashes with the following:
$ build/bin/traccc_seq_example_cuda --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=10 --run_cpu=0 --input-binary
Running build/bin/traccc_seq_example_cuda tml_detector/trackml-detector.csv tml_full/ttbar_mu200/ 10
*** Break *** segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
Thread 4 (Thread 0x7fc98d1d0700 (LWP 5726) "traccc_seq_exam"):
#0 0x00007fc9aff4c065 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fc98d1c6ec0, expected=0, futex_word=0x55a9aa882518) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 __pthread_cond_wait_common (abstime=0x7fc98d1c6ec0, mutex=0x55a9aa887b18, cond=0x55a9aa8824f0) at pthread_cond_wait.c:539
#2 __pthread_cond_timedwait (cond=0x55a9aa8824f0, mutex=0x55a9aa887b18, abstime=0x7fc98d1c6ec0) at pthread_cond_wait.c:667
#3 0x00007fc994380eaf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fc994434f16 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007fc9aff456db in start_thread (arg=0x7fc98d1d0700) at pthread_create.c:463
#6 0x00007fc9ae7a761f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7fc98d9d1700 (LWP 5725) "cuda-EvtHandlr"):
#0 0x00007fc9ae79abb9 in __GI___poll (fds=0x7fc980000bd0, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fc99443a3c1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fc994445d3a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fc994434f16 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fc9aff456db in start_thread (arg=0x7fc98d9d1700) at pthread_create.c:463
#5 0x00007fc9ae7a761f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7fc98e1d2700 (LWP 5724) "cuda-EvtHandlr"):
#0 0x00007fc9ae79abb9 in __GI___poll (fds=0x55a9aa0b74c0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fc99443a3c1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fc994445d3a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fc994434f16 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007fc9aff456db in start_thread (arg=0x7fc98e1d2700) at pthread_create.c:463
#5 0x00007fc9ae7a761f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fc9b1c54000 (LWP 5706) "traccc_seq_exam"):
#0 0x00007fc9ae76a372 in __GI___waitpid (pid=5729, stat_loc=stat_loc@entry=0x7ffe394c7628, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1 0x00007fc9ae6d5047 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:149
#2 0x00007fc9b045cfe3 in TUnixSystem::StackTrace() () from /usr/local/lib/ROOT/root/lib/libCore.so.6.26
#3 0x00007fc9b045fc75 in TUnixSystem::DispatchSignals(ESignals) () from /usr/local/lib/ROOT/root/lib/libCore.so.6.26
#4 <signal handler called>
#5 0x000055a9a801389c in std::pmr::polymorphic_allocator<traccc::cell>::construct<traccc::cell> (__p=<optimized out>, this=<optimized out>) at /usr/include/c++/9/new:174
#6 std::allocator_traits<std::pmr::polymorphic_allocator<traccc::cell> >::_S_construct<traccc::cell> (__a=..., __p=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:244
#7 std::allocator_traits<std::pmr::polymorphic_allocator<traccc::cell> >::construct<traccc::cell> (__a=..., __p=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:350
#8 std::__uninitialized_default_n_a<traccc::cell*, unsigned long, std::pmr::polymorphic_allocator<traccc::cell> > (__alloc=..., __n=<optimized out>, __first=0x7fc96c580000) at /usr/include/c++/9/bits/stl_uninitialized.h:649
#9 std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> >::_M_default_append (this=0x7fc96c480000, __n=961325776) at /usr/include/c++/9/bits/vector.tcc:656
#10 0x000055a9a8023444 in std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> >::resize (__new_size=<optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:937
#11 vecmem::copy::operator()<traccc::cell, traccc::cell, std::pmr::polymorphic_allocator<std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> > >, std::pmr::polymorphic_allocator<traccc::cell> > (this=this@entry=0x7ffe394ca550, from_buffer=..., to_vec=std::vector of length 13565, capacity 13565 = {...}, cptype=cptype@entry=vecmem::copy::type::unknown) at /home/chamodya/myrepos/30-06/traccc/build/_deps/vecmem-src/core/include/vecmem/utils/impl/copy.ipp:340
#12 0x000055a9a802399b in traccc::read_binary<traccc::host_container<traccc::cell_module, traccc::cell> > (in_name="../../recent/traccc/data//tml_full/ttbar_mu200/event000000000-cells.dat", copy=..., resource=...) at /home/chamodya/myrepos/30-06/traccc/io/include/traccc/io/binary.hpp:114
#13 0x000055a9a8024d24 in traccc::read_cells_from_event(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, traccc::data_format const&, traccc::module_map<unsigned long, algebra::cmath::transform3<algebra::cmath::matrix::actor<unsigned long, std::array, algebra::array::matrix_type, float, algebra::cmath::matrix::determinant::actor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::matrix::determinant::cofactor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>>, algebra::cmath::matrix::determinant::hard_coded<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>, 2ul, 4ul> >, algebra::cmath::matrix::inverse::actor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::matrix::inverse::cofactor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>>, algebra::cmath::matrix::inverse::hard_coded<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>, 2ul, 4ul> >, algebra::cmath::element_getter<unsigned long, std::array, float>, algebra::cmath::block_getter<unsigned long, std::array, float> > > >, Acts::GeometryHierarchyMap<traccc::digitization_config>, std::pmr::memory_resource&) (event=event@entry=0, cells_directory="tml_full/ttbar_mu200/", data_format=@0x7ffe394cba44: traccc::binary, surface_transforms=..., digi_config=..., resource=...) at /home/chamodya/myrepos/30-06/traccc/io/include/traccc/io/reader.hpp:95
#14 0x000055a9a80028ff in seq_run (i_cfg=..., common_opts=..., run_cpu=false) at /home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:106
#15 0x000055a9a7ffadb4 in main (argc=7, argv=0x7ffe394cbd88) at /home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:368
===========================================================
The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5 0x000055a9a801389c in std::pmr::polymorphic_allocator<traccc::cell>::construct<traccc::cell> (__p=<optimized out>, this=<optimized out>) at /usr/include/c++/9/new:174
#6 std::allocator_traits<std::pmr::polymorphic_allocator<traccc::cell> >::_S_construct<traccc::cell> (__a=..., __p=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:244
#7 std::allocator_traits<std::pmr::polymorphic_allocator<traccc::cell> >::construct<traccc::cell> (__a=..., __p=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:350
#8 std::__uninitialized_default_n_a<traccc::cell*, unsigned long, std::pmr::polymorphic_allocator<traccc::cell> > (__alloc=..., __n=<optimized out>, __first=0x7fc96c580000) at /usr/include/c++/9/bits/stl_uninitialized.h:649
#9 std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> >::_M_default_append (this=0x7fc96c480000, __n=961325776) at /usr/include/c++/9/bits/vector.tcc:656
#10 0x000055a9a8023444 in std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> >::resize (__new_size=<optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:937
#11 vecmem::copy::operator()<traccc::cell, traccc::cell, std::pmr::polymorphic_allocator<std::vector<traccc::cell, std::pmr::polymorphic_allocator<traccc::cell> > >, std::pmr::polymorphic_allocator<traccc::cell> > (this=this@entry=0x7ffe394ca550, from_buffer=..., to_vec=std::vector of length 13565, capacity 13565 = {...}, cptype=cptype@entry=vecmem::copy::type::unknown) at /home/chamodya/myrepos/30-06/traccc/build/_deps/vecmem-src/core/include/vecmem/utils/impl/copy.ipp:340
#12 0x000055a9a802399b in traccc::read_binary<traccc::host_container<traccc::cell_module, traccc::cell> > (in_name="../../recent/traccc/data//tml_full/ttbar_mu200/event000000000-cells.dat", copy=..., resource=...) at /home/chamodya/myrepos/30-06/traccc/io/include/traccc/io/binary.hpp:114
#13 0x000055a9a8024d24 in traccc::read_cells_from_event(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, traccc::data_format const&, traccc::module_map<unsigned long, algebra::cmath::transform3<algebra::cmath::matrix::actor<unsigned long, std::array, algebra::array::matrix_type, float, algebra::cmath::matrix::determinant::actor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::matrix::determinant::cofactor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>>, algebra::cmath::matrix::determinant::hard_coded<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>, 2ul, 4ul> >, algebra::cmath::matrix::inverse::actor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::matrix::inverse::cofactor<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>>, algebra::cmath::matrix::inverse::hard_coded<unsigned long, algebra::array::matrix_type, float, algebra::cmath::element_getter<unsigned long, std::array, float>, 2ul, 4ul> >, algebra::cmath::element_getter<unsigned long, std::array, float>, algebra::cmath::block_getter<unsigned long, std::array, float> > > >, Acts::GeometryHierarchyMap<traccc::digitization_config>, std::pmr::memory_resource&) (event=event@entry=0, cells_directory="tml_full/ttbar_mu200/", data_format=@0x7ffe394cba44: traccc::binary, surface_transforms=..., digi_config=..., resource=...) at /home/chamodya/myrepos/30-06/traccc/io/include/traccc/io/reader.hpp:95
#14 0x000055a9a80028ff in seq_run (i_cfg=..., common_opts=..., run_cpu=false) at /home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:106
#15 0x000055a9a7ffadb4 in main (argc=7, argv=0x7ffe394cbd88) at /home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:368
===========================================================
Moreover, FYI, when using bp_mr only for the algorithms and not for file IO (file IO with mng_mr), there is an illegal memory write in the component connection kernel :thinking:
$ compute-sanitizer build/bin/traccc_seq_example_cuda --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=10 --run_cpu=0 --input-binary
...
========= Invalid __global__ write of size 8 bytes
========= at 0x390 in traccc::cuda::kernels::connect_components(traccc::container_view<const traccc::cell_module, const traccc::cell>, vecmem::data::jagged_vector_view<unsigned int>, vecmem::data::vector_view<unsigned long>, vecmem::data::vector_view<const std::pair<unsigned long, unsigned long>>, traccc::container_view<unsigned long, traccc::cell>)
========= by thread (61,0,0) in block (149,0,0)
========= Address 0x7fd57433b780 is out of bounds
========= and is 23615223169 bytes after the nearest allocation at 0x7fcff4a00000 of size 512 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x21740c]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart803 [0x2ac4b]
========= in /home/chamodya/myrepos/30-06/traccc/build/lib/x86_64-linux-gnu/libtraccc_cuda.so.0
========= Host Frame:cudaLaunchKernel [0x85aa8]
========= in /home/chamodya/myrepos/30-06/traccc/build/lib/x86_64-linux-gnu/libtraccc_cuda.so.0
========= Host Frame:/tmp/tmpxft_00007c4a_00000000-6_clusterization_algorithm.cudafe1.stub.c:1:__device_stub__ZN6traccc4cuda7kernels18connect_componentsENS_14container_viewIKNS_11cell_moduleEKNS_4cellEEEN6vecmem4data18jagged_vector_viewIjEENS9_11vector_viewImEENSC_IKSt4pairImmEEENS2_ImS5_EE(traccc::container_view<traccc::cell_module const, traccc::cell const> const&, vecmem::data::jagged_vector_view<unsigned int>&, vecmem::data::vector_view<unsigned long>&, vecmem::data::vector_view<std::pair<unsigned long, unsigned long> const>&, traccc::container_view<unsigned long, traccc::cell>&) [0x218f8]
========= in /home/chamodya/myrepos/30-06/traccc/build/lib/x86_64-linux-gnu/libtraccc_cuda.so.0
========= Host Frame:/home/chamodya/myrepos/30-06/traccc/device/cuda/src/clusterization/clusterization_algorithm.cu:64:traccc::cuda::kernels::connect_components(traccc::container_view<traccc::cell_module const, traccc::cell const>, vecmem::data::jagged_vector_view<unsigned int>, vecmem::data::vector_view<unsigned long>, vecmem::data::vector_view<std::pair<unsigned long, unsigned long> const>, traccc::container_view<unsigned long, traccc::cell>) [0x21949]
========= in /home/chamodya/myrepos/30-06/traccc/build/lib/x86_64-linux-gnu/libtraccc_cuda.so.0
========= Host Frame:/home/chamodya/myrepos/30-06/traccc/device/cuda/src/clusterization/clusterization_algorithm.cu:243:traccc::cuda::clusterization_algorithm::operator()(traccc::host_container<traccc::cell_module, traccc::cell> const&) const [0x22ecd]
========= in /home/chamodya/myrepos/30-06/traccc/build/lib/x86_64-linux-gnu/libtraccc_cuda.so.0
========= Host Frame:/home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:148:seq_run(traccc::full_tracking_input_config const&, traccc::common_options const&, bool) [0x15a50]
========= in /home/chamodya/myrepos/30-06/traccc/build/bin/traccc_seq_example_cuda
========= Host Frame:/home/chamodya/myrepos/30-06/traccc/examples/run/cuda/seq_example_cuda.cpp:352:main [0xddb4]
========= in /home/chamodya/myrepos/30-06/traccc/build/bin/traccc_seq_example_cuda
========= Host Frame:__libc_start_main [0x21c87]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0xde8a]
========= in /home/chamodya/myrepos/30-06/traccc/build/bin/traccc_seq_example_cuda
Interesting, then our tests are clearly not thorough enough. I'll take a look later. :smile:
Hi @Chamodya-ka, the issue should be resolved now.
Great, it works now. The file IO times are considerably better, but still slower than with mng_mr.
mng_mr
$ build/bin/traccc_seq_example_cuda --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=20 --input-binary
Running build/bin/traccc_seq_example_cuda tml_detector/trackml-detector.csv tml_full/ttbar_mu200/ 20
FIO 0 0.157383
FIO 1 0.0516271
FIO 2 0.0503711
FIO 3 0.0501587
FIO 4 0.0534
FIO 5 0.0513502
FIO 6 0.0545315
FIO 7 0.0543926
FIO 8 0.0563914
FIO 9 0.0534329
FIO 10 0.0507833
FIO 11 0.052829
FIO 12 0.0507948
FIO 13 0.0540739
FIO 14 0.0525085
FIO 15 0.0502023
FIO 16 0.0550176
FIO 17 0.0527754
FIO 18 0.054672
FIO 19 0.0579246
==> Statistics ...
- read 1817250 spacepoints from 265721 modules
- created 6329916 cells
- created 1817250 meaurements
- created 1817250 spacepoints
- created (cpu) 0 seeds
- created (cuda) 292502 seeds
==> Elpased time ...
wall time 7.92729
file reading (cpu) 1.16462
clusterization_time (cpu) 0.333311
spacepoint_formation_time (cpu) 0.0249196
clusterization and sp formation (cuda) 0.62026
seeding_time (cpu) 1.0135e-05
seeding_time (cuda) 0.724654
tr_par_esti_time (cpu) 2.757e-06
tr_par_esti_time (cuda) 0.0445055
bp_mr
build/bin/traccc_seq_example_cuda --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=20 --input-binary
Running build/bin/traccc_seq_example_cuda tml_detector/trackml-detector.csv tml_full/ttbar_mu200/ 20
FIO 0 0.327628
FIO 1 0.189136
FIO 2 0.185138
FIO 3 0.181468
FIO 4 0.18743
FIO 5 0.181726
FIO 6 0.187076
FIO 7 0.184497
FIO 8 0.176061
FIO 9 0.173971
FIO 10 0.168896
FIO 11 0.176432
FIO 12 0.172278
FIO 13 0.175711
FIO 14 0.166655
FIO 15 0.165695
FIO 16 0.175379
FIO 17 0.170573
FIO 18 0.173659
FIO 19 0.174399
==> Statistics ...
- read 1817250 spacepoints from 265721 modules
- created 6329916 cells
- created 1817250 meaurements
- created 1817250 spacepoints
- created (cpu) 0 seeds
- created (cuda) 292504 seeds
==> Elpased time ...
wall time 8.32788
file reading (cpu) 3.69381
clusterization_time (cpu) 0.345183
spacepoint_formation_time (cpu) 0.0248178
clusterization and sp formation (cuda) 0.696769
seeding_time (cpu) 1.0067e-05
seeding_time (cuda) 0.708191
tr_par_esti_time (cpu) 2.232e-06
tr_par_esti_time (cuda) 0.0475086
This is pretty much as expected. Glad to know it's working now; I'll close this issue, and we can perhaps iterate on it later. :slightly_smiling_face:
Using binary page memory for file IO and algorithm IO in traccc consumes more time than not using any downstream memory resource at all, which is the opposite of what is expected. The times also get worse with each event. Provided below are file IO times for `traccc_seq_example_cuda` computing 10 events, using a managed memory resource and a binary page memory resource with managed memory as upstream.

Managed memory resource:

Binary page memory resource:
In addition, when using the binary page memory resource with a host memory resource as upstream for `traccc_seq_example` (CPU), algorithm IO takes a really long time. Below are 2 events.

Host memory resource:

Binary page memory resource: