JeffersonLab / JANA2

Multi-threaded HENP Event Reconstruction
https://jeffersonlab.github.io/JANA2/

Error: m_topology->running_arrow_count >= 0 #227

Status: Closed (closed by faustus123 4 months ago)

faustus123 commented 1 year ago

The jana built-in benchmark was run on ejfat-5.jlab.org and crashed with the error printed below. The assert was hit when scaling to 72 threads (the machine reports 128 cores, so this was in the middle of the test). I ran it a second time and it hit the assert when scaling to 73 threads, so the thread count at which it fails is not perfectly reproducible.

The command run was:

jana -Pplugins=JTest -b
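
Because the failure shows up at a different thread count from run to run, it may help to repeat the benchmark and record where each run aborts. The following is a minimal sketch, not part of JANA2. It assumes `jana` is on the PATH, that the command exits on its own, and that the output contains the `Setting NTHREADS = N ...` and `Assertion ... failed` lines quoted in this report. The helper names `summarize_run`, `run_benchmark_once`, and `repeat_benchmark` are hypothetical.

```python
import re
import subprocess

# Patterns taken from the benchmark output quoted in this report.
NTHREADS_RE = re.compile(r"^Setting NTHREADS = (\d+)", re.MULTILINE)
ASSERT_RE = re.compile(r"Assertion `(.+?)' failed")


def summarize_run(output):
    """Return (last NTHREADS setting seen, failed assertion text or None)."""
    settings = NTHREADS_RE.findall(output)
    last_nthreads = int(settings[-1]) if settings else None
    match = ASSERT_RE.search(output)
    return last_nthreads, (match.group(1) if match else None)


def run_benchmark_once(cmd=("jana", "-Pplugins=JTest", "-b")):
    """Run the benchmark command once and summarize its combined output."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the assert message goes to stderr
        text=True,
    )
    return summarize_run(proc.stdout)


def repeat_benchmark(runs, cmd=("jana", "-Pplugins=JTest", "-b")):
    """Run the benchmark several times and collect one summary per run."""
    return [run_benchmark_once(cmd) for _ in range(runs)]
```

Calling `repeat_benchmark(5)` would return five `(last_nthreads, assertion)` pairs. A spread of values in the first element would confirm that the failure point varies between runs. Each run takes as long as a full benchmark pass, so the repetition count is a judgment call.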

The machine specs are:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    16
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7763 64-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         1500.000
CPU max MHz:                     3529.0520
CPU min MHz:                     1500.0000
BogoMIPS:                        4899.90
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB

Last part of the output, including the error message:

nthreads=69  rate=120.958Hz  (avg = 143.939 +/- 8.57713 Hz)
nthreads=69  rate=159.95Hz  (avg = 146.607 +/- 7.55132 Hz)
nthreads=69  rate=149.952Hz  (avg = 147.085 +/- 6.48766 Hz)
nthreads=69  rate=129.962Hz  (avg = 144.945 +/- 6.01944 Hz)
nthreads=69  rate=159.932Hz  (avg = 146.61 +/- 5.57619 Hz)
nthreads=69  rate=130.958Hz  (avg = 145.045 +/- 5.23364 Hz)
nthreads=69  rate=148.954Hz  (avg = 145.4 +/- 4.76991 Hz)
nthreads=69  rate=159.903Hz  (avg = 146.609 +/- 4.52294 Hz)
nthreads=69  rate=141.951Hz  (avg = 146.25 +/- 4.18919 Hz)
nthreads=69  rate=137.957Hz  (avg = 145.658 +/- 3.93162 Hz)
nthreads=69  rate=159.91Hz  (avg = 146.608 +/- 3.78258 Hz)
Setting NTHREADS = 70 ...
[INFO] Scaling to 70 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
[INFO] JArrowProcessingController: scale(): All workers are stopped
[INFO] JArrowProcessingController: scale(): Restarting 70 workers
nthreads=70  rate=159.955Hz  (avg = 159.955 +/- -nan Hz)
nthreads=70  rate=159.916Hz  (avg = 159.935 +/- 0.0103851 Hz)
nthreads=70  rate=156.941Hz  (avg = 158.937 +/- 0.81497 Hz)
nthreads=70  rate=122.96Hz  (avg = 149.943 +/- 7.81328 Hz)
nthreads=70  rate=159.953Hz  (avg = 151.945 +/- 6.50205 Hz)
nthreads=70  rate=159.903Hz  (avg = 153.271 +/- 5.55201 Hz)
nthreads=70  rate=119.955Hz  (avg = 148.512 +/- 6.48565 Hz)
nthreads=70  rate=159.953Hz  (avg = 149.942 +/- 5.83049 Hz)
nthreads=70  rate=159.923Hz  (avg = 151.051 +/- 5.28708 Hz)
nthreads=70  rate=130.956Hz  (avg = 149.041 +/- 5.12606 Hz)
nthreads=70  rate=148.948Hz  (avg = 149.033 +/- 4.66006 Hz)
nthreads=70  rate=159.875Hz  (avg = 149.936 +/- 4.35843 Hz)
nthreads=70  rate=119.946Hz  (avg = 147.629 +/- 4.59331 Hz)
nthreads=70  rate=159.933Hz  (avg = 148.508 +/- 4.34847 Hz)
nthreads=70  rate=158.902Hz  (avg = 149.201 +/- 4.11341 Hz)
Setting NTHREADS = 71 ...
[INFO] Scaling to 71 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
[INFO] JArrowProcessingController: scale(): All workers are stopped
[INFO] JArrowProcessingController: scale(): Restarting 71 workers
nthreads=71  rate=119.959Hz  (avg = 119.959 +/- 0.0187951 Hz)
nthreads=71  rate=159.951Hz  (avg = 139.955 +/- 14.1393 Hz)
nthreads=71  rate=159.918Hz  (avg = 146.609 +/- 10.88 Hz)
nthreads=71  rate=138.924Hz  (avg = 144.688 +/- 8.32788 Hz)
nthreads=71  rate=140.946Hz  (avg = 143.939 +/- 6.69584 Hz)
nthreads=71  rate=159.901Hz  (avg = 146.6 +/- 6.08542 Hz)
nthreads=71  rate=146.951Hz  (avg = 146.65 +/- 5.21628 Hz)
nthreads=71  rate=132.966Hz  (avg = 144.939 +/- 4.83657 Hz)
nthreads=71  rate=159.946Hz  (avg = 146.607 +/- 4.57759 Hz)
nthreads=71  rate=159.921Hz  (avg = 147.938 +/- 4.3091 Hz)
nthreads=71  rate=119.973Hz  (avg = 145.396 +/- 4.60666 Hz)
nthreads=71  rate=159.951Hz  (avg = 146.609 +/- 4.37954 Hz)
nthreads=71  rate=157.948Hz  (avg = 147.481 +/- 4.1286 Hz)
nthreads=71  rate=121.967Hz  (avg = 145.659 +/- 4.21678 Hz)
nthreads=71  rate=159.958Hz  (avg = 146.612 +/- 4.04198 Hz)
Setting NTHREADS = 72 ...
[INFO] Scaling to 72 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
jana: /home/davidl/work/2023.06.26.JANA2/JANA2/src/libraries/JANA/Engine/JScheduler.cc:74: JArrow* JScheduler::next_assignment(uint32_t, JArrow*, JArrowMetrics::Status): Assertion `m_topology->running_arrow_count >= 0' failed.
Aborted (core dumped)
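
The per-sample rates in the output above can be summarized per thread count when comparing runs or machines. This is a minimal sketch that assumes the `nthreads=N  rate=XHz` line format shown in this log; the function names are hypothetical and nothing here is part of JANA2. It recomputes the statistics from the raw `rate=` values instead of relying on the printed running average.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches lines such as: nthreads=70  rate=159.955Hz  (avg = ...)
RATE_RE = re.compile(r"^nthreads=(\d+)\s+rate=([0-9.]+)Hz", re.MULTILINE)


def rates_by_nthreads(log_text):
    """Group the reported rates (in Hz) by thread count."""
    grouped = defaultdict(list)
    for nthreads, rate in RATE_RE.findall(log_text):
        grouped[int(nthreads)].append(float(rate))
    return dict(grouped)


def summarize(log_text):
    """Return {nthreads: (sample count, mean, min, max)} sorted by nthreads."""
    grouped = rates_by_nthreads(log_text)
    return {
        n: (len(rates), mean(rates), min(rates), max(rates))
        for n, rates in sorted(grouped.items())
    }
```

Passing the saved benchmark output as a string to `summarize` gives one tuple per thread count, which makes it easier to see whether throughput has plateaued before the crash.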

nathanwbrei commented 10 months ago

I think I've addressed the root cause with #259, although I'll be a lot more confident after doing performance/stress testing.

nathanwbrei commented 10 months ago

I did some performance testing on a farm18 node and didn't see any crashes. Note that in order to reach the ~150 Hz rate that you saw, I needed to set -Pjtest:parser_ms=0, as per #106.