bsc-performance-tools / extrae

Instrumentation framework to generate execution traces of the most used parallel runtimes.
https://tools.bsc.es/extrae
GNU Lesser General Public License v2.1

mpi2prv: merging ends with segmentation fault #69

Open vineetsoni opened 1 year ago

vineetsoni commented 1 year ago

Environment:

Extrae-4.0.1 built with GCC 10.2.1, Intel MPI 2021.5, PAPI 6.0.0.1 and Libunwind-1.6.2 on AlmaLinux 8.5

Execution command

mpi2prv -syn -f TRACE.mpits -e <exe> -o <output>.prv

Error log

mpi2prv: Error! File -syn does not contain a valid extension!. Skipping.
mpi2prv: Retrieving hardware counters definitions for ptask 1 from global SYM.
mpi2prv: A total of 6 symbols were imported from TRACE.sym file
mpi2prv: 0 function symbols imported
mpi2prv: 6 HWC counter descriptions imported
merger: Output trace format is: Paraver
merger: Extrae 4.0.1
mpi2prv: Assigned nodes < myhostname >
mpi2prv: Assigned size per processor < <1 Mbyte >
mpi2prv: File /u/vinson3z/Downloads/Gearbox_explicit_20220722/extrae-results/set-0/TRACE@myhostname.0003064535000036000000.mpit is object 1.37.1 on node myhostname assigned to processor 0
mpi2prv: Time synchronization has been turned off
mpi2prv: Checking for target directory existence... exists, ok!
mpi2prv: Selected output trace format is Paraver
mpi2prv: Stored trace format is Paraver
mpi2prv: Enabling Time Synchronization (Node).
WARNING: TimeSync_CalculateLatencies: Task 0 was not initialized. Synchronization disabled!
mpi2prv: Circular buffer enabled at tracing time? NO
mpi2prv: Parsing intermediate files
mpi2prv: Progress 1 of 2 ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Processor 0 succeeded to translate its assigned files
mpi2prv: Elapsed time translating files: 0 hours 0 minutes 0 seconds
mpi2prv: Elapsed time sorting addresses: 0 hours 0 minutes 0 seconds
mpi2prv: Generating tracefile (intermediate buffers of 6710784 events)
         This process can take a while. Please, be patient.
mpi2prv: Progress 2 of 2 ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Elapsed time merge step: 0 hours 0 minutes 0 seconds
mpi2prv: Resulting tracefile occupies 126494 bytes
mpi2prv: Removing temporal files... done
mpi2prv: Elapsed time removing temporal files: 0 hours 0 minutes 0 seconds
Segmentation fault (core dumped)

Error backtrace (gdb)

0x000000000041a614 in ObjectTable_dumpAddresses (fd=fd@entry=0x5e3210, eventstart=41000001, eventstart@entry=41000000) at ../../../src/merger/common/object_tree.c:294
294                             for (_address = 0; _address < task_info->binary_objects[0].nDataSymbols; _address++)
Missing separate debuginfos, use: dnf debuginfo-install zlib-1.2.11-18.el8_5.x86_64
(gdb) bt
#0  0x000000000041a614 in ObjectTable_dumpAddresses (fd=fd@entry=0x5e3210, eventstart=41000001, eventstart@entry=41000000) at ../../../src/merger/common/object_tree.c:294
#1  0x000000000040d922 in Labels_GeneratePCFfile (name=name@entry=0x7fffffff73a0 "sphflow.pcf", options=options@entry=1041) at ../../../src/merger/paraver/labels.c:1066
#2  0x0000000000410fa0 in Paraver_ProcessTraceFiles (nfiles=1, files=0x5d42f0, num_appl=<optimized out>, NodeCPUinfo=NodeCPUinfo@entry=0x5d5b60, numtasks=numtasks@entry=1,
    taskid=taskid@entry=0) at ../../../src/merger/paraver/trace_to_prv.c:678
#3  0x00000000004046c3 in merger_post (numtasks=numtasks@entry=1, taskid=taskid@entry=0) at ../../../src/merger/common/mpi2out.c:1485
#4  0x0000000000406337 in merger_post (numtasks=numtasks@entry=1, taskid=taskid@entry=0) at ../../../src/merger/common/mpi2out.c:1366
#5  0x0000000000403a6e in main (argc=8, argv=0x7fffffff8d68) at ../../../src/merger/merger.c:69

Info

The traces are generated without any error. The exact same error is also observed using Extrae-3.8.3.

Is there a fix for this problem, or am I using the tool incorrectly?

vineetsoni commented 1 year ago

More info on the generation of traces:

No extrae.xml file was used, but using one does not change the outcome.

The following environment variables were set before launching the trace generation:

export EXTRAE_HOME=/u/vinson3z/tools/install/extrae-4.0.1_gcc10
export EXTRAE_ON=1
export EXTRAE_COUNTERS=PAPI_L2_DCA,PAPI_L2_DCM,PAPI_L3_TCA,PAPI_L3_TCM,PAPI_TOT_CYC,PAPI_TOT_INS
export EXTRAE_INITIAL_MODE=detail
export EXTRAE_MPI_COUNTERS_ON=1
export EXTRAE_FUNCTIONS_COUNTERS_ON=1

vineetsoni commented 1 year ago

I think the problem is coming from somewhere else. In my old Extrae runs, $EXTRAE_FINAL_DIR/TRACE.mpits had as many lines as the number of MPI processes.

In the recent runs, however, $EXTRAE_FINAL_DIR/TRACE.mpits has only 1 line, even though $EXTRAE_DIR contains one .mpit file and one .sym file per MPI process.
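A quick way to confirm this mismatch is to compare the two counts directly. The following is a minimal sketch, not part of Extrae: the function names are hypothetical, and it assumes that TRACE.mpits holds one entry per non-empty line and that the .mpit files live somewhere below the directory passed in (for example in set-0/, as in the log above).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Extrae commands.

# count_mpits_entries FILE: number of non-empty lines in FILE
# (assumes one .mpit entry per line).
count_mpits_entries() {
    grep -c . "$1"
}

# count_mpit_files DIR: number of *.mpit files found recursively below DIR.
# The pattern does not match the TRACE.mpits file itself.
count_mpit_files() {
    find "$1" -type f -name '*.mpit' | wc -l | tr -d ' '
}

# check_mpits FILE DIR: print both counts; return non-zero if they differ.
check_mpits() {
    entries=$(count_mpits_entries "$1")
    files=$(count_mpit_files "$2")
    echo "entries in $1: $entries"
    echo ".mpit files under $2: $files"
    [ "$entries" -eq "$files" ]
}
```

It could be run as check_mpits "$EXTRAE_FINAL_DIR/TRACE.mpits" "$EXTRAE_DIR"; a non-zero exit status would indicate the situation described above.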

Bug

$EXTRAE_FINAL_DIR/TRACE.mpits is generated incompletely from $EXTRAE_DIR/*.mpit: it lists only one of the .mpit files.

Manual fix

mpi2prv works once the missing lines are added manually to $EXTRAE_FINAL_DIR/TRACE.mpits.
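To see which entries need to be added, one option is to list the .mpit files whose names do not appear in TRACE.mpits. This is a sketch under assumptions: list_missing_mpit is a hypothetical helper, and it assumes each entry in TRACE.mpits contains the name of the .mpit file it refers to. It only reports the files and does not write to TRACE.mpits, because the exact line format is not shown in this thread; an existing line from an older, complete run is the safest template.

```shell
#!/bin/sh
# Sketch only: list_missing_mpit is a hypothetical helper, not an Extrae tool.

# list_missing_mpit FILE DIR: print every *.mpit file below DIR whose file
# name does not occur anywhere in FILE (fixed-string match on the base name).
list_missing_mpit() {
    find "$2" -type f -name '*.mpit' | sort | while IFS= read -r path; do
        name=$(basename "$path")
        # Report the file only when its name is absent from FILE.
        grep -qF -- "$name" "$1" || printf '%s\n' "$path"
    done
}
```

For example, list_missing_mpit "$EXTRAE_FINAL_DIR/TRACE.mpits" "$EXTRAE_DIR" would print one path per .mpit file that still has to be listed.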