bsc-performance-tools / extrae

Instrumentation framework to generate execution traces of the most used parallel runtimes.
https://tools.bsc.es/extrae
GNU Lesser General Public License v2.1

Can not get user-functions #101

Open Suiiiii opened 3 months ago

Suiiiii commented 3 months ago

Hi, Extrae developers,

I finally installed Extrae on our cluster system. However, there is a problem: the user functions cannot be inspected in Paraver, as they could with traces from my PC.

The run also prints this warning:

WARNING: Negative value for MEMUSAGE_INUSE_EV detected (inuse=6569984+-1179709440-180640=-1173320096). Please submit a bug report.

Is that related to the user-functions problem? How can I fix it?
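
A quick check of the arithmetic in that warning, as a sketch. The three terms are copied from the message. Reading the middle term as a wrapped 32-bit signed counter is an assumption, not something the log confirms: it would fit if the numbers come from an interface with `int` fields, such as glibc's `mallinfo`, which wraps once a value passes 2 GiB.

```shell
#!/bin/bash
# Terms copied from the warning: inuse = a + b - c
a=6569984
b=-1179709440
c=180640

# Reproduce the reported sum (bash arithmetic is 64-bit signed).
inuse=$(( a + b - c ))
echo "inuse as reported: $inuse"

# Assumption: the middle term is a wrapped 32-bit signed counter.
# Masking with 0xFFFFFFFF reinterprets it as an unsigned 32-bit value.
b_unsigned=$(( b & 0xFFFFFFFF ))
echo "middle term as unsigned 32-bit: $b_unsigned bytes"
```

If the assumption holds, the middle term stands for roughly 2.9 GiB rather than a negative amount, which would point at counter width and not at user-function tracing.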

Welcome to Extrae 4.0.4
Extrae: Parsing the configuration file (extrae_v230705.xml) begins
Extrae: Tracing package is located on /.....
Extrae: Generating intermediate files for Paraver traces.
Extrae: MPI routines will collect HW counters information.
Extrae: All MPI_Comm_* calls will be traced.
Extrae: Warning! <openmp> tag will be ignored. This library does not support OpenMP.
Extrae: Number of user functions traced (XL runtime): 28
Extrae: Number of user functions traced (GCC runtime): 11
Extrae: User Function routines will collect HW counters information.
Extrae: Warning! change-at-time time units not specified. Using seconds
Extrae: PAPI domain set to ALL for HWC set 1
Extrae: HWC set 1 contains following counters < PAPI_TOT_INS (0x80000032) PAPI_TOT_CYC (0x8000003b) PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) > - never changes
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 2, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 3, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 4, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 5, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 1, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 0, thread 0)
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
Extrae: Resource usage is enabled at flush buffer.
Extrae: Memory usage is enabled at flush buffer.
Extrae: Tracing buffer can hold 50000000 events
Extrae: Circular buffer disabled.
Extrae: Dynamic memory instrumentation is disabled.
Extrae: Basic I/O memory instrumentation is disabled.
Extrae: System calls instrumentation is disabled.
Extrae: Parsing the configuration file (extrae_v230705.xml) has ended
Extrae: Intermediate traces will be stored in /.......
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 6 tasks and 1 threads

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
Extrae: Temporal directory (/......) is shared among processes.
Extrae: Final directory (/......) is shared among processes.
Extrae: Successfully initiated with 6 tasks and 1 threads

...

Simulation is finished
...

WARNING: Negative value for MEMUSAGE_INUSE_EV detected (inuse=6569984+-1179709440-180640=-1173320096). Please submit a bug report.
Extrae: Intermediate raw trace file created :......mpit
Extrae: Intermediate raw trace file created : ......mpit
Extrae: Intermediate raw sym file created : ......sym
Extrae: Intermediate raw trace file created : .....mpit
Extrae: Intermediate raw sym file created : ......sym
Extrae: Intermediate raw sym file created : ......sym
Extrae: Deallocating memory.
Extrae: Intermediate raw trace file created :......mpit
Extrae: Application has ended. Tracing has been terminated.
Extrae: Intermediate raw sym file created : .......sym
Extrae: Intermediate raw trace file created : .....mpit
Extrae: Intermediate raw sym file created : .......sym
Extrae: Intermediate raw trace file created :......mpit
Extrae: Intermediate raw sym file created : ......sym

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2422615 RUNNING AT node0533
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 2422617 RUNNING AT node0533
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 2422618 RUNNING AT node0533
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 2422619 RUNNING AT node0533
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 5 PID 2422620 RUNNING AT node0533
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

gllort commented 3 months ago

Hi, there seem to be two issues here, one related to capturing of user functions, and another related to instrumenting dynamic memory calls. To try to reduce the issues, I would suggest temporarily disabling support for memory calls, and focusing on the first problem with user functions. To do so, turn off the following option in the configuration file 'extrae_v230705.xml': . Does this change get rid of the message "WARNING: Negative value for MEMUSAGE_INUSE_EV..."? Are you still getting the "BAD TERMINATION" ones?

Turning to the user functions, a few things to check:

Suiiiii commented 3 months ago

Hi,

  1. The compiler is Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
  2. Yes, the flags are -g -xCORE-AVX512 -qopenmp-simd -qopt-prefetch=2 -qopt-zmm-usage=high -fno-fnalias -finstrument-functions. The file that contains the addresses follows this pattern:
    f1 # 000000000042aa80
    f2 # 000000000042c460
    f3 # 000000000041c490
    f4 # 0000000000416220
    f5 # 000000000042ab80
    f6 # 00000000004357e0
    f7 # 0000000000434400
    f8 # 000000000041fe90
    ...
  3. No. On my local PC I didn't use -rdynamic and it works: I can see my user functions. I also compiled with -rdynamic, but it does not help.
  4. I always use the addresses.
  5. I just check it with Paraver. For the trace generated on the cluster, the workspaces show no user-functions option:

    [image]

    But if the user functions are collected correctly, the workspaces show the option:

    [image]
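
Because the list is keyed by address, it only matches the exact binary it was generated from; a rebuild on another machine can move every function. A minimal sketch of a consistency check follows. `missing_addresses` is a hypothetical helper, and it assumes 16-digit hexadecimal addresses as in the pattern above and a symbol table saved from `nm`.

```shell
#!/bin/bash
# Hypothetical helper: print every address from the user-functions list
# that does not appear in the symbol table.
#   $1 = user-functions list file
#   $2 = file holding the output of `nm <binary>`
missing_addresses () {
  local list=$1 symbols=$2 addr
  # Take each 16-digit hex token from the list, whatever the column order.
  grep -oE '[0-9a-fA-F]{16}' "$list" | while read -r addr; do
    # nm prints the address at the start of the line, followed by a space.
    grep -qi "^$addr " "$symbols" || echo "$addr"
  done
}
```

For example, save the table with `nm ./99_MCFortran.out > symbols.txt` and run `missing_addresses user-functions.txt symbols.txt`. No output means every listed address exists in that binary.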

Extrae was newly compiled on the cluster, with binutils-2.42, libunwind-1.6.2, papi-7.1.0, and libxml2-2.9.9:

./configure --prefix=<folder> \
  --with-mpi=$I_MPI_ROOT \
  --with-mpi-libs=${I_MPI_ROOT}/lib/release \
  --with-binutils=<folder>/binutils_install \
  --with-papi=<folder>/PAPI_install \
  --with-unwind=<folder>/libunwind_install \
  --with-xml-prefix=<folder>/libxml2 \
  --without-dyninst

I don't remember the setup on my PC, but I think I didn't compile with PAPI there.

In my extrae_v230705.xml, dynamic-memory is not enabled:

<?xml version='1.0'?>

<trace enabled="yes"
 home="<folder>"
 initial-mode="detail"
 type="paraver"
>

  <mpi enabled="yes">
    <counters enabled="yes" />
    <comm-calls enabled="yes" />
  </mpi>

  <openmp enabled="yes" ompt="no">
    <locks enabled="no" />
    <taskloop enabled="no" />
    <counters enabled="yes" />
  </openmp>

  <pthread enabled="no">
    <locks enabled="no" />
    <counters enabled="yes" />
  </pthread>

  <callers enabled="no">
    <mpi enabled="yes">1-3</mpi>
    <sampling enabled="no">1-5</sampling>
    <dynamic-memory enabled="no">1-3</dynamic-memory>
    <input-output enabled="no">1-3</input-output>
    <syscall enabled="no">1-3</syscall>
  </callers>

  <user-functions enabled="yes"
    list="./user-functions.txt"
    exclude-automatic-functions="no">
    <counters enabled="yes" />
  </user-functions>

  <counters enabled="yes">
    <cpu enabled="yes" starting-set-distribution="1">
      <set enabled="yes" domain="all" changeat-time="0">
        PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
      </set>
    </cpu>
    <network enabled="yes" />
    <resource-usage enabled="yes" />
    <memory-usage enabled="yes" />
  </counters>

  <storage enabled="no">
    <trace-prefix enabled="yes">TRACE</trace-prefix>
    <size enabled="no">5</size>
    <temporal-directory enabled="yes">/scratch</temporal-directory>
    <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
  </storage>

  <buffer enabled="yes">
    <size enabled="yes">50000000</size>
    <circular enabled="no" />
  </buffer>

  <trace-control enabled="no">
    <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
    <global-ops enabled="no"></global-ops>
  </trace-control>

  <others enabled="no">
    <minimum-time enabled="no">10M</minimum-time>
    <finalize-on-signal enabled="yes" 
      SIGUSR1="no" SIGUSR2="no" SIGINT="yes"
      SIGQUIT="yes" SIGTERM="yes" SIGXCPU="yes"
      SIGFPE="yes" SIGSEGV="yes" SIGABRT="yes"
    />
    <flush-sampling-buffer-at-instrumentation-point enabled="yes" />
  </others>

  <bursts enabled="yes">
    <threshold enabled="yes">500u</threshold>
    <mpi-statistics enabled="yes" />
  </bursts>

  <sampling enabled="no" type="default" period="50m" variability="10m" />

  <dynamic-memory enabled="no">
    <alloc enabled="yes" threshold="32768" />
    <free  enabled="yes" />
  </dynamic-memory>

  <pebs-sampling enabled="no">
    <loads enabled="no" frequency="100" minimum-latency="10" />
    <stores enabled="no" frequency="50">
    <offcore-l3-misses enabled="no" /> <!-- Read together with stores samples. -->
    </stores>
    <load-l3-misses enabled="no" frequency="25" />
  </pebs-sampling>

  <input-output enabled="no" internals="no"/>

  <syscall enabled="no" />

  <merge enabled="no" 
    synchronization="default"
    tree-fan-out="16"
    max-memory="512"
    joint-states="yes"
    keep-mpits="yes"
    translate-addresses="yes"
    sort-addresses="yes"
    translate-data-addresses="yes"
    overwrite="yes"
  />

</trace>

My trace.sh:

#!/bin/bash

# echo to rank 0 only
echo_rank0 () {
  local msg=$1

  # get rank from various MPI implementations
  MPI_RANK=${MPI_RANK:=$PMI_RANK}
  MPI_RANK=${MPI_RANK:=$PMIX_RANK}
  MPI_RANK=${MPI_RANK:=$OMPI_COMM_WORLD_RANK}
  MPI_RANK=${MPI_RANK:=$ALPS_APP_PE}

  # test for rank 0
  if [[ $MPI_RANK = 0 ]]; then
    echo "$msg"
  fi

  # fallback if no rank at all, i.e. outside mpirun
  if [[ -z $MPI_RANK ]]; then
    echo "$msg"
  fi
}

source <folder>/etc/extrae.sh

export EXTRAE_CONFIG_FILE=extrae_v230705.xml
#export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so # For C apps
export LD_PRELOAD=<folder>/lib/libmpitracef.so # For Fortran apps

## Run the desired program ("$@" keeps arguments with spaces intact)
"$@"

echo_rank0 "### Extrae tracing"
echo_rank0 "# Config:  $EXTRAE_CONFIG_FILE"
echo_rank0 "# Library: $EXTRAE_LIB"
echo_rank0 "# Trace:   $TRACE_NAME"

Suiiiii commented 3 months ago

I recompiled Extrae with the latest version, and the user functions are back. Maybe there was some mistake in the environment.

But the warning is still there, and the program still terminates badly.

Welcome to Extrae 4.0.6
Extrae: Parsing the configuration file (extrae_v230705.xml) begins
Extrae: Tracing package is located on <folder>/extrae_install2
Extrae: Generating intermediate files for Paraver traces.
Extrae: MPI routines will collect HW counters information.
Extrae: All MPI_Comm_* calls will be traced.
Extrae: Number of user functions traced (XL runtime): 28
Extrae: Number of user functions traced (GCC runtime): 14
Extrae: User Function routines will NOT collect HW counters information.
Extrae: Warning! change-at-time time units not specified. Using seconds
Extrae: PAPI domain set to ALL for HWC set 1
Extrae: HWC set 1 contains following counters < PAPI_TOT_INS (0x80000032) PAPI_TOT_CYC (0x8000003b) PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) > - never changes
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 9, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 15, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 2, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 3, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 1, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 14, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 7, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 0, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 4, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 8, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 5, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 10, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 11, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 6, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 13, thread 0)
Extrae: Error! Hardware counter PERF_COUNT_HW_STALLED_CYCLES_BACKEND (0x40000026) cannot be added in set 1 (task 12, thread 0)
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
Extrae: Resource usage is enabled at flush buffer.
Extrae: Memory usage is enabled at flush buffer.
Extrae: Tracing buffer can hold 50000000 events
Extrae: Circular buffer disabled.
Extrae: Dynamic memory instrumentation is disabled.
Extrae: Basic I/O memory instrumentation is disabled.
Extrae: System calls instrumentation is disabled.
Extrae: Parsing the configuration file (extrae_v230705.xml) has ended
Extrae: Intermediate traces will be stored in <folder>
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 16 tasks and 1 threads

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
Extrae: Temporal directory (<folder>) is shared among processes.
Extrae: Final directory (<folder>) is shared among processes.
Extrae: Successfully initiated with 16 tasks and 1 threads

Box length Calculated:       19.8444462487654469
Box length Read from file:       19.8749259635630011
Box length is using the file value      19.8749259635630011      19.8749259635630011

This simulation is for the thermodynamic properties of [Ar]
molecule_M              =   0.039948 [kg/mole]
molecule_epsilon        = 143.120000 [K]
molecule_rnorm_an       =   3.357000 [A]
rstepinterpol           =   0.000420 [A]

With 2-Body setup : 

Number of atoms     =          108
box                 =  19.8749 [A]
rcut                =   9.9355 [A]
rcutmax_start       =  10.1068 [A]
rmin2b              =   1.8000 [A]
ru2b                =   1.6785 [A]

3-Body setup : 

rmin3b              =   2.2500 [A]
ru3b                =   2.2500 [A]
rstepmax3body       =      602
rstepinterpol3body  =   0.0132 [A]
  Simulation ist beendet
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447431000002000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447431000002000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447436000012000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447438000007000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447443000010000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447430000001000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447439000015000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447444000009000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447436000012000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447442000011000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447434000005000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447439000015000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447440000006000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447438000007000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447443000010000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447441000014000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447430000001000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447435000013000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447445000008000000.mpit
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447437000004000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447434000005000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447444000009000000.sym
WARNING: Negative value for MEMUSAGE_INUSE_EV detected (inuse=11567104+-1186443264-192880=-1175069040). Please submit a bug report.
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447441000014000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447442000011000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447440000006000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447435000013000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447432000000000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447445000008000000.sym
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447437000004000000.sym
Extrae: Intermediate raw trace file created : <folder>/run/set-0/TRACE@node0009.0002447433000003000000.mpit
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447432000000000000.sym
Extrae: Deallocating memory.
Extrae: Intermediate raw sym file created : <folder>/run/set-0/TRACE@node0009.0002447433000003000000.sym
Extrae: Application has ended. Tracing has been terminated.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2447366 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 2447367 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 2447368 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 2447369 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 2447370 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 5 PID 2447371 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 2447372 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 2447373 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 8 PID 2447374 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 9 PID 2447375 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 2447376 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 12 PID 2447378 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 13 PID 2447379 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 14 PID 2447380 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 15 PID 2447381 RUNNING AT node0009
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
(base) [suim@node0009 run]$ <folder>/bin/mpi2prv -f TRACE.mpits -e ./99_MCFortran.out -o "${parent_dir_name}.prv";
mpi2prv: Retrieving hardware counters definitions for ptask 1 from global SYM.
mpi2prv: A total of 3 symbols were imported from TRACE.sym file
mpi2prv: 0 function symbols imported
mpi2prv: 3 HWC counter descriptions imported
merger: Output trace format is: Paraver
merger: Extrae 4.0.6
mpi2prv: Assigned nodes < node0009 >
mpi2prv: Assigned size per processor < 1 Mbytes >
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447432000000000000.mpit is object 1.1.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447430000001000000.mpit is object 1.2.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447431000002000000.mpit is object 1.3.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447433000003000000.mpit is object 1.4.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447437000004000000.mpit is object 1.5.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447434000005000000.mpit is object 1.6.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447440000006000000.mpit is object 1.7.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447438000007000000.mpit is object 1.8.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447445000008000000.mpit is object 1.9.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447444000009000000.mpit is object 1.10.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447443000010000000.mpit is object 1.11.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447442000011000000.mpit is object 1.12.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447436000012000000.mpit is object 1.13.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447435000013000000.mpit is object 1.14.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447441000014000000.mpit is object 1.15.1 on node node0009 assigned to processor 0
mpi2prv: File <folder>/run/set-0/TRACE@node0009.0002447439000015000000.mpit is object 1.16.1 on node node0009 assigned to processor 0
mpi2prv: Time synchronization has been turned off
mpi2prv: Checking for target directory existence... exists, ok!
mpi2prv: Selected output trace format is Paraver
mpi2prv: Stored trace format is Paraver
mpi2prv: Enabling Time Synchronization (Node).
mpi2prv: Circular buffer enabled at tracing time? NO
mpi2prv: Parsing intermediate files
mpi2prv: Progress 1 of 2 ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Processor 0 succeeded to translate its assigned files
mpi2prv: Elapsed time translating files: 0 hours 0 minutes 0 seconds
mpi2prv: Elapsed time sorting addresses: 0 hours 0 minutes 0 seconds
mpi2prv: Generating tracefile (intermediate buffers of 419424 events)
         This process can take a while. Please, be patient.
mpi2prv: Progress 2 of 2 ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Elapsed time merge step: 0 hours 0 minutes 0 seconds
mpi2prv: Resulting tracefile occupies 1035118 bytes
mpi2prv: Removing temporal files... done
mpi2prv: Elapsed time removing temporal files: 0 hours 0 minutes 0 seconds
mpi2prv: Congratulations! xxxx.prv has been generated.

emercadal commented 3 months ago

The warning for a negative value in the MEMUSAGE_INUSE_EV is not related to the user functions, as you may have already guessed. The warning can be removed by setting the resource-usage and memory-usage options under the "counters" section in the .xml file to "no".
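
One way to apply that change from the command line, as a sketch. `disable_usage_counters` is a hypothetical helper name, and the pattern assumes the two elements are written as in the configuration posted earlier in this thread (`<resource-usage enabled="yes" />` and `<memory-usage enabled="yes" />`). Editing the file by hand works just as well.

```shell
#!/bin/bash
# Hypothetical helper: flip enabled="yes" to enabled="no" on the
# <resource-usage> and <memory-usage> elements of an Extrae XML file.
# A backup of the original is left next to it with a .bak suffix.
disable_usage_counters () {
  local xml=$1
  sed -i.bak -E \
    's/(<(resource-usage|memory-usage)[[:space:]]+enabled=")yes(")/\1no\3/' \
    "$xml"
}

# Demo on a scratch copy of the two relevant lines.
demo=$(mktemp)
printf '%s\n' \
  '    <resource-usage enabled="yes" />' \
  '    <memory-usage enabled="yes" />' > "$demo"
disable_usage_counters "$demo"
cat "$demo"
```

On the real file this would be `disable_usage_counters extrae_v230705.xml`; the other `enabled` attributes in the configuration are left untouched.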

Regarding the other issue, to understand what is generating the BAD TERMINATION error in your execution we'll need to look at how you are running the application, commands and environment variables you are setting and any helper script you may be using.
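
A sketch of how that information could be gathered in one go. `dump_trace_env` is a hypothetical helper, and the variable prefixes it filters on (`EXTRAE_`, `I_MPI_`, `PMI_`, `LD_PRELOAD`, `LD_LIBRARY_PATH`) are only a guess at what is relevant here, so adjust them as needed.

```shell
#!/bin/bash
# Hypothetical helper: print the environment variables most likely to
# matter for an Extrae + MPI run, sorted, one per line.
dump_trace_env () {
  env | grep -E '^(EXTRAE_|I_MPI_|PMI_|LD_PRELOAD=|LD_LIBRARY_PATH=)' | sort
  return 0   # an empty result is not an error
}

# Example: record the settings as the wrapper script would see them.
export EXTRAE_CONFIG_FILE=extrae_v230705.xml
dump_trace_env
```

Calling it from the wrapper script, just before the application line, captures what each rank actually sees under `mpirun`.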

Suiiiii commented 3 months ago

Hi, the warning is gone after applying your suggestion.

The BAD TERMINATION only happens on the cluster, when running with Extrae.

Compile flags: -g -xCORE-AVX512 -qopenmp-simd -qopt-prefetch=2 -qopt-zmm-usage=high -fno-fnalias -finstrument-functions

Execution: mpirun -np 16 ./trace_nother.sh ./99_MCFortran.out

I can send you the environment variables by email.