LLNL / Caliper

Caliper is an instrumentation and performance profiling library
http://software.llnl.gov/Caliper/
BSD 3-Clause "New" or "Revised" License
343 stars 63 forks source link

Region stack mismatch error #545

Closed cwschilly closed 5 months ago

cwschilly commented 5 months ago

Hello,

I am trying to get a basic count of the MPI calls in an app that uses Kokkos.

Since the app uses Kokkos, I set the KOKKOS_TOOLS_LIBS env variable to point to libcaliper.so and did not add any additional Caliper annotations to the source code.

To get the MPI counts, I set the CALI_CONFIG env variable to mpi-report. I have also tried setting CALI_CONFIG="runtime-report(mpi.message.count)".

When I run the app with either of these configs, I get these warnings (one per process):

 == CALIPER: (1): Region stack mismatch: Trying to end
    "mpi.call.id"
  but region stack is empty!
  Ceasing region tracking!
== CALIPER: (2): Region stack mismatch: Trying to end
    "mpi.call.id"
  but region stack is empty!
  Ceasing region tracking!
== CALIPER: (3): Region stack mismatch: Trying to end
    "mpi.call.id"
  but region stack is empty!
  Ceasing region tracking!
== CALIPER: (0): Region stack mismatch: Trying to end
    "mpi.call.id"
  but region stack is empty!

Caliper still outputs a report at the end of the run, but the MPI counts are not complete (I know that there should be more counts than are present). I suspect that this Region stack mismatch causes Caliper to not be able to profile in some regions.

In what contexts does this warning show up? And how can I resolve it?

daboehme commented 5 months ago

Hi @cwschilly, thanks for the report. I've seen this happen if an MPI call internally invokes other MPI API functions, which they're not really supposed to do. Which MPI library version are you using?

I can work around it but it'll require a fix in Caliper.

cwschilly commented 5 months ago

Thanks for the response @daboehme, I'm using OpenMPI (v4.1.5).

cwschilly commented 5 months ago

Hi @daboehme , I also get this error when I use runtime-report; any idea what might cause this?

cali-query: TreeFormatter: Attribute "min#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "max#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "avg#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "percent_total#sum#sum#time.duration" not found.
daboehme commented 5 months ago

Hm that's interesting. Is this with the mpi.message.count option? If so can you try it without? Otherwise what Caliper version are you using?

cwschilly commented 5 months ago

@daboehme No, it's with no options selected: export CALI_CONFIG=runtime-report.

I'm using Caliper v2.9.1. I tried with 2.10 but all the same warnings are present.

cwschilly commented 5 months ago

Thanks @daboehme, that resolved my MPI issues.

daboehme commented 5 months ago

Thanks @daboehme, that resolved my MPI issues.

Great, thanks for letting me know! Did it fix the TreeFormatter errors as well? These are a bit strange. I think this could happen if either there is no data at all or some weird Caliper version mismatch.

cwschilly commented 5 months ago

@daboehme The TreeFormatter errors are still there.

daboehme commented 5 months ago

@daboehme The TreeFormatter errors are still there.

Hm, okay. Do you get any output at all? Can you run with CALI_LOG_VERBOSITY=1 set and see what that reports? It should be saying something like == CALIPER: runtime-report: Aggregate: flushed <N> snapshots. If that's 0 it then it somehow didn't process through any regions annotations at runtime.

cwschilly commented 5 months ago

Do you get any output at all?

No output other than the TreeFormatter error.

Can you run with CALI_LOG_VERBOSITY=1 set and see what that reports? It should be saying something like == CALIPER: runtime-report: Aggregate: flushed <N> snapshots. If that's 0 it then it somehow didn't process through any regions annotations at runtime.

You're right, it shows == CALIPER: (0): runtime-report: Aggregate: flushed 0 snapshots.

Here is all Caliper-related output:

== CALIPER: (0): Initialized
== CALIPER: (0): Creating channel default
== CALIPER: (0): default: No services enabled, default channel will not record data.
== CALIPER: (0): Creating channel builtin.configmgr
== CALIPER: (0): builtin.configmgr: Registered MPI service
== CALIPER: (0): builtin.configmgr: Registered mpiflush service
== CALIPER: (0): Creating channel runtime-report
== CALIPER: (0): runtime-report: Registered aggregation service
== CALIPER: (0): runtime-report: Registered event trigger service
== CALIPER: (0): runtime-report: Registered MPI service
== CALIPER: (0): runtime-report: Registered mpireport service
== CALIPER: (0): runtime-report: Registered timer service
== CALIPER: (0): Registered builtin ConfigManager

---- runs the application ----

== CALIPER: (0): builtin.configmgr: Flushing Caliper data
== CALIPER: (0): runtime-report: Flushing Caliper data
== CALIPER: (0): runtime-report: Aggregate: flushed 0 snapshots.
cali-query: TreeFormatter: Attribute "min#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "max#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "avg#sum#sum#time.duration" not found.
cali-query: TreeFormatter: Attribute "percent_total#sum#sum#time.duration" not found.
Path 
== CALIPER: (0): Finalizing ... 
== CALIPER: (0): default: Flushing Caliper data
== CALIPER: (0): Releasing channel default
== CALIPER: (0): builtin.configmgr: Flushing Caliper data
== CALIPER: (0): Releasing channel builtin.configmgr
== CALIPER: (0): Releasing channel runtime-report
== CALIPER: (0): Finished

This is with export CALI_CONFIG=runtime-report.

daboehme commented 5 months ago

Hm, that's interesting. Looks like it's simply not capturing anything at runtime. Are there any Caliper annotations in the code that you'd expect to see? If there are MPI calls you can try runtime-report,profile.mpi again so at least they should show up.

cwschilly commented 5 months ago

Ah, probably a misunderstanding on my part as to what runtime-report captures by default. I haven't added any specific notations, so it makes sense that there's nothing to capture. profile.mpi and profile.kokkos both work perfectly. Thanks for your help!