darshan-hpc / darshan

Darshan I/O characterization tool

Darshan measurement fails on MPI_Comm_free #994

Closed bellenlau closed 4 months ago

bellenlau commented 4 months ago

Hello,

I am using Darshan/3.4.4 (runtime) to instrument an MPI application based on intel-oneapi-mpi/2021.10.0. The measurement fails with the following error

Abort(873021445) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: Invalid communicator, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0xe0dbc60) failed
PMPI_Comm_free(83).: Invalid communicator

I have installed Darshan with Spack as darshan-runtime+hdf5+parallel-netcdf; the profiled code runs without issues when not instrumented, and the Darshan install works with simpler codes. I am wondering whether the MPI_Comm_free API is supported or not?

Thank you,

Laura

shanedsnyder commented 4 months ago

Are you sure both the application and darshan-runtime are built against the same MPI (intel-oneapi-mpi/2021.10.0)? I don't think I've tested this particular MPI yet, but generally speaking we like to ensure that Darshan and the application are using the same MPI implementation.

What system is this on? Is it possible for you to share some code that demonstrates the problem so that I could try to reproduce and debug myself?

Darshan shouldn't be affected at all by application usage of MPI communicators, so I think something else weird is going on.
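One quick way to check for that kind of MPI mismatch (a sketch; `./my_app` and the `libdarshan.so` path below are placeholders for your own binary and darshan-runtime install) is to compare the MPI shared library each one resolves:

```shell
# Inspect which MPI shared library the application resolves at load time
ldd ./my_app | grep -i libmpi

# Do the same for the Darshan runtime library (the one typically LD_PRELOADed)
ldd /path/to/darshan-runtime/lib/libdarshan.so | grep -i libmpi

# The two libmpi paths should point at the same installation; different
# paths (e.g. different Spack hashes in the paths) indicate a mismatch.
```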

bellenlau commented 4 months ago

Hello, you were right. I noticed that the hash of the intel-oneapi-mpi installation used for the instrumented application was not the same as the hash of the intel-oneapi-mpi installation used for Darshan. Thank you.
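For anyone hitting the same problem, the dependency hashes can be compared directly with Spack. A minimal sketch (the spec names follow this thread; the rebuild line is an assumption about how the application is installed):

```shell
# List every installed intel-oneapi-mpi instance with its hash
spack find -l intel-oneapi-mpi

# Show the dependencies (including the MPI hash) darshan-runtime was built with
spack find -ld darshan-runtime

# Rebuild the application against that exact MPI instance, e.g.
#   spack install <app-spec> ^intel-oneapi-mpi/<hash>
```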