ExtremeFLOW / neko

/ᐠ. 。.ᐟ\ᵐᵉᵒʷˎˊ˗
https://neko.cfd/

unable to trace MPI calls using Neko v0.6.1 #1214

Open sumuzhe317 opened 3 months ago

sumuzhe317 commented 3 months ago

Hi, I'm using Neko v0.6.1, and I want to trace the MPI calls to find out the time spent and the frequency of MPI calls.

However, I have tried several tools, including IPM, AMDuProf and Intel-oneAPI-itac, and none of them works with Neko. Before applying them to Neko, I wrote some demo codes to test the tools; AMDuProf and Intel-oneAPI-itac both work for the demo codes. I also tested AMDuProf and Intel-oneAPI-itac on quantum-espresso, and both work there as well.

Could you give me some advice on how to track down the error? Thanks.

More information:

  1. I tested AMDuProf on Neko built with OpenMPI and MKL. The run command is "which mpirun -n 2 -host l011:2 -x PATH -x LD_LIBRARY_PATH -x LIBRARY_PATH AMDuProfCLI collect --trace mpi=full,openmpi -o ./ ./neko ./tgv.case". Although I varied the number of processes and the time step in the input, it did not work in any configuration.
  2. I tested Intel-oneAPI-itac on Neko built with IntelMPI (icx, icpx, ifx) and MKL. When I run Neko with "mpirun -trace -n 2 ./neko ./tgv.case", the problem appears to be a memory leak. When I run it with "mpirun -trace -n 1 ./neko ./tgv.case", it works but does not produce any trace information.

njansson commented 3 months ago

Neko uses 'mpi_f08', which seems to confuse certain profilers. As a first step, I would suggest trying the tools on your smaller examples using the f08 MPI module.

sumuzhe317 commented 3 months ago

Thanks for your response. Following your suggestion, I compiled several small examples and found that 'use mpi_f08' may be the cause of my error. I wrote a small hello-world code using MPI; after replacing 'use mpi' with 'use mpi_f08', AMDuProf no longer works.

My next question: if I want to keep using AMDuProf, can I simply replace 'use mpi_f08' with 'use mpi' in Neko? I tried it and ran into some errors, which I am now trying to fix.
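
For reference, a minimal reproducer along these lines could look like the sketch below. It is not the original test code: the program name and the choice of MPI calls are illustrative assumptions. The error argument is passed explicitly so that the only change needed to compare the two bindings is the 'use' line.

```fortran
! Minimal MPI hello world using the Fortran 2008 bindings.
! Hypothetical reproducer; change 'use mpi_f08' to 'use mpi' to compare
! how a profiler behaves with the two Fortran interfaces.
program hello_f08
  use mpi_f08
  implicit none
  integer :: rank, nprocs, total, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! One collective call so that an MPI trace has something to record.
  call MPI_Allreduce(rank, total, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)

  print '(a,i0,a,i0,a,i0)', 'rank ', rank, ' of ', nprocs, &
       ', sum of ranks = ', total

  call MPI_Finalize(ierr)
end program hello_f08
```

Assuming the MPI compiler wrapper is called mpifort, this can be built with "mpifort hello_f08.f90 -o hello_f08" and then launched under the profiler in the same way as the Neko runs above.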

njansson commented 3 months ago

In principle yes, but it would require quite a few changes in the I/O routines. For testing, we do have a branch (https://github.com/ExtremeFLOW/neko/tree/fix/legacy_usempi) that uses 'use mpi'. It more or less follows develop and is therefore quite different from v0.6.1, but it can probably be used as a reference.
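
To illustrate the kind of change involved, here is a small MPI-IO sketch written against the legacy 'use mpi' module. It is not taken from Neko or from the branch above; the program name, file name and buffer are made up for illustration. The comments mark the declarations that differ under 'use mpi_f08'.

```fortran
! Hypothetical example: each rank writes four values to a shared file
! using the legacy 'use mpi' interface.
program write_legacy
  use mpi
  implicit none
  integer :: fh                          ! mpi_f08: type(MPI_File) :: fh
  integer :: status(MPI_STATUS_SIZE)     ! mpi_f08: type(MPI_Status) :: status
  integer :: rank, ierr                  ! ierr is mandatory with 'use mpi'
  integer(kind=MPI_OFFSET_KIND) :: offset
  double precision :: buf(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  buf = dble(rank)
  ! Byte offset of this rank's block in the file.
  offset = int(rank, MPI_OFFSET_KIND) * size(buf) * (storage_size(buf) / 8)

  call MPI_File_open(MPI_COMM_WORLD, 'out.bin', &
       ior(MPI_MODE_WRONLY, MPI_MODE_CREATE), MPI_INFO_NULL, fh, ierr)
  call MPI_File_write_at_all(fh, offset, buf, size(buf), &
       MPI_DOUBLE_PRECISION, status, ierr)
  call MPI_File_close(fh, ierr)

  call MPI_Finalize(ierr)
end program write_legacy
```

With 'use mpi' the handles and the status are plain integers, whereas 'mpi_f08' uses derived types, so every routine that declares or passes them has to be adapted consistently.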

sumuzhe317 commented 3 months ago

Thanks! I will try to follow your suggestions.

sumuzhe317 commented 3 months ago

@njansson I have tried more tools, for example TAU (https://www.cs.uoregon.edu/research/tau/home.php).

It doesn't work either. I have emailed AMD to ask for technical support.