RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

behavior of likwid-mpirun -np option #91

Open aaronknister opened 7 years ago

aaronknister commented 7 years ago

If I run likwid-mpirun with the -np option, nothing seems to happen. I've traced it down to the elseif block around line 1857. It seems as though ppn isn't getting set to a value the program finds reasonable, and I'm struggling to understand the logic in the code:

    if ppn == 0 then
        ppn = 1
    end
    if ppn > maxppn and np > maxppn then
        ppn = maxppn
    elseif np < maxppn then
        ppn = np
    elseif maxppn == np then
        ppn = maxppn
    end

Is it possible that

if ppn > maxppn and np > maxppn then

ought to be this?

if ppn > maxppn and np > maxnp then

I'm also confused about the other elseif branches. Could you help me understand the intent there?

Thanks!

-Aaron

aaronknister commented 7 years ago

I forgot to add that I'm experiencing this when using the -g option. -np works without the -g option; however, it does print this warning (because ppn is initially set to 0):

WARN: Processes cannot be equally distributed
WARN: You want 96 processes on 4 hosts with 1 per host.
WARN: Sanitizing number of processes per node to 24

I'm thinking it ought to infer a default value of ppn based on the information provided by the scheduler. This is what I was trying to implement earlier but I didn't feel I could without understanding what that block is supposed to do.
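For what it's worth, here is a minimal sketch of what I have in mind, assuming SLURM's SLURM_NTASKS_PER_NODE and SLURM_JOB_NUM_NODES variables (the variable names are my assumption, and this is untested against likwid-mpirun):

    -- Sketch: derive a default ppn from the SLURM environment when -np is
    -- given without an explicit processes-per-node value.
    local function guess_ppn(np)
        -- SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested.
        local per_node = os.getenv("SLURM_NTASKS_PER_NODE")
        if per_node ~= nil then
            return tonumber(per_node)
        end
        -- Otherwise spread np over the allocated nodes as evenly as possible.
        local nnodes = tonumber(os.getenv("SLURM_JOB_NUM_NODES") or "1")
        return math.ceil(np / nnodes)
    end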

TomTheBear commented 7 years ago

Hi Aaron,

I'm currently not able to test it because I'm on my way back from a LIKWID tutorial and the internet connection on trains isn't stable. I added some comments to the if statements; I hope this clarifies what is happening here.

    -- If the requested processes per node (ppn) exceed the available slots on a host
    -- (ppn > maxppn) and the total number of processes requires multiple hosts
    -- (np > maxppn), sanitize ppn to the number of slots available on each host.
    if ppn > maxppn and np > maxppn then
        ppn = maxppn
    -- If all processes fit on a single host, use only a single host with np processes.
    elseif np < maxppn then
        ppn = np
    -- If the processes fit exactly on one host, use all slots. It should be possible
    -- to merge this into the previous elseif by using np <= maxppn.
    elseif maxppn == np then
        ppn = maxppn
    end
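For reference, the merged form mentioned in the last comment would look like this (just a sketch, not what likwid-mpirun currently contains):

    if ppn > maxppn and np > maxppn then
        ppn = maxppn
    elseif np <= maxppn then
        ppn = np
    end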

I'll check likwid-mpirun again when I'm back at the office. If I remember correctly, not all job schedulers provide information on how many processes should be run on each node, so I had to determine it like this. SLURM has an environment variable for that, so likwid-mpirun should use it.

aaronknister commented 7 years ago

Thanks @TomTheBear! That's exactly what I needed. I'll keep looking at this too and let you know what I come up with.

aaronknister commented 7 years ago

@TomTheBear I'm finally getting back to working on likwid-mpirun and making it work for us with SLURM. The issue I'm running into is, I think, the amount of information likwid-mpirun currently needs to know about each scheduler and MPI implementation combination in order to launch tasks in the desired layout.

For example, if I have an asymmetric allocation (someone requests an odd number of tasks and, say, 28 end up on the first node and 27 on another), then this doesn't work:

likwid-mpirun -g ENERGY -np 55 ./mpi_hello

I get an error about processes being unequally distributed and it also seems to think I'm asking for 1 task per host.

I wonder whether, rather than having likwid-mpirun understand the subtleties of each scheduler and MPI combination well enough to launch a job with a given layout, we couldn't instead interpose likwid-mpirun (or a new script) between the MPI implementation and the MPI task itself, e.g.:

srun likwid-mpi -g ENERGY -nperdomain S:14 ./mpi_hello

Then all likwid-mpi would need to understand is how to identify its relative rank within the job (usually by reading some environment variables), rather than how to launch a desired layout; it becomes the user's responsibility to launch the tasks properly. This is similar, I think, to how some of SGI's placement tools work (https://www.nas.nasa.gov/hecc/support/kb/using-sgi-omplace-for-pinning_287.html) as well as Intel VTune (https://software.intel.com/en-us/node/544016).
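As an illustration, here is a minimal sketch of that rank detection in Lua, assuming the environment variables commonly set by launchers (SLURM_PROCID, OMPI_COMM_WORLD_RANK, PMI_RANK, PMIX_RANK); the variable list is my assumption, not existing LIKWID code:

    -- Sketch: determine this process's global rank from environment variables
    -- commonly exported by MPI launchers; returns nil if none of them is set.
    local function get_rank()
        local candidates = {"SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK", "PMIX_RANK"}
        for _, name in ipairs(candidates) do
            local value = os.getenv(name)
            if value ~= nil then
                return tonumber(value)
            end
        end
        return nil
    end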

Then the question becomes how to capture output and summarize it the way likwid-mpirun does. Perhaps likwid-mpi would require one to specify a results directory (which in the batch script could be made unique), e.g.:

srun likwid-mpi -r <results_dir> -g ENERGY -nperdomain S:14 ./mpi_hello

which after the fact could be summarized with the logic from likwid-mpirun using a separate tool (perhaps called likwid-mpi-report?).

I'm willing to implement this, but I'd like your concurrence before I do anything :)

-Aaron

aaronknister commented 7 years ago

@TomTheBear just wondering if you've had a chance to think this over?

TomTheBear commented 7 years ago

Hi, I have been thinking about this, yes. I fully understand your approach, and basically I don't have a problem with splitting the script into multiple parts.

What I didn't get is how you want to do the pinning of MPI processes and possibly threads. If you call srun likwid-mpi ..., do you have full control over the pinning, the number of processes started, etc.? The other tools don't seem to handle the distribution of MPI processes to hosts. It took me quite some time to set up the SLURM support in likwid-mpirun, and the implemented support is the only one I could find covering all features. If we have full control over that, we can do it as you proposed. If not, I would suggest keeping likwid-mpirun but stripping it down to do only the pinning/node selection and forwarding all further options to an interceptor script likwid-mpi, which does the likwid-perfctr and remaining pinning work. Furthermore, likwid-mpirun could be extended to take a folder and report the measurements from it, if a user uses likwid-mpi directly.

TomTheBear commented 4 years ago

This won't be part of the upcoming release.

But basically, that's:

srun likwid-perfctr -o /tmp/output_%h_%r.txt -g X <exec>

and then a script that reads all output files (the final step of likwid-mpirun). %h and %r are substituted with the hostname and the MPI rank.
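For illustration, a minimal sketch of such a collection script in Lua; the /tmp/output_*.txt pattern is taken from the example command above, and this is not an existing LIKWID tool:

    -- Sketch: gather the per-rank output files written by likwid-perfctr and
    -- print them one after another, as a starting point for a summary step.
    local files = {}
    local pipe = io.popen("ls /tmp/output_*.txt 2>/dev/null")
    for line in pipe:lines() do
        table.insert(files, line)
    end
    pipe:close()
    table.sort(files)
    for _, path in ipairs(files) do
        print("==== " .. path .. " ====")
        local f = io.open(path, "r")
        if f then
            print(f:read("*a"))
            f:close()
        end
    end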