OceanParcels / Parcels

Main code for Parcels (Probably A Really Computationally Efficient Lagrangian Simulator)
https://www.oceanparcels.org
MIT License

MPI Run not splitting particles #810

Closed: olmozavala closed this issue 4 years ago

olmozavala commented 4 years ago

Hello, I'm running a model of ocean litter and, because it takes days to finish, I am trying to use MPI. After some tests, I realized that the time it takes to finish doesn't improve when running in parallel.

I see that several processes are launched, but I'm timing each run and the performance doesn't improve (sometimes it even gets worse). Here are some timings I get by running a test example with multiple processors or with just one:

------- With diffusion ---------

Run for 1 day, 10 proc (times per chunksize):

| Auto | 1024 | 2048 |
|------|------|------|
| 330  | 445  | 346  |
| 325  | 495  | 346  |
| 325  | 480  | 347  |
| 326  | 500  | 354  |
| ...  | ...  | ...  |

Run for 1 day, 1 proc:

| Auto | 1024 | 2048 |
|------|------|------|
| 240  | 268  | 275  |

------- Without diffusion ---------

Run for 1 day, 3 proc:

| Auto | 1024 |
|------|------|
| 288  | 289  |
| 289  | 287  |
| 299  | 284  |

Run for 1 day, 1 proc:

| Auto | 1024 | 2048 |
|------|------|------|
| 250  | 281  | 263  |

Do I have to do something besides running it with `mpirun -np # MYFILE.py`?

I have also been playing with the chunksize because I was also running out of memory on the HPC (which is mentioned in another issue). On the HPC I have Parcels 2.1.5 and at home I have 2.1.4.

A second problem is that saving the trajectories to a file with out_parc_file.export() (which works perfectly with a single proc) fails under MPI with 'permission denied' (multiple processes try to save the same file at the same time). Is this the right way to save the results when running in parallel?

My code is at https://github.com/olmozavala/ocean_litter and the test file is T_WorldLItter.py. If you are interested in running it I can make an effort to build a self-contained test.

I'd really appreciate your help. Thanks!

olmozavala commented 4 years ago

Any update? I really need help here. Is there an easy way to verify that the particles are being split as they should be (k-means clustering per processor)? Thank you very much in advance.

angus-g commented 4 years ago

Do you have mpi4py installed? Parcels won't ensure this is the case, even if you're running under MPI.

You could print pset.size on each rank and check that it sums to the number of particles you expect to be spawned, and that the particles are distributed across the processes.
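
For example, a minimal sketch of that check (assuming the Parcels 2.x API, with a throwaway zero-velocity FieldSet built only so a ParticleSet can be constructed) could be:

```python
import numpy as np
from mpi4py import MPI  # fails immediately if mpi4py is missing
from parcels import FieldSet, ParticleSet, JITParticle

# tiny idealised fieldset, used only to be able to construct a ParticleSet
dims = {"lon": np.linspace(0., 1., 10), "lat": np.linspace(0., 1., 10)}
data = {"U": np.zeros((10, 10)), "V": np.zeros((10, 10))}
fieldset = FieldSet.from_data(data, dims)

npart = 1000
pset = ParticleSet(fieldset=fieldset, pclass=JITParticle,
                   lon=np.random.rand(npart), lat=np.random.rand(npart))

rank = MPI.COMM_WORLD.Get_rank()
print("rank %d holds %d of %d particles" % (rank, pset.size, npart))
```

Running it with something like `mpirun -np 4 python check_split.py` (the filename is just a placeholder) should show each rank holding roughly a quarter of the particles; if every rank reports the full 1000, the set is not being split.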

erikvansebille commented 4 years ago

Hi @olmozavala, sorry for the delay in responding. Another way to check whether Parcels runs in parallel is to look at the sizes of the output files in the temporary directory. Each core of the mpirun writes to its own subdirectory in out-*/, so if you have 2 cores you will see the subdirectories out-*/0/ and out-*/1/. Since each core should write only half the particles, the files in these directories should be about half the size of those in the single-processor run. If they are the same size as in the single-processor run, that is a clear sign that MPI is not working.
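
A quick way to compare those file sizes, as a sketch only (it assumes you start Python from the directory that contains the temporary out-* folder), is:

```python
import glob
import os

# total file size per rank-specific subdirectory of the temporary out-* folder
for rankdir in sorted(glob.glob("out-*/*/")):
    files = [os.path.join(rankdir, f) for f in os.listdir(rankdir)]
    nbytes = sum(os.path.getsize(f) for f in files if os.path.isfile(f))
    print("%s: %.1f MB" % (rankdir, nbytes / 1e6))
```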

Note that mpi-support in Parcels is still somewhat 'under development', so it might not work as smoothly as we hope. But please let us know how you're going, we'll try to be more responsive!

olmozavala commented 4 years ago

Thank you very much @angus-g and @erikvansebille. mpi4py is installed on the HPC but somehow was not installed on my local machine. I tested again from scratch on my local PC and on the HPC, and I do see the particles being split on my machine, but not on the HPC. I even added an explicit import of mpi4py to be sure it is installed. Here is the output from the HPC (I get the same warnings on my local machine, but on the HPC each process uses the total number of particles, 32300, and crashes when trying to save the output):

[screenshots of HPC output]

olmozavala commented 4 years ago

I made a simpler test case that you can find here https://github.com/olmozavala/ocean_litter/blob/master/SimplestTest.py

The result on my personal computer is as expected: particles are grouped by proc.

[screenshot of local result]

On the HPC I have tried different mpirun options, but I am not able to make it work. Does anyone have experience with submitting these jobs to a cluster?

[screenshot of HPC result]

angus-g commented 4 years ago

I wonder whether mpi4py has perhaps not been built correctly against your HPC's MPI implementation? I'm not sure how you installed it, but there are instructions in their documentation that show how to make sure you're specifying the right MPI compilers, etc.
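
One quick way to check, as a sketch, is to ask mpi4py which MPI library it was built against and compare that with the MPI module you load on the cluster:

```python
from mpi4py import MPI

# these should report the HPC system's MPI implementation (e.g. the one
# provided by your `module load ...`), not a generic pip-bundled build
print(MPI.get_vendor())
print(MPI.Get_library_version())
```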

erikvansebille commented 4 years ago

Hmm, very strange. @angus-g's suggestion of checking the mpi4py installation is a good one. Another option could be that kmeans is not installed on the HPC? Parcels needs kmeans for the partitioning, see https://github.com/OceanParcels/parcels/blob/master/parcels/particleset.py#L175
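
A quick check for that (assuming, as the linked particleset.py suggests, that the k-means partitioning comes from scikit-learn) would be:

```python
try:
    # used by Parcels to partition particles over MPI ranks
    from sklearn.cluster import KMeans
    print("KMeans available:", KMeans)
except ImportError:
    print("scikit-learn is not installed; Parcels cannot partition particles under MPI")
```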

angus-g commented 4 years ago

> Another option could be that kmeans is not installed on the HPC

That should raise EnvironmentError though, right? https://github.com/OceanParcels/parcels/blob/master/parcels/particleset.py#L27-L32

erikvansebille commented 4 years ago

> That should raise EnvironmentError though, right?

True, but still worth checking as a last resort?

CKehl commented 4 years ago

There is another point: scheduling systems. On some systems (e.g. Cartesius), MPI is fully controlled by the cluster-wide scheduler, such as SLURM or SGE. If the environment is set up to only allow multi-processing within the scheduler, then running via mpiexec will not get you anywhere.

I will create a very small, very quick mpi4py test script that just prints out how many processors are used. If that script works, then we need to investigate further. If it fails, then the problem is with starting an MPI-based script on your cluster.

For more information on the subject, see the guide: https://slurm.schedmd.com/mpi_guide.html

It describes the different MPI/SLURM setup modes. If the cluster is set up in mode 1 (which is common), then mpirun or mpiexec won't do much: they would just spawn N separate jobs, with no communication and no resource sharing.

In other words: don't use mpiexec. Create a shell script for sbatch and run the Python program via srun. The number of nodes N is then part of the shell script or of the sbatch command, not of the mpiexec/srun invocation.

CKehl commented 4 years ago

The attached file contains a Python test script for MPI that simply accumulates the PE ids in an MPI manner. In addition, there are two shell scripts: one that invokes the script via mpiexec, and one that does it via SLURM. Locally, the mpiexec shell script should succeed, whereas on your HPC cluster I assume it will not output the expected accumulation. On the HPC cluster, on the other hand, the SLURM shell script should give the expected output. Looking forward to your feedback.

test_MPI.zip
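
For readers without the attachment, a minimal sketch of an equivalent accumulation test, assuming only mpi4py, could look like this:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# every PE contributes its own id; rank 0 checks the accumulated sum
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    expected = size * (size - 1) // 2
    print("ranks: %d, accumulated ids: %d (expected %d)" % (size, total, expected))
```

With, say, 4 properly communicating processes it should print 4 ranks and an accumulated sum of 6; if the launcher merely spawns independent jobs, each process prints a single rank with a sum of 0.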

olmozavala commented 4 years ago

This is awesome, thank you very much for the help.

You found the error: I was running my script through a shell script for sbatch, but I was still using mpirun inside it rather than srun. Switching to srun has solved the problem.

[screenshot of working run]

Thank you very much for all your help.