Thanks for reporting, @christensen5. However, I tried on four different systems (laptop, desktop, and two different HPC systems), and I can't replicate this error on any of them.
Some steps you could take are:
1) Check on another system: can you replicate the error there?
2) What does the content of the temporary file directory look like? Can you report that here (using e.g. the tree command)?
3) A bit of a long shot, but what happens if you run with the #678 Pull Request? That PR includes some changes to ParticleFile handling, so perhaps it fixes this (the PR works on Linux/Mac, but not on Windows yet).
I look forward to hearing the results of these steps!
Thanks Erik for the suggestions - it does indeed seem this is an issue with my own laptop. Running example_stommel.py
in MPI mode on Imperial's CX1 seems to work fine.
What does the content of the temporary file directory look like? Can you report that here (using e.g. the tree command)?
I've attached a .txt file containing the output of tree out-NFFJUAUO (the temp directory created by my 2-core MPI Stommel run).
A bit of a long shot, but what happens if you run with the #678 Pull Request?
I'll give this a try once I'm done with some ongoing runs to profile the memory usage of my microbe simulation!
EDIT: The original .txt file was the tree output for a different Parcels run. It has now been replaced with the output of mpirun -np 2 python ~/parcels/parcels/examples/example_stommel.py -p 10
I think I have located the source of the bug. In line 369 of particlefile.py we have:
data = self.read_from_npy(global_file_list, len(self.time_written), var)
On my laptop, when running example_stommel.py, the value of the argument len(self.time_written) is 121, reflecting the number of timestamps at which the run is asked to save particle trajectories. When this is passed to self.read_from_npy(..., time_steps=len(self.time_written), ...), the value 121 is used to set the length of t_ind_used. Then self.read_from_npy loops over all the files in global_file_list, filling in successive entries of t_ind_used as it goes.
However, there are 121*2 = 242 files in global_file_list, since I am running in MPI mode with 2 cores. This is why read_from_npy() fails after 121 loop iterations: it runs out of entries in t_ind_used to fill.
I am not enormously familiar with particlefile.py, so I don't know in what way this differs from the desired behaviour. I am quite puzzled, though, how this could be an issue on my local machine and not elsewhere: if the above is indeed the problem, I would imagine it would cause issues on everyone's machine.
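To make the overrun concrete, here is a toy sketch of the failure as I understand it (illustrative only, not the actual particlefile.py code; the names are just placeholders):

import numpy as np

# Toy illustration of the index overrun described above (not the real Parcels code).
n_timesteps = 121                     # len(self.time_written)
n_procs = 2                           # MPI ranks in my run
global_file_list = ['%d.npy' % i for i in range(n_timesteps * n_procs)]   # 242 files

t_ind_used = np.zeros(n_timesteps)
for t_ind in range(len(global_file_list)):
    t_ind_used[t_ind] = 1             # raises IndexError once t_ind reaches 121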
Hi @christensen5. The length of t_ind_used refers to the number of time steps, and that should be 121 on both processors (the run is 600 days long, outputting every 5 days, plus at day 0). So it should be the same in single-processor and MPI mode.
The workflow is as follows:
1) time_written in each file is created (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L361)
2) the time_written values are stored in self.time_written (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L366)
3) self.time_written is then used as the number of timesteps in creating the data array (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L369)
I really don't understand why this goes wrong on your laptop. Can you check what self.time_written is? Perhaps first reduce the length of the Stommel simulation so that you don't get 121 values ;-)
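In case it helps, here is a rough paraphrase of what that merging amounts to (illustrative only, not the literal particlefile.py code; I'm assuming each per-timestep .npy dictionary carries a 'time' entry):

import glob
import numpy as np

# Collect the times recorded in every temporary .npy file, from all MPI ranks,
# and merge them into a single sorted list without duplicates.
times = []
for fname in glob.glob('out-*/*/*.npy'):
    data = np.load(fname, allow_pickle=True).item()
    times.extend(np.atleast_1d(data['time']))

time_written = sorted(set(times))
print(len(time_written))   # should be 121 for the Stommel run, regardless of the number of ranks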
Hi Erik,
Indeed, self.time_written has length 121, and hence so does t_ind_used.
The issue is that global_file_list has length 121*(number of MPI processes), and it is also passed to read_from_npy() along with self.time_written, whereupon the length of the for-loop in lines 317-330 is set to the length of global_file_list.
This means that after 121 iterations of that loop, line 328 fails, since t_ind_used is only 121 entries long, whereas line 328 attempts to access its 122nd element at that point.
The algorithm you're describing is intentional, and should be correct (indeed, it works on all our systems and your HPC).
Can you send me the individual files in your out-* directory (e.g. zipped), so that I can explore for myself what's going on?
Here you go:
Thanks @christensen5, for sending through these files. This really helped with the debugging!
It turns out that the writing of the files goes wrong, not the reading.
See the snippet below, which compares your breaking output to a nonbreaking one (attached here):
import numpy as np

for proc in ['0', '1']:
    breaking = np.load('out-VPQYLGSA/%s/100.npy' % proc, allow_pickle=True).item()
    print('breaking %s' % proc, breaking['id'])

for proc in ['0', '1']:
    nonbreaking = np.load('out-VNNVMAQJ/%s/100.npy' % proc, allow_pickle=True).item()
    print('nonbreaking %s' % proc, nonbreaking['id'])
with output
breaking 0 [0 1 2 3 4 5 6 7 8 9]
breaking 1 [0 1 2 3 4 5 6 7 8 9]
nonbreaking 0 [0 1 2 3 4]
nonbreaking 1 [5 6 7 8 9]
As you can see, for some reason both processors write all ten particles in your breaking laptop case, whereas the expected (nonbreaking) behaviour is that particles 0-4 are run and written by processor 0, and particles 5-9 are run and written by processor 1.
So I now suspect that the issue is somewhere with the partitioning of the ParticleSet, in lines https://github.com/OceanParcels/parcels/blob/master/parcels/particleset.py#L106-L125. I can't really dig much further, as I can't replicate the error. But could you check what's going on with your laptop installation of Parcels, and see if indeed both processors run all particles (e.g. by adding a print(pid) after line 122)?
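For context, the partitioning in those lines is roughly of the following form (a simplified sketch with illustrative names and positions, not the exact particleset.py code):

import numpy as np
from mpi4py import MPI
from sklearn.cluster import KMeans

comm = MPI.COMM_WORLD
mpi_rank = comm.Get_rank()
mpi_size = comm.Get_size()

# Ten illustrative particle positions, standing in for the -p 10 Stommel particles.
lon = np.linspace(0., 60., 10)
lat = np.linspace(10., 50., 10)

if mpi_size > 1:
    # Cluster the particle positions into one group per MPI rank and keep only
    # the particles assigned to this rank.
    coords = np.vstack([lon, lat]).transpose()
    kmeans = KMeans(n_clusters=mpi_size, random_state=0).fit(coords)
    pid = np.arange(lon.size)[kmeans.labels_ == mpi_rank]
    print(mpi_rank, pid)   # the print(pid)-style check suggested above
else:
    pid = np.arange(lon.size)

If both processors print all ten ids after line 122, the partitioning is evidently not being applied in your setup.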
Thanks Erik, this finally allowed me to discover the problem. It appears that while creating a new venv to run MPI Parcels jobs I failed to install scikit-learn.
In lines 20-24, particleset.py attempts to import both MPI and sklearn.cluster.KMeans, and sets MPI = None if either package fails to load. Thereupon the rest of the simulation proceeds as normal, except that none of the code in lines 98-125 is executed. Presumably this is exactly what is desired if a user is running in single-core mode and therefore doesn't have mpi4py or scikit-learn installed. It does mean that in my case no partitioning was performed, and each process ran all particles before encountering the bug I've been struggling with.
This is clearly my fault for having installed without a required dependency, but perhaps it would be useful for the except statement in line 24 to issue a warning telling the user that importing MPI and/or scikit-learn has failed, and that the run will continue in single-core mode?
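Something along these lines is what I have in mind (just a sketch of the suggestion, with the warning text made up by me; the real try/except lives in particleset.py lines 20-24):

import warnings

try:
    from mpi4py import MPI
    from sklearn.cluster import KMeans   # only needed for MPI partitioning
except ImportError:
    # Instead of failing silently, tell the user why partitioning will be skipped.
    warnings.warn('mpi4py and/or scikit-learn could not be imported; '
                  'the run will continue in single-core mode without partitioning')
    MPI = None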
EDIT: fixed urls
Great that this is sorted out! I have now created PR #729, which throws an error if MPI is installed but sklearn isn't.
What I'm somewhat surprised by, though, is that line 111 did not crash in your setup. Do you have any idea how that could pass?
If I have correctly understood the try/except block in lines 20-24, then the fact that I did not have sklearn installed meant that the variable MPI was set to None in line 24.
This means that the if MPI: condition in line 98 would not have been met, so none of the subsequent code between lines 98-125 would have been run, including line 111, which otherwise would indeed have crashed.
Yes, of course. Good point. Makes sense now! So you were essentially just running Parcels twice, with the same ParticleSet on each processor, when you did mpirun -np 2.
Does #729 make sense to you? Do you think that will fix the Issue?
So you were essentially just running parcels twice, with the same ParticleSet on each processor, when you did mpirun -np 2.
Yes I think so, and since Parcels was expecting some partitioning of the ParticleSet to have been done, it failed later on when it had to combine 2x as many .npy files as it had expected.
Does #729 make sense to you? Do you think that will fix the Issue?
Makes perfect sense, and I expect it should fix it, yes. I can't think of a way the same issue could arise now that we're testing separately for the sklearn installation and stopping the run with an EnvironmentError if it's not installed.
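As I read it, the check amounts to something like this (my paraphrase of the idea behind #729, not necessarily its exact code):

try:
    from mpi4py import MPI
except ImportError:
    MPI = None

if MPI:
    try:
        from sklearn.cluster import KMeans
    except ImportError:
        # Stop immediately rather than silently running every particle on every rank.
        raise EnvironmentError('mpi4py is installed but scikit-learn is not; '
                               'sklearn is needed to partition the ParticleSet in MPI mode')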
Running Parcels in MPI mode with mpirun -np x yields the following error whenever x > 1. I am running the latest development version of Parcels (git pull from the GitHub repo master branch).