Thanks for reporting, @christensen5. However, I tried on four different systems (laptop, desktop, and two different HPC systems), and I can't replicate this error on any of them.
Some steps you could take are:
1) Check on another system: can you replicate the error there?
2) What does the content of the temporary file directory look like? Can you report that here (using e.g. the tree command)?
3) A bit of a long shot, but what happens if you run with the #678 Pull Request? That PR includes some changes to ParticleFile handling, so perhaps it fixes this (the PR works on Linux/Mac, but not on Windows yet).
I look forward to hearing the results of these steps!
Thanks Erik for the suggestions - it does indeed seem this is an issue with my own laptop. Running example_stommel.py
in MPI mode on Imperial's CX1 seems to work fine.
What does the content of the temporary file directory look like? Can you report that here (using e.g. the tree command)?
I've attached a .txt file containing the output of tree out-NFFJUAUO (the temp directory created by my 2-core MPI Stommel run).
A bit of a long shot, but what happens if you run with the #678 Pull Request?
I'll give this a try once I'm done with some ongoing runs to profile the memory usage of my microbe simulation!
EDIT: The original .txt file was the tree output for a different Parcels run. It has now been replaced with the output of mpirun -np 2 python ~/parcels/parcels/examples/example_stommel.py -p 10
I think I have located the source of the bug. In line 369 of particlefile.py we have:
data = self.read_from_npy(global_file_list, len(self.time_written), var)
On my laptop, when running example_stommel.py, the value of the argument len(self.time_written) is 121, reflecting the number of timestamps at which the run is asked to save particle trajectories. When this is passed to self.read_from_npy(..., time_steps=len(self.time_written), ...), the value 121 is used to set the length of t_ind_used. Then self.read_from_npy loops over all the files in global_file_list, filling in successive entries of t_ind_used as it goes.
However, there are 121*2 = 242 files in global_file_list, since I am running in MPI mode with 2 cores. This is why read_from_npy() fails after 121 loop iterations: it runs out of entries in t_ind_used to fill.
I am not enormously familiar with particlefile.py, so I don't know in what way this differs from the desired behaviour. I am quite puzzled, though, how this could be an issue on my local machine and not elsewhere: if the above is indeed the problem, I would imagine it would cause issues on everyone's machine.
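To make the overrun concrete, here is a toy sketch of the failure as I understand it (illustrative only, not the actual particlefile.py code; the names are just placeholders):

import numpy as np

# Toy illustration of the index overrun described above (not the real Parcels code).
n_timesteps = 121                     # len(self.time_written)
n_procs = 2                           # MPI ranks in my run
global_file_list = ['%d.npy' % i for i in range(n_timesteps * n_procs)]   # 242 files

t_ind_used = np.zeros(n_timesteps)
for t_ind in range(len(global_file_list)):
    t_ind_used[t_ind] = 1             # raises IndexError once t_ind reaches 121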
Hi @christensen5. The length of t_ind_used refers to the number of time steps, and that should be 121 on both processors (the run is 600 days long, outputting every 5 days, plus at day 0). So it should be the same in single-processor and MPI mode.
The workflow is as follows:
1) time_written in each file is created (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L361)
2) the time_written values are stored in self.time_written (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L366)
3) self.time_written is then used as the number of timesteps in creating the data array (https://github.com/OceanParcels/parcels/blob/master/parcels/particlefile.py#L369)
I really don't understand why this goes wrong on your laptop. Can you check what self.time_written is? Perhaps first reduce the length of the Stommel simulation so that you don't get 121 values ;-)
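In case it helps, here is a rough paraphrase of what that merging amounts to (illustrative only, not the literal particlefile.py code; I'm assuming each per-timestep .npy dictionary carries a 'time' entry):

import glob
import numpy as np

# Collect the times recorded in every temporary .npy file, from all MPI ranks,
# and merge them into a single sorted list without duplicates.
times = []
for fname in glob.glob('out-*/*/*.npy'):
    data = np.load(fname, allow_pickle=True).item()
    times.extend(np.atleast_1d(data['time']))

time_written = sorted(set(times))
print(len(time_written))   # should be 121 for the Stommel run, regardless of the number of ranks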
Hi Erik,
Indeed, self.time_written has length 121, and hence so does t_ind_used.
The issue is that global_file_list has length 121*(number of MPI processes), and it is also passed to read_from_npy() along with self.time_written, whereupon the length of the for-loop in lines 317-330 is set to the length of global_file_list.
This means that after 121 iterations of that loop, line 328 fails, since t_ind_used is only 121 entries long, whereas line 328 attempts to access its 122nd element at that point.
The algorithm you're describing is intentional, and should be correct (indeed, it works on all our systems and your HPC).
Can you send me the individual files in your out-* directory (e.g. zipped), so that I can explore for myself what's going on?
Here you go:
Thanks @christensen5, for sending through these files. This really helped with the debugging!
It turns out that the writing of the files goes wrong, not the reading.
See the snippet below, which compares your breaking output to a nonbreaking one (attached here):
import numpy as np

for proc in ['0', '1']:
    breaking = np.load('out-VPQYLGSA/%s/100.npy' % proc, allow_pickle=True).item()
    print('breaking %s' % proc, breaking['id'])

for proc in ['0', '1']:
    nonbreaking = np.load('out-VNNVMAQJ/%s/100.npy' % proc, allow_pickle=True).item()
    print('nonbreaking %s' % proc, nonbreaking['id'])
with output
breaking 0 [0 1 2 3 4 5 6 7 8 9]
breaking 1 [0 1 2 3 4 5 6 7 8 9]
nonbreaking 0 [0 1 2 3 4]
nonbreaking 1 [5 6 7 8 9]
As you can see, for some reason both processors write all ten particles in your breaking laptop case, whereas the expected (nonbreaking) behaviour is that particles 0-4 are run and written by processor 0, and particles 5-9 are run and written by processor 1.
So I now suspect that the issue is somewhere with the partitioning of the ParticleSet, in lines https://github.com/OceanParcels/parcels/blob/master/parcels/particleset.py#L106-L125. I can't really dig much further, as I can't replicate the error. But could you check what's going on with your laptop installation of Parcels, and see if indeed both processors run all particles (e.g. by adding a print(pid) after line 122)?
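For context, the partitioning in those lines is roughly of the following form (a simplified sketch with illustrative names and positions, not the exact particleset.py code):

import numpy as np
from mpi4py import MPI
from sklearn.cluster import KMeans

comm = MPI.COMM_WORLD
mpi_rank = comm.Get_rank()
mpi_size = comm.Get_size()

# Ten illustrative particle positions, standing in for the -p 10 Stommel particles.
lon = np.linspace(0., 60., 10)
lat = np.linspace(10., 50., 10)

if mpi_size > 1:
    # Cluster the particle positions into one group per MPI rank and keep only
    # the particles assigned to this rank.
    coords = np.vstack([lon, lat]).transpose()
    kmeans = KMeans(n_clusters=mpi_size, random_state=0).fit(coords)
    pid = np.arange(lon.size)[kmeans.labels_ == mpi_rank]
    print(mpi_rank, pid)   # the print(pid)-style check suggested above
else:
    pid = np.arange(lon.size)

If both processors print all ten ids after line 122, the partitioning is evidently not being applied in your setup.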
Thanks Erik, this finally allowed me to discover the problem. It appears that while creating a new venv to run MPI Parcels jobs I failed to install scikit-learn.
In lines 20-24, particleset.py attempts to import both MPI and sklearn.cluster.KMeans, and sets MPI = None if either package fails to load. Thereupon the rest of the simulation proceeds as normal, except that none of the code in lines 98-125 is executed. Presumably this is exactly what is desired if a user is running in single-core mode and therefore doesn't have mpi4py or scikit-learn installed. It does mean that in my case no partitioning was performed, and each process ran all particles before encountering the bug I've been struggling with.
This is clearly my fault for having installed without a required dependency, but perhaps it would be useful for the except statement in line 24 to issue a warning telling the user that importing MPI and/or scikit-learn has failed, and that the run will continue in single-core mode?
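Something along these lines is what I have in mind (just a sketch of the suggestion, with the warning text made up by me; the real try/except lives in particleset.py lines 20-24):

import warnings

try:
    from mpi4py import MPI
    from sklearn.cluster import KMeans   # only needed for MPI partitioning
except ImportError:
    # Instead of failing silently, tell the user why partitioning will be skipped.
    warnings.warn('mpi4py and/or scikit-learn could not be imported; '
                  'the run will continue in single-core mode without partitioning')
    MPI = None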
EDIT: fixed urls
Great that this is sorted out! I have now created PR #729, which throws an error if MPI is installed but sklearn isn't.
What I'm somewhat surprised by, though, is that line 111 did not crash in your setup. Do you have any idea how that could pass?
If I have correctly understood the try/except block in lines 20-24, then the fact that I did not have sklearn installed meant that the variable MPI was set to None in line 24.
This means that the if MPI: condition in line 98 would not have been met, so none of the subsequent code between lines 98-125 would have been run, including line 111, which otherwise would indeed have crashed.
Yes, of course. Good point. Makes sense now! So you were essentially just running Parcels twice, with the same ParticleSet on each processor, when you did mpirun -np 2.
Does #729 make sense to you? Do you think that will fix the Issue?
So you were essentially just running parcels twice, with the same ParticleSet on each processor, when you did mpirun -np 2.
Yes I think so, and since Parcels was expecting some partitioning of the ParticleSet to have been done, it failed later on when it had to combine 2x as many .npy files as it had expected.
Does #729 make sense to you? Do you think that will fix the Issue?
Makes perfect sense, and I expect it should fix it, yes. I can't think of a way the same issue could arise now that we're testing separately for the sklearn installation and stopping the run with an EnvironmentError if it's not installed.
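As I read it, the check amounts to something like this (my paraphrase of the idea behind #729, not necessarily its exact code):

try:
    from mpi4py import MPI
except ImportError:
    MPI = None

if MPI:
    try:
        from sklearn.cluster import KMeans
    except ImportError:
        # Stop immediately rather than silently running every particle on every rank.
        raise EnvironmentError('mpi4py is installed but scikit-learn is not; '
                               'sklearn is needed to partition the ParticleSet in MPI mode')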
Running Parcels in MPI mode with mpirun -np x yields the following error whenever x > 1. I am running the latest development version of Parcels (git pull from the GitHub repo master branch).