OceanParcels / Parcels

Main code for Parcels (Probably A Really Computationally Efficient Lagrangian Simulator)
https://www.oceanparcels.org
MIT License

Strange netCDF behaviour when running multiple instances of Parcels #925

Closed · nvogtvincent closed this issue 4 years ago

nvogtvincent commented 4 years ago

I've come across some strange behaviour when running multiple Parcels processes at the same time (not using Parcels' parallel capability; I mean running multiple independent Parcels scripts on one machine simultaneously). One of two things happens. The first is that there is no error, but the netCDF file generated at the end contains extra data. For instance, when running two processes at once, instead of the time axis being

0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100

it might be

0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 90, 100

The amount of 'extra data' seems to increase with the number of processes I'm running simultaneously (e.g. when running 15 at once I've had files that are 3x larger than they should be), so it looks like the process that generates the netcdf file is getting mixed up between the different processes.
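For reference, a quick check like the one below can confirm the duplication. This is only an illustrative sketch: the file name `run_a.nc` is hypothetical, and it assumes xarray is installed and the usual Parcels trajectory layout with a `(traj, obs)` time variable.

```python
# Illustrative check for duplicated records along the time axis of a Parcels
# output file. The file name is hypothetical; assumes the usual (traj, obs) layout.
import numpy as np
import xarray as xr

ds = xr.open_dataset("run_a.nc")
times = ds["time"].isel(traj=0).values  # time axis of the first trajectory

# Drop fill values (NaT for datetime64 axes, NaN for numeric axes) before comparing
if np.issubdtype(times.dtype, np.datetime64):
    valid = times[~np.isnat(times)]
else:
    valid = times[~np.isnan(times)]

n_unique = np.unique(valid).size
print(f"{valid.size} time records, {n_unique} unique values")
if valid.size != n_unique:
    print("Duplicated time records: the file may contain output from another run")
```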

The second is that the process crashes on the `output_file.export()` line with the error message `No such file or directory: 'out-XXXXXXX/0/0.npy'`. I've mainly seen this variant when running the different instances as foreground processes.

I've tested this on two machines, with the processes run either in the foreground or as background nohup processes, and I see the same issues in all cases. The problem occurs even when the different scripts are fully independent, i.e. using different files for the velocity field (and obviously writing to different netCDF files). All of these scripts run without issues if only one process is running. It looks like the final step that converts the temporary numpy files into netCDF gets mixed up between the output of different processes when more than one instance of Parcels is running. Is there any way of getting around this (apart from running each script in its own directory)?

erikvansebille commented 4 years ago

Thanks for reporting @Plagioclase, this indeed is very strange behaviour!

A few questions:

  1. Which version of Parcels are you running?
  2. Can you confirm that the two processes print that they write to different out-XXXXXXX directories? And that indeed both directories exist during the processes, and that files get written to them?
  3. Could you add the line `print(output_file.tempwritedir_base)` to your scripts and check that these paths are indeed different (and correspond to the directories under 2)?

  4. One option is that MPI somehow confuses `particlefile.export()`. Could you, in `particlefile.py`, change

    - try:
    -     from mpi4py import MPI
    - except:
    -     MPI = None
    + MPI = None

    That forces the export function not to use MPI (see the small check below to confirm the patch took effect).
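As a quick sanity check (a sketch, assuming a Parcels version with the module-level MPI guard shown in the diff above), you can confirm how Parcels sees MPI at runtime:

```python
# After the suggested change, parcels.particlefile.MPI should be None, so the
# export step cannot take any MPI-specific code path.
import parcels.particlefile as particlefile

print("MPI object seen by ParticleFile:", particlefile.MPI)
```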

Very curious to hear about your findings!

nvogtvincent commented 4 years ago

Thanks Erik, I think I've worked out what the issue is! The processes were all writing to the same out-XXXXXXX directory, and the reason is that I was using the same random seed in all my runs to ensure that the set-up is consistent between them. It looks like the process that generates the temporary directory name also uses that seed, so all of my processes were generating the same 'random' directory name. Defining `tempwritedir` when setting up the ParticleFile solves the issue.
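For anyone hitting the same thing, here is a minimal sketch of the workaround. It uses the Parcels v2-era API from the time of this issue (exact keyword names may differ between versions), and the tiny synthetic field and file names are hypothetical, only there to keep the example self-contained.

```python
# Minimal sketch: keep the shared random seed, but give every script its own
# temporary write directory so the seed-driven out-XXXXXXX name cannot collide.
import os
from datetime import timedelta

import numpy as np
from parcels import AdvectionRK4, FieldSet, ParticleSet, ScipyParticle

np.random.seed(42)  # shared seed, kept for a consistent set-up across runs

# Tiny synthetic velocity field so the sketch does not depend on external data
lon = np.linspace(0.0, 1.0, 5)
lat = np.linspace(0.0, 1.0, 5)
U = 0.1 * np.ones((5, 5))
V = np.zeros((5, 5))
fieldset = FieldSet.from_data({"U": U, "V": V}, {"lon": lon, "lat": lat}, mesh="flat")

pset = ParticleSet(fieldset=fieldset, pclass=ScipyParticle, lon=[0.5], lat=[0.5])

# The key line: an explicit, per-process temporary directory
output_file = pset.ParticleFile(
    name="run_a.nc",
    outputdt=timedelta(hours=1),
    tempwritedir=f"tmp_parcels_{os.getpid()}",
)

pset.execute(AdvectionRK4, runtime=timedelta(days=1),
             dt=timedelta(minutes=30), output_file=output_file)
output_file.export()
```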

erikvansebille commented 4 years ago

Ah, thanks! Still, I think this would classify as a bug: we can expect users to want to seed their random number generator, and the naming of the output directory should not depend on that seed. I'm reopening so that we can come up with a fix.

erikvansebille commented 4 years ago

Hi @Plagioclase, FYI I have now implemented a change in #931 that throws an error if a ParticleFile wants to use a temporary output directory that already exists. This would have caught your case by failing right at the start, when both processes tried to write to the same directory. Do you agree this is a good solution?
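Purely as an illustration of the kind of guard described here, a hypothetical sketch (not the actual patch in #931) could look like this:

```python
# Hypothetical sketch of the guard: refuse to reuse a temporary write directory
# that already exists, so two processes that end up with the same seed-derived
# name fail loudly at start-up instead of mixing their files.
import os


def create_tempwritedir(tempwritedir: str) -> str:
    """Create the temporary write directory, raising if it is already in use."""
    if os.path.exists(tempwritedir):
        raise IOError(
            f"Temporary write directory {tempwritedir} already exists; "
            "is another ParticleFile (or another Parcels process) using it?"
        )
    os.makedirs(tempwritedir)
    return tempwritedir
```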

nvogtvincent commented 4 years ago

This sounds very sensible to me. I don't think anything more is needed, since it's a niche issue in the first place and it's easily corrected once the user is aware of it. Thanks for fixing it!