OceanParcels / Parcels

Main code for Parcels (Probably A Really Computationally Efficient Lagrangian Simulator)
https://www.oceanparcels.org
MIT License

Issue with memory exceeding limit #711

Closed claudiofgcardoso closed 4 years ago

claudiofgcardoso commented 4 years ago

Hello all,

I am having a similar issue to #703. When using the beta 2.0.0 version of Parcels I had no problem running a 2D simulation with 10 years of daily Mercator files (nested grids at 0.083° and 0.25°) and 15 particles released daily. Now, using Parcels 2.1.2 (no MPI) on a single node with 120 GB of memory, it raises the following error:

"(...) 26% (49852800.0 of 189043200.0) |# | Elapsed Time: 0:27:24 ETA: 1:52:44 26% (49939200.0 of 189043200.0) |# | Elapsed Time: 0:27:28 ETA: 2:00:27 /var/log/slurm/spool_slurmd/job3571814/slurm_script: line 16: 13617 Killed python run_MAC_NEMO.py -stype backward -pnumber 2 > run_MAC_NEMO_backward.log Wed Jan 8 16:53:03 CET 2020 slurmstepd: error: Exceeded step memory limit at some point."

First I thought it was related to the fact that I was now running a simulation with 300 particles released at a 5-day interval on an HPC. But when I decreased the number of particles to 12 the issue persisted, always when the simulation was at ~26%. So I guess the problem isn't related to the number of particles.

I also ran the original simulation (15 particles, on a normal Linux laptop with Parcels 2.1.2) and the process stops at 8% because of lack of memory. deferred_load is set to the default (i.e. True), so I don't understand what is going on.

MAC_NEMO.txt run_MAC_NEMO.txt

Any help and suggestion is greatly appreciated!
Cláudio

claudiofgcardoso commented 4 years ago

Forgot to mention that after all this I tried several things: 1) created a new Python environment with Parcels v2.1.1, and the problem persisted; 2) removed Parcels v2.1.1 from the newly created environment and installed v2.0.0, and the problem persisted; 3) then I removed the newly created environment and created a new one, this time with Parcels v2.0.0, and it worked!

So maybe the problem is related to a dependency...?

erikvansebille commented 4 years ago

Thank you for letting us know, @claudiofgcardoso. This indeed appears similar to #703 and perhaps also #668?

We did change to Field chunking in v2.1.2 (see also the documentation on Chunking the FieldSet with dask and #632), but the idea of that change was that less data would be loaded into memory, so it is surprising that these memory issues pop up because of the Field chunking.

We will look into this!

erikvansebille commented 4 years ago

Hi @claudiofgcardoso and @sjvg (cc @CKehl). Could you try running your simulations with field_chunksize=False in the FieldSet creation? That turns off the auto-chunking and should lead to more stable memory usage; see also this documentation.
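
For reference, a minimal sketch of what that could look like; the file patterns, variable names and dimension mappings below are placeholders, not the ones from your setup:

from parcels import FieldSet

# Placeholder NEMO/Mercator-style mappings; substitute your own files and names
filenames = {'U': 'mercator_U*.nc', 'V': 'mercator_V*.nc'}
variables = {'U': 'uo', 'V': 'vo'}
dimensions = {'lon': 'glamf', 'lat': 'gphif', 'time': 'time_counter'}

# field_chunksize=False turns off the dask auto-chunking (keyword in Parcels v2.1.x)
fieldset = FieldSet.from_nemo(filenames, variables, dimensions, field_chunksize=False)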

sjvg commented 4 years ago

Hi @erikvansebille, I tried my simulation adding field_chunksize=False but I got the same memory error after 20% completion. For info, like @claudiofgcardoso, I created a new environment with parcels=2.0.0 and with this setup managed to run the simulation successfully.

erikvansebille commented 4 years ago

OK, thanks. Could you check the memory use during the simulation (ideally both with v2.0.0 and with v2.1.2)?

We used the following code to create the graphs in #668, if it helps:

import psutil
from time import sleep
import numpy as np

# Print the resident memory (RSS) of every running python3.7 process
for proc in psutil.process_iter():
    if proc.name() == 'python3.7':
        print('Process[%d] (%s): %d MB' % (proc.pid, proc.name(), proc.memory_info().rss / (1024 * 1024)))
        print(proc)
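
If it helps to track the usage over time (presumably the reason sleep was imported above), here is a minimal sketch that samples the resident memory every few seconds; the process name 'python3.7' and the 10-second interval are assumptions to adapt:

import psutil
from time import sleep

# Sample the resident memory of all python3.7 processes every 10 seconds
while True:
    for proc in psutil.process_iter():
        if proc.name() == 'python3.7':
            print('Process[%d]: %d MB' % (proc.pid, proc.memory_info().rss // (1024 * 1024)))
    sleep(10)
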
claudiofgcardoso commented 4 years ago

Hi @erikvansebille. Indeed, I've been testing several values of field_chunksize since yesterday. field_chunksize=400 seems to be the optimal setup for my simulation, as the execution is very fast and, surprisingly, memory usage doesn't change much (< 2%). But like @sjvg, I have the same memory issues when adding field_chunksize=False or field_chunksize='auto'.

Will check memory use as you suggested.

erikvansebille commented 4 years ago

Ah thanks, this is really useful information already! I guess it makes sense that a smaller field_chunksize yields lower memory usage, as long as the particles are not distributed over the entire domain. But both of you were doing regional studies, right?

claudiofgcardoso commented 4 years ago

Well, unfortunately it seems that the simulation stopped again at 26% with "slurmstepd: error: Exceeded step memory limit at some point." when adding field_chunksize=400...

My simulation extends over the entire North Atlantic. So memory usage was low while particles were close to the source, but as you suggested, once they got distributed over a larger area the memory issues arose once again.

What I find very strange is that when adding field_chunksize=False, or leaving it at the default (i.e. 'auto'), I get a new error at the beginning of the simulation:

"(...) Running first part of simulation: from 2015/12/29 to 2010/01/01 INFO: Compiled PlasticParticleBeachTesting_2DAdvectionRK4BeachTesting_2DUnBeaching_PrevCoordStokesDragBeachTesting_2DWindageDragBeachTesting_2DBrownianMotion2DBeachTesting_2DAgeingTotalDistance ==> /tmp/parcels-1005/b05fcba4060a41ad0f45420f66944abf_0.so Particle [243] beached after stokes or windage interaction. Deleting (-27.1558 38.6033 0.51 3.15274e+08) Traceback (most recent call last): File "run_MAC_NEMO_nochunk.py", line 105, in dt=args.dt, pnumber=args.pnumber) File "run_MAC_NEMO_nochunk.py", line 87, in p_advect kukulka = kukulka, diffusion=diffusion, wind=wind, flag_3D=flag_3D, dt=dt) File "/home/claudio/OceanParcels/MAC/MAC_NEMO.py", line 590, in MAC_NEMO execute_run(pset, kernel, start_date, finish_date, timestep, output_file, stop_release) File "/home/claudio/OceanParcels/MAC/MAC_NEMO.py", line 493, in execute_run recovery={ErrorCode.ErrorOutOfBounds: DeleteParticle}) File "/home/claudio/miniconda3/lib/python3.7/site-packages/parcels/particleset.py", line 475, in execute self.kernel.execute(self, endtime=time, dt=dt, recovery=recovery, output_file=output_file) File "/home/claudio/miniconda3/lib/python3.7/site-packages/parcels/kernel.py", line 355, in execute recovery_kernel = recovery_map[p.state] KeyError: 2"

This only happens when running the entire length of the simulation (10 years). When testing over only 1 month it works as usual...

claudiofgcardoso commented 4 years ago

I'm finding this issue very confusing, for several reasons (I'm using two computers for my tests: a laptop and an HPC using "slurm-client"):

1) When running a one-month test:

2) When running a 10-year run:

I don't understand what the difference between the laptop and the HPC could be, since both run the same version of Parcels (v2.1.2).

ignasivalles commented 4 years ago

Hi @claudiofgcardoso, about the recovery kernel: I got the same error as you (recovery_kernel = recovery_map[p.state]) after some years of backward simulation, and I don't know why... To "solve" it I wrote my own kernel that deletes particles when they are out of my domain of interest. I'll tell you if it works, but it would be better if there were a way to fix it properly.
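
For what it's worth, a minimal sketch of such a kernel, under the assumption of made-up domain bounds (replace them with your own region of interest); particle.delete() marks the particle for removal:

def DeleteOutOfDomain(particle, fieldset, time):
    # Hypothetical bounds for the domain of interest; adjust to your region
    if particle.lon < -100. or particle.lon > 20. or particle.lat < 0. or particle.lat > 70.:
        particle.delete()

It would then be combined with the advection kernel, e.g. pset.Kernel(AdvectionRK4) + pset.Kernel(DeleteOutOfDomain), when calling pset.execute().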

CKehl commented 4 years ago

Hello everyone, I'm a new developer in the Parcels group, and I'm currently looking into this issue as well. After investigating #632, I noticed that @delandmeterp introduced a garbage collector ('import gc' and 'gc.collect()') in the examples, which may have to do with cleaning up memory that is no longer needed.

Would it be possible for you to add the gc import to the top of your Python script and call 'gc.collect()' before (each) 'FieldSet.from_(...)' call? Does this remove the excess memory issue? Or do you instead get other errors, e.g. that the particle advection couldn't find some field value?

For the how-to on the garbage collection, have a look at 'parcels/example_moving_eddies.py', line 1 and line 175.
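
In other words, something along these lines (a minimal sketch; the file patterns, variable names and dimension mappings are placeholders):

import gc

from parcels import FieldSet

# Placeholder mappings; substitute your own NEMO files and variable names
filenames = {'U': 'mercator_U*.nc', 'V': 'mercator_V*.nc'}
variables = {'U': 'uo', 'V': 'vo'}
dimensions = {'lon': 'glamf', 'lat': 'gphif', 'time': 'time_counter'}

gc.collect()  # explicitly run the garbage collector before the (large) FieldSet allocation
fieldset = FieldSet.from_nemo(filenames, variables, dimensions)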

claudiofgcardoso commented 4 years ago

Hi @CKehl. Unfortunately I'm not able to test the garbage collector over a long period because of the error mentioned previously with recovery_kernel = recovery_map[p.state]. I am only able to run the 10-year period backward in time and with field_chunksize=400. When I try running it forward in time, or with field_chunksize='auto'/False, it crashes.

Running for 2 months, however, memory keeps increasing gradually until the end of the simulation, reaching 12.8 GB of usage. So it seems that gc.collect() doesn't make much difference.

CKehl commented 4 years ago

Hi Claudio - I've read your error carefully once again and considered some options. First of all, when comparing the HPC system and your laptop: are both devices running with MPI activated?

Now, on the 10-year run: if all particles are distributed near-arbitrarily over the globe, then from what I have seen so far there is very little we can do (i.e. it needs further investigation). The only thing to keep in mind in that situation, especially when running with MPI, is that because of the chunking per MPI process there will be more duplicate blocks of field values in memory, since each process loads its own halo regions. Compared to a single-process run, where no halo regions should be allocated, this can result in a considerable memory overhead.

On the error with field_chunksize='auto': this parameter is forwarded directly to Dask, the Python library that handles the chunked loading of field data (and hence much of the memory management). On their website (https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking) we can see that this automatic chunk size estimation is based on a user-defined config variable. Could you perhaps log in to your HPC system, navigate in your home directory to ${HOME}/.config/dask, and attach your dask.yaml file here? In short, if the config variable array.chunk-size is not defined, Dask will have trouble determining a good chunk size automatically. You can try something like the following as content for dask.yaml:

temporary-directory: /tmp     # Directory for local disk like /tmp, /scratch, or /local

array:
  svg:
    size: 120  # pixels
  chunk-size: 128 MiB
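
If editing dask.yaml is inconvenient, the same array.chunk-size setting can presumably also be applied from within the Python script itself, before the FieldSet is created, via Dask's config API:

import dask

# Equivalent to "chunk-size: 128 MiB" under "array:" in dask.yaml
dask.config.set({'array.chunk-size': '128MiB'})
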
claudiofgcardoso commented 4 years ago

Hi @CKehl, sorry for the late reply. I've been trying to solve the other issue with the deletion of particles, still without success.

MPI is deactivated on both systems. I also checked the dask.yaml files as you suggested. They are equal on both systems but look different from your example - chunk-size: 128 MiB is missing (see below).

# temporary-directory: null     # Directory for local disk like /tmp, /scratch, or /local

# array:
#   svg:
#     size: 120  # pixels

I tried adding chunk-size: 128 MiB but the problem persists. It is very strange that even when I declare field_chunksize=400, memory issues keep arising on the HPC.

I will use the laptop for my simulations from now on until a possible breakthrough is available.

CKehl commented 4 years ago

We're fixing the issue these days for the newer versions of Parcels, where this error starts to occur. Until then, it is good advice to stick with the setup that is working (for now). We'll keep you updated.

CKehl commented 4 years ago

Hi Claudio, actually the dask.yaml files are not equal - please note the # comment characters in your file. They mean that this information is not passed to Dask, because it is commented out.

We are on to a fix, but I'd like to understand one thing in order to be sure we also fix the issue for you: what are the differences between your laptop and the HPC system? a) Which Parcels version are you actually using on the laptop: v2.0.0 or v2.1.2? b) Does your Python environment on the laptop include Dask? You can determine this by opening a Python console (with cmd.exe, gnome-terminal or any other console of your operating system). Load your anaconda environment, start up Python by typing python, then type:

import dask

if you get no error, all is well. If you don't have dask, the error looks like this:

>>> import dask
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'dask'
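
If Dask is present, it may also help to report which version each environment has, e.g.:

>>> import dask
>>> print(dask.__version__)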

c) What is the size of your 'backup memory'? Depending on your system, this space for virtual memory is called either page memory (Mac OS X), page files (Windows) or swap (UNIX). Before the quick guide on how to find this information, here is the reason for asking: job systems on HPC clusters give you a hard memory limit - if that is reached, the process (i.e. Parcels) is killed. If you run things outside the job submission system, you have access to virtual memory - basically more memory than you actually have in the device. That prevents the process from being aborted, or at least delays it for a long time. On my laptop I have a small swap (just 2 GB), so my laptop actually behaves like a job submission system. If you have macOS or Windows, those backup memories can be enormous, which would explain the behaviour you experience. How to determine the size?
UNIX: find the app called "System Monitor", open the tab labelled "Resources"; in the memory section there is information on the swap size (e.g. "... of X GB"). Report that (a command-line alternative is sketched below).
MacOS: follow this instruction on how to check Mac OS X memory usage. Report the output of vm_stat.
Windows: follow this instruction on how to manage Windows 10 virtual memory. Make a screenshot of the window in step (8) and report it.
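
As a command-line alternative for the UNIX case (e.g. when logged into the HPC node without a graphical System Monitor), psutil, which is already used above for the memory graphs, can report the swap size too; a minimal sketch:

import psutil

# Report total and used swap ("backup memory") in GB
swap = psutil.swap_memory()
print('Swap: %.1f GB used of %.1f GB total' % (swap.used / 1e9, swap.total / 1e9))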

If the difference between the HPC system and the laptop in your case cannot be explained by one of these 3 points, then I may actually need to dive deeper into your specific case. Otherwise, the fix we are about to submit should also work for you.

claudiofgcardoso commented 4 years ago

Hi @CKehl, you are indeed right. Nevertheless, I uncommented those lines and the results were the same. Regarding your questions: a) I'm using Parcels version 2.1.2 on Python 3.7 on both systems; b) both Python environments include Dask, and the import was successful on both systems; c) I don't believe that differences in "backup memory" are the issue here; here's why:

I will wait for the new fix and then test it. Meanwhile, I'll work on the laptop.

CKehl commented 4 years ago

Dear @claudiofgcardoso, also in this issue (and here even more so than for the plotting), the new Parcels version 2.1.5 should provide you with a solution to the memory issue. Also, for NEMO data access and chunking, look at the file parcels/examples/example_dask_chunk_OCMs.py, where there are plenty of test_NEMO examples to copy from and take as a base for your own implementation. Cheers, Christian