Forgot to mention that after all this I tried several things: 1) created a new python environment with Parcels v2.1.1 and the problem persisted; 2) removed parcels v2.1.1 from the newly created environment and installed v2.0.0, and the problem persisted; 3) then I removed that environment and created a new one, this time with parcels v2.0.0, and it worked!
So maybe the problem is related to a dependency...?
Thank you for letting us know, @claudiofgcardoso. This indeed appears similar to #703 and perhaps also #668?
We did change to Field chunking in v2.1.2 (see also the documentation on Chunking the FieldSet with dask and #632), but the idea of that change was that less data would be loaded into memory, so it is surprising that these memory issues pop up due to the Field chunking.
We will look into this!
Hi @claudiofgcardoso and @sjvg (cc @CKehl). Could you try running your simulation with field_chunksize=False in the FieldSet creation? That turns off the auto-chunking, and should lead to more stable memory usage; see also this documentation.
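For reference, a minimal sketch of what that would look like (the file names, variable names and dimension names below are placeholders, not your actual setup):

```python
from parcels import FieldSet

# Placeholder NEMO-style inputs, for illustration only
filenames = {'U': 'nemo_U_*.nc', 'V': 'nemo_V_*.nc'}
variables = {'U': 'uo', 'V': 'vo'}
dimensions = {'lon': 'glamf', 'lat': 'gphif', 'time': 'time_counter'}

# field_chunksize=False switches the dask auto-chunking off entirely
fieldset = FieldSet.from_nemo(filenames, variables, dimensions,
                              field_chunksize=False)
```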
Hi @erikvansebille,
I tried my simulation adding field_chunksize=False, but I got the same memory error after 20% completion.
For info, like @claudiofgcardoso, I have created a new environment with parcels=2.0.0 and managed with this setup to run the simulation successfully.
OK thanks. Could you check the memory use during the simulation (ideally with v2.0.0 and with 2.1.2)?
We used the following code to create the graphs in #668, if it helps:
```python
import psutil
from time import sleep
import numpy as np

for proc in psutil.process_iter():
    if proc.name() == 'python3.7':
        print('Process[%d] (%s): %d MB' % (proc.pid, proc.name(), proc.memory_info().rss / (1024*1024)))
        print(proc)
```
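The snippet above checks memory once; to monitor it during the run, one could wrap it in a polling loop along these lines (just a sketch; the 10-second interval and the 'python3.7' process name are assumptions that may need adjusting on your system):

```python
import psutil
from time import sleep

# Poll every 10 seconds and log the resident memory of any running python3.7 process
while True:
    for proc in psutil.process_iter():
        if proc.name() == 'python3.7':
            print('Process[%d] (%s): %d MB' % (proc.pid, proc.name(),
                                               proc.memory_info().rss / (1024 * 1024)))
    sleep(10)
```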
Hi @erikvansebille. Indeed I've been testing several values for field_chunksize since yesterday. field_chunksize=400 seems to be the optimal setup for my simulation, as the execution is very fast and, surprisingly, memory usage doesn't change much (< 2%). But like @sjvg, I have the same memory issues when adding field_chunksize=False or field_chunksize='auto'.
Will check memory use as you suggested.
Ah thanks, this is really useful information already! I guess it makes sense that a smaller field_chunksize yields lower memory usage, as long as the particles are not distributed over the entire domain. But both of you were doing regional studies, right?
Well, unfortunately it seems that the simulation stopped again at 26% with "slurmstepd: error: Exceeded step memory limit at some point." when adding field_chunksize=400...
My simulation extends over the entire North Atlantic. So memory usage was low when the particles were close to the source, but as you suggested, when they got distributed over a larger area the memory issues arose once again.
What I find very strange is that when adding field_chunksize=False or leaving it as the default (i.e. 'auto') I am getting a new error at the beginning of the simulation:
"(...)
Running first part of simulation: from 2015/12/29 to 2010/01/01
INFO: Compiled PlasticParticleBeachTesting_2DAdvectionRK4BeachTesting_2DUnBeaching_PrevCoordStokesDragBeachTesting_2DWindageDragBeachTesting_2DBrownianMotion2DBeachTesting_2DAgeingTotalDistance ==> /tmp/parcels-1005/b05fcba4060a41ad0f45420f66944abf_0.so
Particle [243] beached after stokes or windage interaction. Deleting (-27.1558 38.6033 0.51 3.15274e+08)
Traceback (most recent call last):
File "run_MAC_NEMO_nochunk.py", line 105, in
This only happens when running the entire length of the sim. (10 years). Doing the test over only 1 month, it works as usual...
I'm finding this issue very confusing for several reasons (I'm using two computers for my tests: a laptop and an HPC using "slurm-client"):
1) When running a one-month test: the simulation runs with field_chunksize=400, False or auto. Memory increases gradually with field_chunksize=False or auto. When field_chunksize=400, memory is stable.
2) When running a 10-year run: field_chunksize=False or auto cannot be used because of the new error stated in the previous comment. With field_chunksize=400 the simulation starts apparently normally on both computers, but then it exceeds the memory limit on the HPC.
I don't understand what the possible difference between the laptop and the HPC could be, since both run the same version of parcels (v2.1.2).
Hi @claudiofgcardoso, about the recovery kernel: I got the same error as you after some years of backward simulation, at recovery_kernel = recovery_map[p.state].
I don't know why... To "solve" it I wrote my own kernel that deletes particles when they are out of my domain of interest. I'll tell you if it works, but it would be better if there is a way to fix it.
Hello everyone, I'm a new developer in the parcels group. I'm also currently looking into the issue. After investigating #632, I noticed that @delandmeterp introduced a garbage collector ('import gc' and 'gc.collect()') in the examples, which may have to do with cleaning up irrelevant memory.
Would it be possible for you to add the gc import to the top of your python script, and call 'gc.collect()' before (each) 'FieldSet.from_
For a how-to on the garbage collection, have a look at 'parcels/example_moving_eddies.py', line 1 and line 175.
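As a minimal sketch of that suggestion (the file names, variable names and dimension names are placeholders, not your actual setup):

```python
import gc

from parcels import FieldSet

# Placeholder inputs, for illustration only
filenames = {'U': 'moving_eddies_U.nc', 'V': 'moving_eddies_V.nc'}
variables = {'U': 'vozocrtx', 'V': 'vomecrty'}
dimensions = {'lon': 'nav_lon', 'lat': 'nav_lat', 'time': 'time_counter'}

gc.collect()  # explicitly free unreferenced memory before loading the fields
fieldset = FieldSet.from_netcdf(filenames, variables, dimensions)
```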
Hi @CKehl. Unfortunately I'm not able to test the garbage collector over a long period because of the error mentioned previously with recovery_kernel = recovery_map[p.state]. I am only able to run the 10-year period backward in time and with field_chunksize=400. When I try running it forward in time or with field_chunksize='auto'/False it crashes.
Running for 2 months, however, memory keeps increasing gradually until the end of the simulation, reaching 12.8 GB of usage. So it seems that gc.collect() doesn't make much difference.
Hi Claudio - I've read your error carefully once again and considered some options. First of all, when comparing the HPC system and your laptop: are both devices running with MPI activated?
Now, on the 10-year run: if all particles are distributed near-arbitrarily over the globe, there is, from what I see now, very little we can do (i.e. it needs further investigation). The only thing to keep in mind in that situation, especially when running with MPI, is that due to the chunking per MPI process, there will be more duplicate blocks of field values in memory, because each process loads its own halo regions. In comparison to a single-process run, where there should not be any halo regions allocated, this can result in a considerable memory overhead.
On the error with field_chunksize='auto': this parameter is forwarded directly to Dask, a python-based memory management library. On their website (https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking), we can see that this chunk size estimation is based on a user-defined config variable. Could you perhaps log in to your HPC system, navigate in your home directory to ${HOME}/.config/dask and attach your dask.yaml file here as an attachment? In short, if the config variable array.chunk-size is not defined, then Dask will have a problem determining a good chunk size automatically. You can try something like the following as content for dask.yaml:

```yaml
temporary-directory: /tmp  # Directory for local disk like /tmp, /scratch, or /local
array:
  svg:
    size: 120  # pixels
  chunk-size: 128 MiB
```
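One way to verify, from within your python environment, which values Dask actually picks up (a quick read-only check, not something that changes the configuration):

```python
import dask

# Print the settings Dask resolved from dask.yaml (plus its built-in defaults)
print(dask.config.get('array.chunk-size', default=None))     # e.g. '128MiB'
print(dask.config.get('temporary-directory', default=None))  # e.g. '/tmp'
```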
Hi @CKehl, sorry for the late reply. I've been trying to solve the other issue with the deletion of particles, still without success.
MPI is deactivated on both systems. I also checked the dask.yaml as you suggested. The files are equal on both systems but look different from your example - chunk-size: 128 MiB is missing (see below).

```yaml
# temporary-directory: null  # Directory for local disk like /tmp, /scratch, or /local
# array:
#   svg:
#     size: 120  # pixels
```

I tried to add chunk-size: 128 MiB but the problem persists. It is very strange that even when I declare field_chunksize=400, memory issues keep arising on the HPC.
I will use the laptop for my simulations from now on, until a possible breakthrough is available.
We're fixing the issue these days for the newer versions of parcels, where this error starts to occur. Until then, it is good advice to stick to the setup that is working (for now). We'll keep you updated.
Hi Claudio,
Actually, the dask.yaml files are not equal - please note the # commenting tags in your file. They mean that this information is not passed to Dask, because it is commented out.
We are on to a fix, but I'd like to understand one thing in order to be sure we also fix the issue for you: what are the differences between your laptop and the HPC system?
a) which Parcels version are you actually using on the laptop? v2.0.0 or v2.1.2?
b) does your python environment on the laptop include Dask? You can determine this by opening a Python console (with cmd.exe, gnome-terminal or any console of the operating system you have). Load your anaconda environment, start python by typing python, then type import dask. If you get no error, all is well. If you don't have dask, the error looks like this:

```
>>> import dask
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'dask'
```
c) what is the size of your 'backup memory'? Depending on your system, this space for virtual memory is called either page memory (Mac OS X), page files (Windows) or swap (UNIX). Now, before a quick guide on how to find this info, here's the reason for asking: job systems on HPC clusters give you a hard memory limit - if that is reached, the process (i.e. Parcels) is killed. If you run things outside the job submission system, you have access to virtual memory - basically: more memory than you actually have in the device. That prevents the process from being aborted - or delays it for a long time. On my laptop I have a small swap (just 2 GB), so my laptop actually behaves like a job submission system. If you have MacOS or Windows, those backup memories can be enormous, which would explain the behaviour you experience.
How to determine the size?
UNIX: find an app called "System Monitor", activate the tab labelled "Resources", and in the memory section there is information on the swap size (e.g. ... of X GB). Report that.
MacOS: follow this instruction on how to check Mac OS X memory usage. Report the output of vm_stat.
Windows: follow this instruction on how to manage Windows 10 virtual memory. Make a screenshot of the window in step (8) and report it.
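As a cross-platform alternative (just a sketch using psutil, which the memory-monitoring snippet earlier in this thread already relies on), you can also read the RAM and swap sizes from python:

```python
import psutil

vm = psutil.virtual_memory()  # physical RAM
sw = psutil.swap_memory()     # swap / page file / backup memory
print('RAM total : %d MB' % (vm.total / (1024 * 1024)))
print('Swap total: %d MB' % (sw.total / (1024 * 1024)))
```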
If the difference in your case between HPC and laptop cannot be explained by one of the 3 points, then I may actually need to dive deeper into your special case. Otherwise, the fix we are about to submit should work for you.
Hi @CKehl. You are indeed right. Nevertheless, I uncommented the lines and the results were the same. Regarding your questions: a) I'm using Parcels version 2.1.2 on python 3.7 on both systems; b) both python environments include dask, the import was successful on both systems; c) I don't believe that differences in "backup memory" are the issue here, here's why:
I will wait for the new fix and then test it. Meanwhile I'll work on the laptop.
Dear @claudiofgcardoso ,
Also in this issue (and here even more so than for the plotting), the new parcels version 2.1.5 should provide you with a solution to the memory issue. Also, for NEMO data access and chunking, look at the file parcels/examples/example_dask_chunk_OCMs.py, where there are plenty of test_NEMO examples to copy from and take as a base for your own implementation.
Cheers,
Christian
Hello all,
I am having a similar issue to #703. When using the beta 2.0.0 version of Parcels I had no problem running a 2D simulation with 10 years of daily Mercator files (nested grid with 0.083 and 0.25º) and 15 particles being released daily. Now, using parcels 2.1.2 (no MPI) on a single node of 120 GB, it raises the following error:
"(...) 26% (49852800.0 of 189043200.0) |# | Elapsed Time: 0:27:24 ETA: 1:52:44 26% (49939200.0 of 189043200.0) |# | Elapsed Time: 0:27:28 ETA: 2:00:27 /var/log/slurm/spool_slurmd/job3571814/slurm_script: line 16: 13617 Killed python run_MAC_NEMO.py -stype backward -pnumber 2 > run_MAC_NEMO_backward.log Wed Jan 8 16:53:03 CET 2020 slurmstepd: error: Exceeded step memory limit at some point."
First I thought it was related to the fact that I was now running a simulation with 300 particles released at a 5-day interval on an HPC. When I decreased the number of particles to 12, the issue persisted, always when the simulation was at ~26%. So I guess the problem isn't related to the number of particles.
I also ran the original sim. (15 particles on a normal linux laptop with parcels 2.1.2) and the process stops at 8% because of lack of memory. deferred_load is set to its default (i.e. True), so I don't understand what is going on...
MAC_NEMO.txt run_MAC_NEMO.txt
Any help and suggestion is greatly appreciated!
Cláudio
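For reference, a minimal sketch of a daily-release setup like the one described above (the file names, variable/dimension mappings, chunk size and release coordinates are placeholders, not the actual run_MAC_NEMO.py configuration):

```python
from datetime import timedelta

from parcels import FieldSet, JITParticle, ParticleSet

# Placeholder Mercator-style inputs, for illustration only
filenames = {'U': 'mercator_U_*.nc', 'V': 'mercator_V_*.nc'}
variables = {'U': 'uo', 'V': 'vo'}
dimensions = {'lon': 'longitude', 'lat': 'latitude', 'time': 'time'}

fieldset = FieldSet.from_netcdf(filenames, variables, dimensions,
                                field_chunksize=400)  # fixed chunk size, as discussed in this thread

# 15 particles at placeholder release locations, re-released every day
release_lons = [-28.0] * 15
release_lats = [38.5] * 15
pset = ParticleSet(fieldset=fieldset, pclass=JITParticle,
                   lon=release_lons, lat=release_lats,
                   repeatdt=timedelta(days=1))
```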