geoschem / integrated_methane_inversion

Integrated Methane Inversion workflow repository.
https://imi.readthedocs.org
MIT License

[BUG/ISSUE] Out-of-memory issue in inversion scripts #82

Closed · msulprizio closed this issue 1 year ago

msulprizio commented 1 year ago

Sarah Hancock reported seeing the error messages below when submitting the IMI on the South America domain via the SLURM scheduler on Harvard's Cannon cluster.

/var/slurmd/spool/slurmd/job29243271/slurm_script: line 94: 35085 Killed                  python postproc_diags.py $RunName $JacobianRunsDir $PrevDir $StartDate
/var/slurmd/spool/slurmd/job29243271/slurm_script: line 109:  7987 Killed                  python calc_sensi.py $nElements $Perturbation $StartDate $EndDate $JacobianRunsDir $RunName $sensiCache

After cancelling the job, the following additional error message printed out:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=29243271.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
msulprizio commented 1 year ago

Similar issues reported online suggest increasing the memory and/or cores requested in the job script.
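For illustration, increasing the memory and core request is done with `#SBATCH` directives in the job script; the values below are placeholders, not tested recommendations for this domain:

```shell
#!/bin/bash
#SBATCH --mem=100G         # total memory for the job (placeholder value)
#SBATCH -c 24              # cores; calc_sensi.py parallelizes over 24 hours
#SBATCH -t 1-00:00         # wall time, D-HH:MM (placeholder value)
```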

sarahhancock commented 1 year ago

After allocating more memory, I no longer get this error, but instead the script just never actually goes past the postproc_diags.py part. I get "Calling calc_sensi.py" and nothing else until it runs out of time. Do you have an idea of approximately how much time this should take? I have only been allocating a few hours so far. Thanks so much for your help!

msulprizio commented 1 year ago

@sarahhancock reported that it seems to be working, but calc_sensi.py is proceeding very slowly.

It has calculated sensitivity values up to 20190107_02 in 21 hours. If my math is correct, that would take 100 days to run! (7 hrs/day * 365 days = 2555 hours = 106 days)

@laestrada @djvaron Does this timing sound like it would be expected? Sarah is running her inversions for the entire South American domain as opposed to the smaller domains (e.g. Permian Basin) that we've tested on. Lucas, did you encounter this slow down with the Russian domain? Sarah has proposed giving it more cores to speed things up.

laestrada commented 1 year ago

Hi @msulprizio and @sarahhancock,

I never got to the point of calculating sensitivities for the Russian domain, so I never tried running that section of code with such a large domain. I do see that calc_sensi is parallelized to use up to 24 cores (one per hour of the day), so that should help. I am not sure what the bottleneck in that code is -- I would need to do some timing tests.

sarahhancock commented 1 year ago

Thank you both! The problem is that when I use more cores, I get the error: "joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker."

After Googling, I found that this can happen when there is not enough RAM (https://github.com/scikit-learn-contrib/skope-rules/issues/18). I am requesting 100 GB with sbatch and this still happens!

Any advice about how to fix this would be greatly appreciated, thanks!

djvaron commented 1 year ago

I wonder if it's an I/O issue.

calc_sensi.py includes a pseudocode summary. The way it works is, for every hour of the inversion period, it compares the 3D base run output with the 3D output from all of the perturbation runs. So let's say you're doing a 1-year inversion. That's 8760 hours (the hours get parallelized as @laestrada mentioned). If the domain is very large, say 1000 elements, then calc_sensi will attempt to load 1001 netcdf files containing daily output, and then select the hour of interest. I think this will happen across 24 processes (for the 24 hours of each day) simultaneously. And those netcdf files are likely pretty big, though I'm not sure exactly how big.
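The per-hour comparison described above can be sketched in numpy. The finite-difference form (perturbation run minus base run, divided by the perturbation) and the array shapes are assumptions for illustration, not IMI's exact code:

```python
import numpy as np

def hour_sensitivities(base_hour, pert_hours, perturbation):
    """Finite-difference sensitivity of each state-vector element for one hour.

    base_hour    : (lev, lat, lon) array from the base run
    pert_hours   : (n_elements, lev, lat, lon) stacked perturbation-run output
    perturbation : scalar perturbation applied to each element
    """
    # One subtraction per element: (pert - base) / perturbation
    return (pert_hours - base_hour[np.newaxis, ...]) / perturbation

# Tiny illustrative domain: 3 state-vector elements on a 2x2x2 grid
base = np.zeros((2, 2, 2))
pert = np.full((3, 2, 2, 2), 0.5)
sensi = hour_sensitivities(base, pert, perturbation=0.5)
print(sensi.shape)  # (3, 2, 2, 2), every value 1.0
```

The memory pressure comes from holding n_elements + 1 of these 3D fields per hour, across up to 24 hours at once.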

One obvious update is to change the xr.load_dataset() statements in calc_sensi to xr.open_dataset() -- I think this would prevent loading the full netcdf file into memory every time.
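A minimal sketch of the lazy pattern that change enables, using a hypothetical variable name and a tiny synthetic file rather than real IMI output:

```python
import numpy as np
import xarray as xr

def load_hour(path, hour, var="SpeciesConc_CH4"):
    """Read a single hour lazily instead of pulling the whole file into RAM."""
    # open_dataset defers reading; .isel selects one time step and .load()
    # reads only that slice. The `with` block closes the file handle.
    with xr.open_dataset(path) as ds:
        return ds[var].isel(time=hour).load()

# Demo with a tiny synthetic file (names and shape are illustrative)
demo = xr.Dataset({"SpeciesConc_CH4": (("time", "lat", "lon"),
                                       np.arange(8.0).reshape(2, 2, 2))})
demo.to_netcdf("demo_hour.nc")
print(float(load_hour("demo_hour.nc", hour=1).sum()))  # 22.0
```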

@sarahhancock How big is your state vector? Is it ~1000 elements?

@msulprizio @laestrada Does this give you any other ideas for parallelization/memory optimization?

sarahhancock commented 1 year ago

Thanks for your help! My state vector is 600 elements. Unfortunately just switching to xr.open_dataset() still gives me the error. I am trying different numbers for "n_jobs" in the line: results = Parallel(n_jobs=-1)(delayed(process)(hour) for hour in hours) For example, it works when n_jobs=1 (since I think that means it's not parallelized at all - it's just very slow), but not when n_jobs = -1 (all CPUs used) or n_jobs = -2 (all but 1 CPUs). Any other suggestions would be greatly appreciated!

Update to this: n_jobs=3 and 4 work, but n_jobs=6, 12, and 24 do not.
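For context, n_jobs caps how many worker processes run at once, and each worker holds its own open datasets, so peak memory grows roughly linearly with it. A toy joblib sketch mirroring the quoted line (the process body is a stand-in for the real per-hour work):

```python
from joblib import Parallel, delayed

def process(hour):
    # Stand-in for the real per-hour work, which opens ~n_elements files;
    # each concurrent worker adds its own copy of that memory footprint.
    return hour * 2

hours = range(24)
# n_jobs trades speed for memory: 4 workers here instead of all 24 hours at once
results = Parallel(n_jobs=4)(delayed(process)(h) for h in hours)
print(results[:3])  # [0, 2, 4]
```

This matches the observation above: small n_jobs fits in RAM, large n_jobs multiplies the footprint until the OS kills a worker.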

laestrada commented 1 year ago

@sarahhancock I think this might be because Python isn't garbage collecting the old Datasets after they are loaded. A couple of things you could experiment with:
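The specific suggestions are not preserved above. Purely as an illustration of the garbage-collection idea (an assumption, not necessarily what was suggested here), one common pattern is to copy the needed slice out as a plain numpy array, close the Dataset, and collect garbage before the next file; the variable name is hypothetical:

```python
import gc
import numpy as np
import xarray as xr

def hour_values(path, hour, var="SpeciesConc_CH4"):
    # Copy the slice into a plain numpy array so nothing retains a
    # reference to the Dataset, then close it and reclaim memory.
    with xr.open_dataset(path) as ds:  # closed on exit from the block
        values = np.array(ds[var].isel(time=hour))
    gc.collect()  # free the closed Dataset before opening the next file
    return values

# Tiny demo file (illustrative, not IMI output)
xr.Dataset({"SpeciesConc_CH4": (("time", "lat", "lon"),
            np.ones((2, 2, 2)))}).to_netcdf("demo_gc.nc")
print(hour_values("demo_gc.nc", 0).shape)  # (2, 2)
```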

sarahhancock commented 1 year ago

@laestrada Hi Lucas, thank you so much for the suggestions. I tried these, but unfortunately I kept getting the same error. I think the problem is really just that my domain is so large. I tried rewriting the code so that no files are opened in parallel (i.e. I save out the sensitivities for each element-hour pair in a large numpy array), but this array requires over 200 GB for just 1 day. I think this means that running all 24 processes at once would also need over 200 GB of memory, even with garbage collection and everything being closed correctly. For now I am running it as originally written (i.e. 24 hours in parallel) on huce_bigmem with 200 GB and 24 cores; it seems to be working, but it is still slow (between 10 minutes and 1 hour per day).

sarahhancock commented 1 year ago

As a follow-up question: I rewrote the calc_sensi.py script to get it down to 5 min/day by moving the file reading outside of the parallelization and saving some values in a numpy array, but this requires around 50 GB more memory. I set it up to run 1 month at a time on bigmem, so I can run all the months at the same time; each month takes around 2.5 hours. This seemed great to me!
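A simplified numpy sketch of that restructuring: the day's output is read into arrays up front, and the sensitivity calculation becomes one vectorized operation that never touches the filesystem. Shapes and the finite-difference formula are assumptions for illustration:

```python
import numpy as np

def day_sensitivities(base_day, pert_day, perturbation):
    """base_day: (24, lev, lat, lon); pert_day: (n_elements, 24, lev, lat, lon).

    All file reading happens before this call, so no parallel worker
    ever reopens a netcdf file; the cost is holding the whole day in RAM.
    """
    return (pert_day - base_day[np.newaxis, ...]) / perturbation

# Toy stand-in for one day of a 600-element state vector on a small grid
base = np.zeros((24, 1, 2, 2), dtype=np.float32)
pert = np.full((600, 24, 1, 2, 2), 0.5, dtype=np.float32)
sensi = day_sensitivities(base, pert, perturbation=0.5)
print(sensi.shape)  # (600, 24, 1, 2, 2)
```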

However, each file is 9.6 GB and there are 365*24 files, so I don't have enough disk space to write them all (that's 84 TB in total)! Any advice for how to deal with this? I tried file compression with ncks and nccopy but got HDF errors, and I'm not sure why.

laestrada commented 1 year ago

Hi Sarah, can you send the command you were using and the exact error? Here is a list of commonly used tools for nc4 compression as well.

Also, on the storage issue above: if you feel like trying another option, I came across zarr, which claims the "ability to store and analyze datasets far too large to fit onto disk".

yantosca commented 1 year ago

@sarahhancock @msulprizio @djvaron @laestrada: We have removed several important memory leaks in GEOS-Chem 14.1.1/HEMCO 3.6.1 and later versions. When the IMI is updated to these versions you may see lower memory usage.