arttumiettinen / pi2

C++ library and command-line software for processing and analysis of terabyte-scale volume images locally or on a computing cluster.

Processes being killed by the cgroup out-of-memory handler when running on RA cluster #12

Closed nclprz closed 1 year ago

nclprz commented 1 year ago

Hi!

I am encountering an issue when trying to stitch some 3x3 scans acquired at TOMCAT on the RA cluster. I create a stitch_settings.txt file with the stitch_settings_from_tomcat_scans.py script and then run the nr_stitcher.py script from within the same folder. The volumes are loaded and binned correctly, but when the job is submitted to the cluster it gets killed with this message:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=329325.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

This goes on for a few seconds (jobs being re-submitted and then killed again) until it finally gives up because the job has been re-submitted too many times. In case it helps identify the issue: I stitched some other volumes during a beamtime last September, and I believe I followed the same steps without encountering any errors. Am I misremembering a step, or has something changed?
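For reference, the commands I ran were roughly the following (the path is a placeholder, not the exact invocation):

cd /path/to/scan_folder
python stitch_settings_from_tomcat_scans.py    # generates stitch_settings.txt from the TOMCAT scans
python nr_stitcher.py                          # bins the tiles and submits the stitching jobs to Slurm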

Best, Niccolo

arttumiettinen commented 1 year ago

The problem is most probably caused by an unexpectedly large tile image size. There is a fix for this in the latest experimental version. Please try re-compiling from the experimental branch (git checkout experimental && make -j8 NO_OPENCL=1) and then re-running the stitching process. Let me know if that solves the problem.
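For example, assuming pi2 is already cloned, the rebuild could look roughly like this (exact steps may differ depending on how you installed it):

cd /path/to/pi2
git fetch
git checkout experimental
git pull                      # make sure the branch is up to date
make -j8 NO_OPENCL=1          # rebuild without OpenCL support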

Another temporary work-around would be to reduce the max_block_size parameter.
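For instance, in the generated stitch_settings.txt the value could be lowered to something like the line below (the exact value, and the section the setting belongs to, depend on your dataset; check where max_block_size appears in your file):

max_block_size = 1500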

nclprz commented 1 year ago

Using the latest experimental version fixed the issue. Thank you! :)