Kitware / UPennContrast
https://upenn-contrast.netlify.com/
Apache License 2.0

Intermittent OOM issues for Girder #489

Closed: arjunrajlab closed this issue 1 year ago

arjunrajlab commented 1 year ago

We get out of memory issues that seem to kill Girder intermittently:

[screenshot of the out-of-memory error that kills Girder]

The issue seems to occur when multiple people are using the server, especially with larger files, but it is somewhat intermittent.

arjunrajlab commented 1 year ago

@manthey I found a file that is able to reproduce the issue! If you load this file:

https://www.dropbox.com/scl/fi/xduq3y1buv6h8acu2rxbo/20230807_133502_470_LVE016-HKC-1dpi-HSV-1-WT-ICP4_GFP-MX1_Cy3-IFNb_A594-IRF3_Cy5-DDX58_Cy7_Stitched.nd2?rlkey=kf1vjnyl4nefq2bn2cu37env6&dl=0

then just scrub rapidly up and down in Z before the tile cache builds. That seems to cause the issue.
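
For anyone who wants a scripted approximation of that scrubbing, a minimal sketch along these lines (assuming large_image with the nd2 tile source and psutil are installed, and that the Dropbox file has been downloaded to a local path) pulls tiles from many frames in quick succession before the cache is warm and logs the process's resident memory:

# Rough local approximation of scrubbing in Z (a sketch, not the team's
# exact test): request tiles from many frames back to back before the tile
# cache is populated, and watch this process's resident memory.
# Assumes large_image, large-image-source-nd2, and psutil are installed;
# FILE_PATH is a hypothetical local copy of the Dropbox file.
import os

import large_image
import psutil

FILE_PATH = 'stitched_test_file.nd2'

proc = psutil.Process(os.getpid())
ts = large_image.getTileSource(FILE_PATH)
meta = ts.getMetadata()
num_frames = len(meta.get('frames', [])) or 1
max_level = meta['levels'] - 1

# Sweep up and then back down through the frames, fetching the top-left
# tile of the highest-resolution level each time.
for frame in list(range(num_frames)) + list(range(num_frames - 1, -1, -1)):
    ts.getTile(0, 0, max_level, frame=frame)
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print('frame %3d  rss %8.1f MB' % (frame, rss_mb))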

arjunrajlab commented 1 year ago

Here's what I found in the syslog:

arjun@raj:/var/log$ grep -E 'OOM|kill' syslog
Aug  8 12:46:26 raj kernel: [8295223.437605] girder invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
Aug  8 12:46:26 raj kernel: [8295223.437630]  oom_kill_process.cold+0xb/0x10
Aug  8 12:46:26 raj kernel: [8295223.437839] [   7384]  1000  7384   116627      164   118784       62             0 gsd-rfkill
Aug  8 12:46:26 raj kernel: [8295223.438051] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope,task=girder,pid=1641364,uid=0
Aug  8 12:46:26 raj systemd[1]: docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope: A process of this unit has been killed by the OOM killer.
arjun@raj:/var/log$

arjunrajlab commented 1 year ago

(Reproduced on two separate machines.)

manthey commented 1 year ago

There are several things going on here. I'll break them out into individual issues.

manthey commented 1 year ago

This particular test file exposes a number of the problems because each frame is larger than 2k x 2k, the file is not optimally chunked internally, and it has 6 channels.
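
For reference, those properties are visible in the large_image metadata; a quick sketch (assuming the file is available locally and the nd2 source is installed) of how to check them:

# Inspect why the file is awkward: per-frame dimensions, internal tile
# (chunk) size, frame count, and channel count, as reported by large_image.
# 'stitched_test_file.nd2' is a hypothetical local copy of the test file.
import large_image

ts = large_image.getTileSource('stitched_test_file.nd2')
meta = ts.getMetadata()

print('frame size:', meta['sizeX'], 'x', meta['sizeY'])
print('tile size :', meta['tileWidth'], 'x', meta['tileHeight'])
print('frames    :', len(meta.get('frames', [])))
print('channels  :', meta.get('IndexRange', {}).get('IndexC', 'n/a'))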

arjunrajlab commented 1 year ago

@manthey could you document here the progress to date and what remains? I think the histogram pre-cache and pixel value requests are important next steps, but I'm not sure.

manthey commented 1 year ago

I discovered a bug (fix is https://github.com/girder/large_image/pull/1283) where, depending on the sequence of operations, nd2 file handles could be closed prematurely. This could be forced by asking for a bad style, but could also happen under other conditions. If we saw segfaults in syslog that weren't OOM issues, this could have been the cause. The proximate problem was reported as a numpy operation attempting to access memory it wasn't allowed to access.
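
For context, "asking for a bad style" here means something like opening the source with a style that references a band the file does not have. A purely illustrative sketch (band index 99 is arbitrary, and this is not the exact reproduction used for the bug):

# Illustrative only: a "bad style" request with large_image, i.e. a style
# that points at a band the file does not contain (99 is arbitrary).
# Assumes large_image and the nd2 tile source are installed; the path is a
# hypothetical local copy of the test file.
import large_image

bad_style = {'bands': [{'band': 99, 'palette': '#ff0000'}]}
ts = large_image.getTileSource('stitched_test_file.nd2', style=bad_style)
ts.getTile(0, 0, 0)  # requesting a tile with the invalid style exercises the error path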

manthey commented 1 year ago

With the various changes and follow-up issues, I recommend we close this issue; if we see Girder crash again, we can create a new issue with the syslog entry for the crash (whether an OOM kill or a segfault).

manthey commented 1 year ago

For reference, this has generated issue #502 for precaching histograms, #503 for making fewer pixel requests, plus a variety of PRs in large_image to address memory and stability.
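
As a rough illustration of what the #502 precaching could look like: walk every frame once and compute its histogram up front so later viewer requests hit a cache instead of recomputing. A sketch assuming large_image is installed and that histogram() accepts a frame keyword in the installed version:

# Sketch of histogram precaching (the idea behind issue #502), not the
# implementation that was eventually filed there.  Assumes large_image is
# installed and that histogram() accepts a frame keyword; the path is a
# hypothetical local copy of the test file.
import large_image

ts = large_image.getTileSource('stitched_test_file.nd2')
num_frames = len(ts.getMetadata().get('frames', [])) or 1

for frame in range(num_frames):
    ts.histogram(frame=frame, bins=256)
    print('warmed histogram for frame', frame)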

arjunrajlab commented 1 year ago

Excellent! Yes, I think we can close this issue for now. We were still seeing some OOM stability issues when multiple users were accessing the server, but we also just increased the server's memory to 128 GB and everything seems fine so far.