murfalo opened this issue 2 weeks ago
Yes, looks like you are running out of address space for the memory buffer that is used to communicate partial results between threads. The output of RLIMIT_NPROC was added only because this seemed to be the limit one is most likely to hit, I don't recall address space being a problem before.
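To confirm that the address-space limit is the one being hit, it can be queried at runtime. A minimal sketch using Python's standard `resource` module; note that `RLIMIT_AS` is the limit `ulimit -v` controls, but the shell reports it in KB while `getrlimit()` returns bytes:

```python
import resource

# RLIMIT_AS is the address-space limit that `ulimit -v` controls.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

def fmt(limit):
    """Render a limit value the way `ulimit` would."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

print(f"address-space limit: soft={fmt(soft)} hard={fmt(hard)}")
```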
300MB stack is excessive...
Unusual, but it may have been set during testing. I'm more intrigued by the low limit on address space (or virtual memory) that is causing the problem here; I'm more used to seeing this default to "unlimited" on any reasonably modern hardware.
For context, this `ulimit -a` output is from an HPC system head node. The limits were imposed by the system administrators to enforce fair usage. Perhaps unsurprisingly, OpenBLAS is not the only library or program that the `ulimit -v` setting causes to crash.
I've been working with them to find a solution (some of the head nodes have a hard limit of 8-16 GB), but in the meantime I was curious why OpenBLAS was reporting an issue with `ulimit -u` when `ulimit -v` seemed to be the root of the issue. Would it be possible to modify OpenBLAS to report the correct problem, and/or suggest possible solutions (e.g., reducing `OPENBLAS_NUM_THREADS`)? This could be helpful to any future users who run into this issue.
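As a stopgap on such a constrained node, the thread count can be capped before NumPy loads OpenBLAS. A sketch, assuming a standard NumPy built against OpenBLAS; the variable must be set before the first `import numpy` in the process, because OpenBLAS reads it when it initializes:

```python
import os

# OpenBLAS reads OPENBLAS_NUM_THREADS when it initializes, which happens on
# the first `import numpy` -- so the assignment must come before that import.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # deliberately after the environment variable is set

a = np.ones((100, 100))
print((a @ a)[0, 0])  # prints 100.0, computed single-threaded
```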
Thanks for your help so far!
This is simply because issues with `ulimit -u` are the only ones documented on the fork(2) manpage to raise EAGAIN, and the only cause of fork-related early aborts encountered so far.
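Since EAGAIN alone cannot distinguish an RLIMIT_NPROC problem from other resource shortages, one option is to hedge the error message and name both likely culprits. A hypothetical sketch of that approach (not OpenBLAS's actual code, which is C; `fork_with_hint` is an illustrative name):

```python
import errno
import os

def fork_with_hint():
    """Fork, but translate EAGAIN into a message naming both likely ulimits."""
    try:
        return os.fork()
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            raise RuntimeError(
                "fork failed with EAGAIN: check both `ulimit -u` (max user "
                "processes) and `ulimit -v` (address space)"
            ) from exc
        raise

if __name__ == "__main__":
    pid = fork_with_hint()
    if pid == 0:
        os._exit(0)      # child: exit immediately
    os.waitpid(pid, 0)   # parent: reap the child
    print("fork succeeded")
```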
I am attempting to run a simple Python script:
This fails due to:
(full log attached, Python 3.9.18, numpy 1.26.4, libopenblas 0.3.24)
Here are my initial ulimits:
The error appears to be due to thread allocation, however:
Running `ulimit -v unlimited`, or raising the limit to at least around 67108684, fixes this issue. This appears to be related to an issue reported in 2022 for NumPy. That thread appears to be dead, and its final comment suggested bringing the problem up here. Any ideas what might be happening?