JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Workers on cluster terminating #41208

Open WouterJRV opened 3 years ago

WouterJRV commented 3 years ago

Maybe it is interesting to post this here as well; see the original discussion on Discourse.

Recently, I have been running a number of extensive simulations with timeevolution.mcwf from the QuantumOptics package https://github.com/qojulia/QuantumOptics.jl/blob/e8a3ed060278bb7e4790474053cbe06ce55d656e/src/mcwf.jl#L52-L89 on a cluster. Inside a @distributed for loop, I ask each worker/core to run a separate simulation that can take on the order of a dozen hours, often within an additional nested for loop. I noticed crashes reporting 'worker terminated' without a clear error message. The behavior is hardware dependent: crashes occur less frequently on a newer cluster, and the same code can crash or not after changing just one parameter.

Since the gist of this function is solving a heavy DiffEq problem (systems of roughly 50000 complex equations, with strict tolerances like 1e-18) with jumps, using the DifferentialEquations.jl package, I also opened an issue on their GitHub, but they referred me here. I managed to solve the problem back then by breaking the run up into smaller pieces, but I don't like the unpredictability for the future. Why wasn't there a clear error like "out of memory" or similar? How can I prevent, or at least anticipate, such behavior in the future?

I'm willing to send the exact code if there is interest, but only in private for now. The version is 1.5.3.

mgkuhn commented 3 years ago

If you suspect out-of-memory problems, you should probably first check which exact operating system and local configuration you use, how it has been configured to deal with OOM situations (e.g., the Linux OOM killer can be tweaked in a number of ways, and often is in HPC clusters), and what the operating system's log files and process monitoring tools say about how memory usage evolved during the runtime of your process. Local system administrators may be of more help at that stage. Recall that Linux, for example, over-commits memory by default, and in such cases application software (such as Julia's memory manager) is not actually informed that its process ran out of memory; it just gets killed, and you have to consult syslog.
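
As a rough illustration of the checks described above, here is a minimal Python sketch for a Linux node. It reads the kernel's overcommit mode from /proc/sys/vm/overcommit_memory and scans log text for OOM-killer reports of the form "Killed process <pid> (<name>)". The helper names (find_oom_kills, read_overcommit_mode) and the list of candidate log paths are assumptions made for this example; the actual log location varies by distribution, reading it may require root, and on systemd-based systems the kernel log may only be available through the journal.

```python
import re
from pathlib import Path

# The Linux OOM killer reports its victim with a line containing
# "Killed process <pid> (<name>)"; the surrounding text varies by kernel version.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for OOM-killer reports found in log_text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the overcommit mode (0 heuristic, 1 always, 2 never), or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("overcommit mode:", read_overcommit_mode())
    # Candidate log locations are assumptions; adjust them for the cluster at hand.
    for candidate in ("/var/log/syslog", "/var/log/messages", "/var/log/kern.log"):
        try:
            text = Path(candidate).read_text(errors="replace")
        except OSError:
            continue  # missing file or insufficient permissions
        for pid, name in find_oom_kills(text):
            print(f"{candidate}: OOM killer terminated pid {pid} ({name})")
```

If a worker's PID shows up in such a report at the time of the 'worker terminated' message, that would point to the OOM killer rather than to a bug in Julia itself. An empty result is not conclusive, since the logs may be rotated or stored elsewhere.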