HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org

Understanding a future's log output #51

Closed rimorob closed 3 years ago

rimorob commented 3 years ago

I have a future failing due to lack of RAM. The sessioninfo file seems to contain only R session info, whereas the log file doesn't tell me about the remote host on which the future was running (unless the error happens locally, during preparation) or the SLURM settings, such as memReq. I know what I'm passing, but I don't know how Slurm is actually being called. In short, I'm missing a lot of necessary info. There's an intriguing option in batchtools, batchtools.verbose, but I have no idea how to set it when working with doFuture. The entire contents of the log file are below:

Error: package or namespace load failed for ‘utils’:
 .onLoad failed in loadNamespace() for 'utils', details:
  call: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
  error: cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory'
Error: package or namespace load failed for ‘stats’:
 .onLoad failed in loadNamespace() for 'utils', details:
  call: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
  error: cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory'
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
Error: .onLoad failed in loadNamespace() for 'utils', details:
  call: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
  error: cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory'
Execution halted
Warning message:
system call failed: Cannot allocate memory

HenrikBengtsson commented 3 years ago

It looks like you've got a corrupt R installation on those machines, are using non-standard package library paths (e.g. invalid settings in R_LIBS, ...), or something along those lines.

R options specific to future backends are set via options(), completely independently of the foreach code you write.

HenrikBengtsson commented 3 years ago

Alternatively, you request too little memory for R to even launch.

rimorob commented 3 years ago

Yeah, I just saw that buried in an example. It would be strange if it's R, since the machines are provisioned by a script. What's stranger is that this always happens on a specific run: the first 5 foreach calls always succeed. Moreover, since the machines are configured to stay up for 15 minutes, they are almost certainly the same machines that worked for the previous foreach runs. Lastly, even uname fails, and that, I assume, is run first thing to determine the Linux version. My current bet is that I'm not setting memReq properly and some node runs out of memory. Well, I'll set every debugging option to TRUE and see if I can get a bit more detail.

(Sent by email in reply to Henrik Bengtsson's comment of Oct 5, 2020, 11:19 AM.)

HenrikBengtsson commented 3 years ago

I don't know where 'uname' is called, but it's probably R itself.

You can also add your own poor man's debug output to your job script template to collect more info on the host and the job.
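As a rough illustration of that kind of debug output, here is a minimal sketch of a few lines that could sit near the top of a bash job script. It assumes a Slurm cluster, so the `SLURM_*` variables are only set inside a real job and fall back to "unset" otherwise. The function name `log_job_context` is hypothetical, and where exactly it belongs in your batchtools template depends on how that template is written.

```shell
#!/bin/bash
# Hypothetical helper (not part of batchtools or doFuture): print details
# about the host and the Slurm job so that they end up in the job's log file.
log_job_context() {
  echo "== job context =="
  # Fall back to uname -n if hostname is not installed
  echo "host:      $(hostname 2>/dev/null || uname -n)"
  echo "date:      $(date)"
  # Slurm sets these inside a job; outside of one they are reported as unset
  echo "job id:    ${SLURM_JOB_ID:-unset}"
  echo "node list: ${SLURM_JOB_NODELIST:-unset}"
  echo "mem/node:  ${SLURM_MEM_PER_NODE:-unset}"
  echo "mem/cpu:   ${SLURM_MEM_PER_CPU:-unset}"
  # Memory snapshot of the node, if the free utility is available
  command -v free >/dev/null 2>&1 && free -m
  # Virtual memory limit that applies to this shell and its children
  echo "ulimit -v: $(ulimit -v)"
}

log_job_context
```

Printing this before R is launched means the information is still recorded when R itself fails to start, as in the log above.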

HenrikBengtsson commented 3 years ago

The Slurm job accounting database most likely also has some clues.
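One way to look there is Slurm's `sacct` command. The sketch below assumes that job accounting is enabled on the cluster and that `sacct` is on the PATH of the host you run it from. The function name `job_summary` and the job ID in the usage comment are placeholders.

```shell
#!/bin/bash
# Accounting fields to report; ReqMem (requested memory) and MaxRSS (peak
# resident memory of a job step) are the ones most relevant here.
SACCT_FIELDS="JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed,NodeList"

# Hypothetical helper: summarize one job from the Slurm accounting database.
job_summary() {
  local job_id="$1"
  if ! command -v sacct >/dev/null 2>&1; then
    echo "sacct not found; is Slurm accounting available on this host?" >&2
    return 1
  fi
  sacct -j "$job_id" --format="$SACCT_FIELDS"
}

# Usage (the job ID is a placeholder):
# job_summary 123456
```

Comparing ReqMem with MaxRSS, together with the State and NodeList columns, should show what memory request Slurm actually received and which node the job ran on.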

rimorob commented 3 years ago

Closing this issue as the "log output" part is resolved. Will start another, closer to the actual problem.