USCbiostats / slurmR

slurmR: A Lightweight Wrapper for Slurm
https://uscbiostats.github.io/slurmR/

“An error has occurred when calling `silent_system2`:” #29

Open thistleknot opened 3 years ago

thistleknot commented 3 years ago

https://stackoverflow.com/questions/65402764/slurmr-trying-to-run-an-example-job-an-error-has-occurred-when-calling-silent

I set up a Slurm cluster and can issue `srun -N4 hostname` just fine.

I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR")

I'm following the second example (Example 3) in the README: https://github.com/USCbiostats/slurmR

here are my files

cat slurmR.R

library(doParallel)
library(slurmR)

cl <- makeSlurmCluster(4)

registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
foreach(i = 1:nrow(m), .combine = rbind) %dopar%
  (m[i, ] / mean(m[i, ]))

stopCluster(cl)
print(m)

cat rscript.slurm

#!/bin/bash
#SBATCH --output=slurmR.out

cd /mnt/nfsshare/tankpve0/
Rscript --vanilla slurmR.R

cat slurmR.out

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
slurmR default option for `tmp_path` (used to store auxiliar files) set to:
  /mnt/nfsshare/tankpve0
You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.
Submitting job... jobid:18.
Slurm accounting storage is disabled
Error: An error has occurred when calling `silent_system2`:
Warning: An error was detected before returning the cluster object. If submitted, we will try to cancel the job and stop the cluster object.
Execution halted
gvegayon commented 3 years ago

This seems to be an issue with Slurm's configuration. Could you try the following:

library(slurmR)
Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

That is the bare minimum. Creating a cluster object may be more complicated.
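
It may also help to turn on slurmR's verbose mode first so the commands it issues get echoed to the console. A rough sketch (the option functions below are the ones documented under ?opts_slurmR; if your installed version names them differently, adjust accordingly):

library(slurmR)

# Show the current options (tmp_path, job name, verbose/debug state).
opts_slurmR

# Echo the shell commands slurmR issues while the job runs.
opts_slurmR$verbose_on()

Slurm_lapply(1:10, function(x) runif(10), njobs = 4)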

ekernf01 commented 3 years ago

I'm also seeing this issue. It appears with the bare minimum example you just posted. When I set plan = "none" and submit the job by hand, I see the jobs on squeue, the log shows no errors, and the answers are all there in the R data files. But when I go to collect, I get

No job found. This may be a false negative as the job may still be on it's way to be submitted.. Waiting 10 seconds before retry.
Error: No job found. This may be a false negative as the job may still be on it's way to be submitted.

I just set up slurm on my laptop for testing, so it certainly could be a problem with my configuration. But given that it all ran and the answers are right there as expected, it seems like Slurm_collect ought to be able to find them.

Edit: I'm using R 4.0.0, slurm-wlm 17.11.2, Ubuntu 18.04, and slurmR 0.4.2.
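
For reference, the workflow I'm describing is roughly the following (a sketch, not my exact script; the arguments are as I remember them):

library(slurmR)

# Write the job files but do not submit them (plan = "none").
job <- Slurm_lapply(1:10, function(x) runif(10), njobs = 4, plan = "none")

# ...then submit the generated batch script by hand with sbatch, wait for it
# to disappear from squeue, and try to gather the results:
ans <- Slurm_collect(job)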

gvegayon commented 3 years ago

Thanks, @ekernf01, I'll try to reproduce your error using Docker. I'm not sure what could be causing it. In the case of @thistleknot, I believe this is an issue with the setup of his cluster. I currently don't have access to a cluster that allows using ssh between nodes (which is what makeSlurmCluster relies on). I am very aware of these issues and will try to solve them ASAP.
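
In the meantime, a quick way to check whether node-to-node ssh works for your user (the part makeSlurmCluster depends on) is something like this sketch, run from the head node in R:

# Ask Slurm for a couple of node names, then try a non-interactive ssh to each.
nodes <- unique(system2("srun", c("-N2", "hostname"), stdout = TRUE))
for (n in nodes) {
  out <- system2("ssh", c("-o", "BatchMode=yes", n, "hostname"),
                 stdout = TRUE, stderr = TRUE)
  message(n, ": ", paste(out, collapse = " "))
}

If any of those ssh calls prompt for a password or fail, makeSlurmCluster will fail too.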

ekernf01 commented 3 years ago

If it's helpful in setting up the container, I used this guide to set up my slurm.

https://blog.llandsmeer.com/tech/2020/03/02/slurm-single-instance.html

ekernf01 commented 3 years ago

Can't stop thinking of futurama. https://futurama.fandom.com/wiki/Slurm

edisto69 commented 3 years ago

I am experiencing the same issue. I built an Odroid XU4-based cluster (an XU4 front-end and 12 MC1s as the nodes). When I submit:

job <- Slurm_EvalQ(slurmR::WhoAmI(), njobs = 20, plan = "submit")

It says the job was submitted. Looking at slurmctld.log, I can see the jobs were dispatched to the 12 nodes, with the remaining 8 jobs assigned as the first ones finished, and those then completed as well. But when I enter `job` or `res <- Slurm_collect(job)`, I get:

Slurm accounting storage is disabled Error: An error has occurred when calling 'silent_system2':

The same issue occurs with the minimal Slurm_lapply example above. Any suggestions will be greatly appreciated!

The system is connected to an NFS server, but I am running R on the front-end, not on the server.

gvegayon commented 3 years ago

@edisto69 @ekernf01 @thistleknot I believe you may have found a bug. It could still be that your systems have an issue or two with the Slurm config (which I will check ASAP to see how to give it the right treatment), but slurmR was supposed to be more explicit about the type of error. It turns out I was not capturing stderr when needed, which I now am.

I would appreciate it if you could install this version instead, re-run your code, and report back whatever you see.

To install this version, you can either use git:

git clone --branch issue029 https://github.com/USCbiostats/slurmR.git
R CMD INSTALL slurmR

Or download the zip, unzip it, and then install, e.g.,

wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029

I appreciate your help! cc @USCbiostats/core-c

edisto69 commented 3 years ago

Thanks for following up!

Now, when I run:

library(slurmR)
Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

I get the more specific error message:

Error: An error has occurred when calling system2("sacct", flags, stdout = TRUE, stderr = TRUE)
Slurm accounting storage is disabled
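
For what it's worth, a quick way to confirm whether accounting is the culprit, independently of slurmR, is something like this sketch:

# Does a plain sacct call work at all on this machine?
system2("sacct", stdout = TRUE, stderr = TRUE)

# Which accounting storage backend is configured?
# "accounting_storage/none" means accounting is disabled.
cfg <- system2("scontrol", c("show", "config"), stdout = TRUE)
grep("AccountingStorage", cfg, value = TRUE)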

edisto69 commented 3 years ago

I found one reason I was having an issue (by looking at slurmd.log). I have the same single user on all nodes and on the front end, but they don't have a shared home folder. I'm trying to figure out how (or whether) I can have that user share a home folder on the NFS share.

gvegayon commented 3 years ago

Thank you @edisto69, I just pushed an update. Could you try to install it again? Thanks

edisto69 commented 3 years ago

I am probably messing you up by changing things...I'm still working on getting R installed on the NFS server so all the nodes have access, but I have tried it a few times after a new R installation using:

install.packages("devtools") devtools::install_github("USCbiostats/slurmR")

And I get the generic error:

Error: An error has occurred when calling 'silent_system2':

I hope to have things configured by the end of the week, and I'll try it again.

edisto69 commented 3 years ago

Sorry for spamming the thread...I am pretty sure that my configuration is good now. I just ran the rslurm::slurm_apply example, and got back the expected results.

Running the minimal example that you gave above, I still get:

Error: An error has occurred when calling 'silent_system2':

But the slurmr-job directory now has no errors in the '02-output-' files, and has '03-answer-' files and 'X_0001.rds' to 'X_0004.rds' (now we have Futurama and the X-files...).
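
As a stopgap, since the answers are sitting there, they can presumably be read back by hand. A sketch, assuming the '03-answer-' files are RDS files like the X_*.rds chunks next to them (the directory name below is a placeholder for the actual slurmr-job folder):

job_dir <- "slurmr-job-xxxxxxxxxxxxx"  # placeholder: use the real folder name
files <- sort(list.files(job_dir, pattern = "^03-answer-", full.names = TRUE))
ans <- unlist(lapply(files, readRDS), recursive = FALSE)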

gvegayon commented 3 years ago

Hey @edisto69, thanks for trying that. The issue is that you got the buggy version, not the patched one. You can either install the updated version like this:

wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029

Or, if you want to use devtools, like this:

devtools::install_github("USCbiostats/slurmR", ref = "issue029")

I'll now try to replicate the issue using Docker.

edisto69 commented 3 years ago

Well...it is different.

I ran:

library(slurmR)
slurmR::Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

It now says that it cannot create the slurmR job files in the user's home directory (which is an NFS mount) because permission is denied, yet I can access the directory from the terminal, and rslurm::slurm_map() has no issues setting up its job directory.

For my slurm_map() scripts I have been using /home/user/work as my working directory, where 'user' is a link to the home directory on the NFS mount and 'work' is a link in that directory to a different NFS-mounted folder.

Setting the same working directory for the script above resulted in the same error.
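
One thing that might be worth trying is pointing slurmR's auxiliary files at that same shared location, using the tmp_path option from the start-up message. A sketch (set_tmp_path is the setter listed under ?opts_slurmR; the path is the /home/user/work directory mentioned above):

library(slurmR)

# Write slurmR's auxiliary files to a directory every node can reach.
opts_slurmR$set_tmp_path("/home/user/work")

job <- Slurm_lapply(1:10, function(x) runif(10), njobs = 4)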

gvegayon commented 3 years ago

Thank you very much, @edisto69, I really appreciate all the time you are giving me! I think it would be great if we could talk at more length to see what's going on. Would you be willing to have a conference call to talk about this? If so, feel free to email me at g.vegayon@gmail.com.

Regarding the docker image, @ekernf01, I was able to build one using an existing image with Slurm. It is available at https://hub.docker.com/repository/docker/uscbiostats/slurmr-dev, and the instructions (partial, though) are here.

kgoldfeld commented 2 years ago

Has this problem been resolved? I just started getting this message occasionally (that is, not consistently) when submitting the same job:

Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in UseMethod("get_tmp_path") : 
  no applicable method for 'get_tmp_path' applied to an object of class "c('integer', 'numeric')"
Calls: Slurm_lapply ... wait_slurm.integer -> status -> status.default -> sacct_ -> get_tmp_path

Does the latest development version of slurmR fix this?

Follow-up: I got the development version of slurmR installed on the HPC, but I'm still getting the same error ... any ideas?

jobstdavid commented 2 years ago

Unfortunately, I have the same problem as @kgoldfeld

Submitted batch job 892035
Submitting job...Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in UseMethod("get_tmp_path") : 
  no applicable method for 'get_tmp_path' applied to an object of class "c('integer', 'numeric')"
Calls: Slurm_sapply ... wait_slurm.integer -> status -> status.default -> sacct_ -> get_tmp_path
In addition: Warning messages:
1: In normalizePath(file.path(tmp_path, job_name)) :
  path[1]="/home/jobst/test/slurmr-job-9c9aa2b50d464": No such file or directory
2: `X` is not a list. The function will coerce it into one using `as.list` 
Execution halted

Is there a solution for this already? That would be great!

gvegayon commented 2 years ago

Hey @jobstdavid and @kgoldfeld (and others!), I just pushed what I think is a fix to the master branch. I'd appreciate it if you could install the package and give it a try.

kgoldfeld commented 2 years ago

@gvegayon - I installed the package on our HPC, and did some quick tests. It seems like things are working again - though I will keep you posted in case the errors reappear. Thanks so much for the fix.