Closed kodingkoning closed 1 year ago
A couple of things:
Unfortunately, I don't think you can change `--consCores` and then run `--restart` and have it take any effect. Only the Toil-specific options work in this case. (I agree it'd be really nice to be able to do this sort of thing.)
> 0 jobs are running, 22 jobs are issued and waiting to run

means you have no jobs scheduled or running -- so nothing is happening at all. There are two possible reasons I can think of: 1) your cluster is too busy, or 2) the jobs in the queue are requesting too much memory for any node on your cluster.
If it's 1), you should be able to see that pretty easily: your jobs will be PD in `squeue`. If it's 2), and you can share your whole log, we could probably get a better idea of what's going on.
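To check for 1), `squeue` can list just your pending jobs along with the reason Slurm gives for not scheduling them; the format string below is one reasonable choice (`%i`, `%j`, `%m`, and `%R` are standard `squeue` output fields for job id, name, requested memory, and reason):

```shell
# Show only your PENDING jobs: id, name, requested memory, and the reason
# Slurm has not started them, e.g. (Priority) or (Resources).
# Guarded so it degrades gracefully when run off-cluster.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER" -t PENDING -o "%.10i %.20j %.10m %R"
else
  echo "squeue not found; run this on a Slurm login node"
fi
```

A reason of `(Resources)` on a job that never starts is a good hint that its memory request doesn't fit any node.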
2) wouldn't surprise me since, as of the last release, Cactus is being rather conservative about estimating memory for each job. I'm going to ratchet it down a bit in the next release, hopefully, as I accumulate more test stats. There are some workarounds (like `--consMemory`), but without the logs it's hard to say if they apply.
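For reference, a fresh run using those flags could look like the sketch below. Every path, the seqfile name, and the 16 / 64Gi values are placeholders, not taken from this thread; the memory string follows Toil-style conventions, so check `cactus --help` on your version for the accepted format:

```shell
# Hypothetical fresh-run command; all paths and sizes are placeholders.
# --consCores caps the cores requested by the big consolidation jobs;
# --consMemory overrides Cactus's (currently conservative) memory estimate.
cmd="cactus --batchSystem slurm --consCores 16 --consMemory 64Gi ./jobstore seqfile.txt out.hal"
echo "$cmd"
```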
Finally: I've only had access to a Slurm cluster for testing for a few weeks. The current Cactus release still doesn't work properly on it. The next release (ETA before July) will have much better Slurm support for both the progressive and pangenome versions.
Starting a new run and changing `--consCores` to 16 worked!
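In case it helps anyone sizing `--consCores`: a generic way (not specific to this thread) to get a node's core count is `nproc`; on a cluster, run it inside an allocation (e.g. `srun nproc`) so it reports a compute node rather than the login node:

```shell
# Print the core count of the current machine; on Slurm, wrap with `srun`
# so the number reflects a compute node rather than the login node.
cores=$(nproc)
echo "use --consCores $cores"
```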
There hadn't been any jobs waiting in the Slurm queue, so I think they had failed to submit due to the `-1` issue.
Good to know about the Slurm cluster testing!
Closing -- fix was supplying a valid value to `--consCores`
I am running Cactus with 27 fungal species, and I can't get it to complete. It keeps stopping after giving a message about disk usage and then saying it has jobs waiting to run. I'm on a slurm cluster, and there aren't any more jobs queued to run. No files have been modified in the working directory after the successful run of paffy and the disk usage message.
I did run RepeatMasker on my input. This is the command I used for cactus:
```
cactus --batchSystem slurm --consCores -1 ./${WORKING_DIR} ${input_file} ${output_file}
```
Any ideas for why it's getting stuck at this point? Is the disk usage message a warning or an error, and is it causing the failure?
For context, it is now 12:54 in my timezone, so it has been about an hour since I got the "used more disk than requested" message.
The last few lines of my output:
Update:
I realized that `--consCores -1` might not be a good value, so I changed it to the number of cores on the nodes I'm using. I tried restarting the job and running it in a new location with an increase to `--defaultDisk` as well. When I look a little ways back in the output, I see an error mentioning `--cpus-per-task`. That isn't a valid argument for cactus, so I don't know how to change it, as it seems to apply to the jobs that cactus spawns.