Closed jimmybgammyknee closed 5 years ago
Jimmy; Thanks for the detailed report and apologies about the issue. The error is that bcbio is having trouble creating the temporary directory it uses for transactional files. The specific place it is dying indicates it's having trouble accessing the current work directory. Is it possible the SLURM node the job got launched on doesn't have access to the directory the job is running in? If it's not on the shared filesystem, or the shared filesystem is unevenly mounted across worker nodes, that could cause this issue. Hope that helps with debugging.
Thanks Brad, Definitely no issues with that work directory and access. Is it possible that I'm overloading the number of concurrent jobs, which creates issues with writing to the current working directory? I've downgraded the number of jobs to -n 8 and it seems to be working fine (albeit slower).
Jimmy; That's definitely unexpected, but might be due to stress on your shared filesystem. The only difference when using more cores is that there are more processes concurrently reading and writing on that filesystem. Is it possible this overwhelms your shared filesystem? It might be worth discussing with the cluster folks who've set up your system to see if there might be an issue under higher loads, and any ideas for working around it. From the bcbio side, if your global temporary directory is unstable and you have suitable local disk, you could use that as a temporary directory:
https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#temporary-directory
Sorry to not be able to help more without knowing more about the cluster but hope this helps.
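For reference, the temporary directory setting linked above lives in the bcbio system YAML. A minimal sketch, assuming each worker node has local scratch mounted at /scratch/local (the path is a placeholder; adjust it for your cluster):

```yaml
# bcbio_system.yaml -- point temporary/transactional files at node-local disk
resources:
  tmp:
    dir: /scratch/local
```

With this set, intermediate files go to node-local disk rather than the shared filesystem, which reduces concurrent I/O pressure when running with many cores.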
No problem Brad, I'll try that. Thanks again for getting back to me.
Hi Brad,
Sorry if this is a fairly basic error, but I can't seem to get my exome pipeline to work on our local cluster. I'm using the SLURM scheduling system, requesting more resources than needed for the process within a SLURM script, just to make sure I have enough to run (32 CPUs / 125G mem, running -n 16).
Is this issue related to the way I'm submitting the job via SLURM?
Output error is attached: glaucoma_exomes.log
Grateful for any help, Jimmy