variant2 pipeline (via slurm) failing during align_prep

bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

https://bcbio-nextgen.readthedocs.io

MIT License

variant2 pipeline (via slurm) failing during align_prep #2534

Closed (jimmybgammyknee closed this issue 5 years ago)

jimmybgammyknee commented 6 years ago

Hi Brad,

Sorry if this is a fairly basic error, but I can't seem to get my exome pipeline to work on our local cluster. I'm using the slurm scheduling system and requesting more resources than the process needs within a slurm script, just to make sure I have enough to run (32 CPUs/125G mem, running -n 16).

Is this issue related to the way I'm submitting the job via slurm?

Output error is attached: glaucoma_exomes.log

Grateful for any help, Jimmy

chapmanb commented 6 years ago

Jimmy; Thanks for the detailed report and apologies about the issue. The error is that bcbio is having trouble creating the temporary directory it uses for transactional files. The specific place it is dying indicates it's having trouble accessing the current work directory. Is it possible the SLURM node that job got launched on doesn't have access to the directory the job is running in? If it's not on the shared filesystem, or the shared filesystem is unevenly mounted across worker nodes, that could cause this issue. Hope that helps with debugging.
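
One way to test the work directory access described above is a small standalone probe. This is a minimal sketch and not part of bcbio: the function name `check_work_dir` and the `tx_check_` prefix are made up for illustration. It only tries to create, write to, and remove a temporary directory inside the work directory, then reports the hostname so results from different nodes can be compared.

```python
import os
import socket
import tempfile


def check_work_dir(work_dir=None):
    """Try to create and remove a temporary directory inside work_dir.

    Hypothetical helper, not a bcbio function.
    Returns a tuple: (hostname, work_dir, ok, message).
    """
    host = socket.gethostname()
    try:
        # os.getcwd() itself raises OSError if the current directory
        # is not reachable from this node.
        work_dir = os.path.abspath(work_dir or os.getcwd())
        tmp = tempfile.mkdtemp(prefix="tx_check_", dir=work_dir)
        probe = os.path.join(tmp, "probe.txt")
        with open(probe, "w") as handle:
            handle.write("ok\n")
        # Clean up so the probe leaves nothing behind.
        os.remove(probe)
        os.rmdir(tmp)
        return host, work_dir, True, "writable"
    except OSError as exc:
        return host, work_dir, False, str(exc)


if __name__ == "__main__":
    print(check_work_dir())
```

Running this from the work directory through the same SLURM submission path as the pipeline, once per worker node, should show whether any node fails to see or write to the directory.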

jimmybgammyknee commented 6 years ago

Thanks Brad. Definitely no issues with that work directory and access. Is it possible that I'm overloading the number of concurrent jobs, which creates issues with writing to the current working directory? I've reduced the number of jobs to -n 8 and it seems to be working fine (albeit slower).

chapmanb commented 6 years ago

Jimmy; That's definitely unexpected, but might be due to stress on your shared filesystem. The only difference when using more cores is that there are more processes concurrently reading and writing on that filesystem. Is it possible this overwhelms your shared filesystem? It might be worth discussing with the cluster folks who've set up your system to see if there might be an issue under higher loads and any ideas for working around it. From the bcbio side, if your global temporary directory is unstable and you have suitable local disk, you could use that as a temporary directory:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#temporary-directory

Sorry to not be able to help more without knowing more about the cluster but hope this helps.
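
To explore whether concurrent load alone triggers the failure, a rough stress test along these lines may help when talking to the cluster admins. It is a sketch under stated assumptions, not how bcbio works internally: the names `churn` and `stress` are hypothetical, threads stand in for the concurrent processes, and the worker counts 8 and 16 simply mirror the `-n` values mentioned in this thread.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def churn(base_dir, rounds):
    """Repeatedly create, write to, and remove a temp directory.

    Hypothetical helper; returns a list of error messages (empty if all succeed).
    """
    errors = []
    for _ in range(rounds):
        try:
            tmp = tempfile.mkdtemp(prefix="load_check_", dir=base_dir)
            path = os.path.join(tmp, "data.bin")
            with open(path, "wb") as handle:
                handle.write(os.urandom(1024))
            os.remove(path)
            os.rmdir(tmp)
        except OSError as exc:
            errors.append(str(exc))
    return errors


def stress(base_dir, workers, rounds=50):
    """Run `workers` concurrent churn loops and return all collected errors."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(churn, [base_dir] * workers, [rounds] * workers)
        return [err for errs in results for err in errs]


if __name__ == "__main__":
    # Compare error counts at the two concurrency levels from the thread.
    for n in (8, 16):
        print(n, "workers:", len(stress(os.getcwd(), n)), "errors")
```

If errors appear only at the higher worker count on the shared filesystem but not when `base_dir` points at local disk, that would be consistent with the load explanation above, though it would not prove it.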

jimmybgammyknee commented 6 years ago

No problem Brad, I'll try that. Thanks again for getting back to me.