I can't get onto the worker; I get permission denied when I try to ssh, either via cgcloud or directly:
cgcloud ssh --namespace /jeltje/ toil-worker
But I think I know what the issue is, so I'll go ahead and shut down the node. With caching, every job needs an explicit disk requirement, whereas previously whole nodes were assigned to jobs via core allocation alone. I'll fix this in 2.0.10.
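For reference, a minimal sketch of what an explicit per-job disk requirement looks like in a Toil workflow; the function name and resource values here are illustrative placeholders, not what toil-rnaseq actually uses:

```python
from toil.job import Job

def process_sample(job):
    # With caching, the disk requirement set below is what tells Toil how
    # much local scratch/cache space this job may consume on a worker.
    work_dir = job.fileStore.getLocalTempDir()
    # ... stage inputs into work_dir and run the tool here ...

def main():
    options = Job.Runner.getDefaultOptions("./jobstore")
    # An explicit disk requirement (alongside cores and memory), instead of
    # relying on core allocation alone to claim a whole node for the job.
    root = Job.wrapJobFn(process_sample, disk="50G", cores=4, memory="8G")
    Job.Runner.startToil(root, options)

if __name__ == "__main__":
    main()
```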
Thanks! What version was 'previously'? This production run is nearing its deadline, so if I can finish it on an earlier version of toil/toil-scripts, that would be great. Unless you plan on pushing out 2.0.10 today?
And sorry about the missing access; I didn't start the cluster with --@@ developers set.
What version was 'previously'?
A version of Toil that doesn't have caching; I'm not sure which version that is. I'm pushing out 2.0.10 now though.
Previous known good I'm pretty sure was 2.0.8. Once 2.0.10 has been run in production and there is a 2.0.10-based Docker tested (I'll be happy to do this), we should standardize on it for production runs going forward so we can shift energy toward other areas.
Previous known good I'm pretty sure was 2.0.8
For RNA-seq there is no difference between 2.0.8 and 2.0.9.
2.0.10 based docker tested
I'll talk to @JakeNarkizian and @alex-hancock about updating the docker pipelines to 2.0.10 today.
In my attempt to reproduce https://github.com/BD2KGenomics/toil-scripts/issues/438 I ran 29 samples on seven c3.8xlarge nodes.
Command:
toil-rnaseq run aws:us-west-2:jeltje-rnaseq-ckcc77 --batchSystem=mesos --mesosMaster mesos-master:5050 --config /home/mesosbox/shared/srconfig.txt --manifest /home/mesosbox/shared/srmanifest.txt
After a few hours I started getting "no space left on device" errors from tar commands, as well as errors like this:
StorageDataError: BotoClientError: Out of space for destination file /var/lib/toil/toil-bb25105a-3ae0-4357-a469-186974f25874/cache-bb25105a-3ae0-4357-a469-186974f25874/.NjYwM2MyNmQtZTU0Ny00MDQyLTgyZTAtODUxNjFkZTM1MGFm
And indeed, all my workers looked like this:
583G /mnt/ephemeral/var/lib/toil/toil-bb25105a-3ae0-4357-a469-186974f25874/cache-bb25105a-3ae0-4357-a469-186974f25874
It seems that the cached files take up so much of the /mnt/ephemeral drive that there's not enough space left for output. It took another 5 hours for the workflow to terminate (with 141 failed jobs). After this failure, the caches on the workers were not cleared.
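A minimal sketch, assuming the cache location shown in the du output above, of how one could compare the size of the leftover cache against the remaining space on the ephemeral drive (the workflow ID is just the one from this run; adjust as needed):

```python
import os
import shutil

EPHEMERAL_MOUNT = "/mnt/ephemeral"
CACHE_DIR = ("/mnt/ephemeral/var/lib/toil/"
             "toil-bb25105a-3ae0-4357-a469-186974f25874/"
             "cache-bb25105a-3ae0-4357-a469-186974f25874")

def dir_size(path):
    """Total size in bytes of all files under path (roughly `du -s`)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

usage = shutil.disk_usage(EPHEMERAL_MOUNT)
cache = dir_size(CACHE_DIR)
print("cache: %.1f GiB, free on %s: %.1f GiB"
      % (cache / 2**30, EPHEMERAL_MOUNT, usage.free / 2**30))
```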
Only two of the samples finished, the rest failed.
I will leave one of the workers running in case you need more info.
srconfig.txt srmanifest.txt (large) sr_log.txt