I can't get onto the worker; I get permission denied when I try to ssh, either via cgcloud or directly:
cgcloud ssh --namespace /jeltje/ toil-worker
But I think I know what the issue is, so I'll go ahead and shut down the node. With caching, every job needs an explicit disk requirement, whereas previously whole nodes were assigned to jobs via core allocation alone. I'll fix this in 2.0.10.
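For reference, a minimal sketch of what an explicit per-job disk requirement looks like in a Toil workflow; the function name and resource values here are illustrative placeholders, not what toil-rnaseq actually uses:

```python
from toil.job import Job

def process_sample(job):
    # With caching, the disk requirement set below is what tells Toil how
    # much local scratch/cache space this job may consume on a worker.
    work_dir = job.fileStore.getLocalTempDir()
    # ... stage inputs into work_dir and run the tool here ...

def main():
    options = Job.Runner.getDefaultOptions("./jobstore")
    # An explicit disk requirement (alongside cores and memory), instead of
    # relying on core allocation alone to claim a whole node for the job.
    root = Job.wrapJobFn(process_sample, disk="50G", cores=4, memory="8G")
    Job.Runner.startToil(root, options)

if __name__ == "__main__":
    main()
```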
Thanks! What version was 'previously'? This production run is nearing its deadline, so if I can finish it on an earlier version of toil/toil-scripts, that would be great. Unless you plan on pushing out 2.0.10 today?
And sorry about the missing access; I didn't start the cluster with --@@ developers set.
What version was 'previously'?
A version of Toil that doesn't have caching; I'm not sure which version that is. I'm pushing out 2.0.10 now though.
Previous known good I'm pretty sure was 2.0.8. Once 2.0.10 has been run in production and there is a 2.0.10-based Docker tested (I'll be happy to do this), we should standardize on it for production runs going forward so we can shift energy toward other areas.
Previous known good I'm pretty sure was 2.0.8
For RNA-seq there is no difference between 2.0.8 and 2.0.9.
2.0.10 based docker tested
I'll talk to @JakeNarkizian and @alex-hancock about updating the docker pipelines to 2.0.10 today.
In my attempt to reproduce https://github.com/BD2KGenomics/toil-scripts/issues/438 I ran 29 samples on seven c3.8xlarge nodes.
Command:
toil-rnaseq run aws:us-west-2:jeltje-rnaseq-ckcc77 --batchSystem=mesos --mesosMaster mesos-master:5050 --config /home/mesosbox/shared/srconfig.txt --manifest /home/mesosbox/shared/srmanifest.txt
After a few hours I started getting "no space left on device" errors from tar commands, as well as errors like this:
StorageDataError: BotoClientError: Out of space for destination file /var/lib/toil/toil-bb25105a-3ae0-4357-a469-186974f25874/cache-bb25105a-3ae0-4357-a469-186974f25874/.NjYwM2MyNmQtZTU0Ny00MDQyLTgyZTAtODUxNjFkZTM1MGFm
And indeed, all my workers looked like this:
583G /mnt/ephemeral/var/lib/toil/toil-bb25105a-3ae0-4357-a469-186974f25874/cache-bb25105a-3ae0-4357-a469-186974f25874
It seems that the cached files take up so much of the /mnt/ephemeral drive that there's not enough space left for output. It took another 5 hours for the workflow to terminate (with 141 failed jobs). After this failure, the caches on the workers were not cleared.
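A minimal sketch, assuming the cache location shown in the du output above, of how one could compare the size of the leftover cache against the remaining space on the ephemeral drive (the workflow ID is just the one from this run; adjust as needed):

```python
import os
import shutil

EPHEMERAL_MOUNT = "/mnt/ephemeral"
CACHE_DIR = ("/mnt/ephemeral/var/lib/toil/"
             "toil-bb25105a-3ae0-4357-a469-186974f25874/"
             "cache-bb25105a-3ae0-4357-a469-186974f25874")

def dir_size(path):
    """Total size in bytes of all files under path (roughly `du -s`)."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

usage = shutil.disk_usage(EPHEMERAL_MOUNT)
cache = dir_size(CACHE_DIR)
print("cache: %.1f GiB, free on %s: %.1f GiB"
      % (cache / 2**30, EPHEMERAL_MOUNT, usage.free / 2**30))
```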
Only two of the samples finished, the rest failed.
I will leave one of the workers running in case you need more info.
srconfig.txt srmanifest.txt (large) sr_log.txt