coiled / feedback

A place to provide Coiled feedback

GPU tasks failed with no space left on device #66

Closed: mrocklin closed this 4 years ago

mrocklin commented 4 years ago
CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolverMg.so.10.2.0.243: no space left on device

https://us-east-2.console.aws.amazon.com/ecs/home?region=us-east-2#/clusters/_dask_dev/tasks/5147575aff4e42cfaef46c2203311c68/details

cc @necaris

I can trigger this with:

import coiled
coiled.Cluster(configuration="mrocklin/pytorch-optuna")

necaris commented 4 years ago

Based on https://forums.aws.amazon.com/thread.jspa?messageID=927200 and an experiment I ran with necaris/sizetest, which is 4.17GB (versus 5.17GB for mrocklin/pytorch-optuna), my guess is that we're hitting the 10GB Docker layer storage limit for Fargate 1.3.0 tasks.

Our fix for this (using Fargate 1.4.0, which allows 20GB of space) is already on sandbox and will be on beta shortly so we can retest. I think this does increase the priority of our wanting to run schedulers on EC2, though.
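
For reference, here is a minimal sketch of how a Fargate platform version can be pinned when launching an ECS task. It is not Coiled's actual deployment code. It only builds the keyword arguments that boto3's ECS `run_task` call accepts, so it runs without AWS access. The helper name `fargate_run_task_kwargs`, the task definition name, and the subnet ID are hypothetical placeholders. The cluster name and region are taken from the console URL above.

```python
def fargate_run_task_kwargs(cluster, task_definition, subnets, platform_version="1.4.0"):
    """Build keyword arguments for boto3's ECS run_task (hypothetical helper).

    Pinning platformVersion to 1.4.0 selects the Fargate platform version
    that provides 20GB of storage instead of the 10GB layer storage of 1.3.0.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "platformVersion": platform_version,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "ENABLED",
            }
        },
    }


# Placeholder task definition and subnet; substitute real values before use.
kwargs = fargate_run_task_kwargs("_dask_dev", "my-task-def", ["subnet-00000000"])

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("ecs", region_name="us-east-2").run_task(**kwargs)
```

Keeping the argument construction separate from the AWS call makes it easy to check the pinned version before anything is launched.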

necaris commented 4 years ago

@mrocklin FWIW I've just tested this on beta after deploying and :crossed_fingers: it seems to be launching fine.