bentoml / aws-ec2-deploy

Fast model deployment on AWS EC2

Default volume size of 8 GiB not enough for GPU-enabled images #15

Closed: reuben closed this issue 2 years ago

reuben commented 2 years ago

Example error during docker pull of the built BentoML image from ECR:

$ docker pull foo/bar
baz: Pulling from quux
852e50cd189d: Pull complete
a6236801494d: Pull complete
679c171d6942: Pull complete
92e96ffce2e7: Pull complete
8cf573657c13: Pull complete
23ae19020a76: Pull complete
60e4f651dd51: Pull complete
04a62a11f127: Pull complete
5abcca069c9d: Pull complete
4e74c2e851e0: Pull complete
9cc2243a703a: Pull complete
894729620c39: Pull complete
7174c255b26a: Pull complete
9aa8cf1c5589: Pull complete
2fd8dfe28882: Pull complete
f8ad6a602c29: Pull complete
555ec884f6d1: Pull complete
709ce7478d94: Extracting [==================================================>]  552.4MB/552.4MB
aa46983e0b44: Download complete
bb766d054302: Download complete
791881576c0f: Download complete
013fbf704190: Download complete
failed to register layer: Error processing tar file(exit status 1): write /opt/conda/lib/python3.7/site-packages/tensorboard_data_server/bin/server: no space left on device
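
For context, the failure above is the Docker daemon running out of space on the instance's root volume while extracting layers. A rough diagnostic sketch like the one below (generic Docker and coreutils commands, run on the EC2 instance, not part of this repository's tooling) can confirm that the stock 8 GiB volume is what's filling up:

# Free space on the root filesystem; 8 GiB fills up quickly once a
# multi-gigabyte GPU image starts extracting.
$ df -h /

# How much of that space Docker itself is using for images, layers, and volumes.
$ docker system df

# Optionally reclaim space from unused images before retrying the pull.
$ docker image prune -a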
jjmachan commented 2 years ago

Thanks for opening the issue and pointing this out. I'm looking into this. Just curious, what branch are you using for your deployments?

reuben commented 2 years ago

I'm using v0.13.1.

One note that might be related (and my own fault): after enabling GPU inference I switched to an Amazon Linux 2 "Deep Learning Base" AMI (ami-0ca2d09ec9ba076b3 on us-east-2) because I wanted to keep extra dependencies to a minimum. Maybe that was the reason? I couldn't find any specific guidance in the BentoML docs. For now I've switched to one of our own AMIs, which worked around the problem. Thanks for taking a look!
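
For what it's worth, the usable root volume size comes from the AMI's block device mapping, so it can be checked up front and overridden at launch. The commands below are a hedged sketch against the AMI ID mentioned above; the device name (/dev/xvda), instance type, and 60 GiB size are illustrative assumptions and should be matched to the AMI actually in use:

# Inspect the AMI's default root volume size.
$ aws ec2 describe-images --region us-east-2 --image-ids ami-0ca2d09ec9ba076b3 \
    --query 'Images[0].BlockDeviceMappings'

# Launch with a larger root volume instead of accepting the AMI default
# (the device name must match the AMI's root device).
$ aws ec2 run-instances --region us-east-2 --image-id ami-0ca2d09ec9ba076b3 \
    --instance-type g4dn.xlarge \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=60,VolumeType=gp3}'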

reuben commented 2 years ago

I haven't hit this again, so I think it was just me being too clever with the AMIs.