aws-samples / amazon-sagemaker-local-mode

Amazon SageMaker Local Mode Examples
MIT No Attribution

No space left on device #14

Closed BS-98 closed 2 years ago

BS-98 commented 2 years ago

Hi,

I am trying to run Batch Transform with a YOLOv5 model in local mode on a t2.large EC2 instance. My code:

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.local import LocalSession

def main(model_path, input_data, output_path, instance_type):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    yolov5_model = PyTorchModel(model_data=model_path,
                               framework_version='1.7',
                               role='my_role',
                               source_dir='.',
                               entry_point='aws-inference-yolov5.py',
                               py_version='py3',
                               sagemaker_session=sagemaker_session,
                               dependencies=['./data', './models', './utils'])

    yolov5_transformer = yolov5_model.transformer(instance_count=1, instance_type=instance_type, max_concurrent_transforms=1, max_payload=1, output_path=output_path)
    yolov5_transformer.transform(data=input_data, content_type='application/x-image')

if __name__ == "__main__":
    model_path = 'file:///home/ubuntu/models/yolov5.tar.gz'
    input_data_path = "file:///home/ubuntu/data/imgs/"
    output_res_path = "file:///home/ubuntu/data/output/"
    instance_type = 'local'

    main(model_path, input_data_path, output_res_path, instance_type)

And I get an error:

WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/ubuntu/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

failed to register layer: Error processing tar file(exit status 1): mkdir /usr/lib/x86_64-linux-gnu/perl/5.26.1/auto/Tie: no space left on device
Traceback (most recent call last):
  File "aws-yolov5.py", line 42, in <module>
    main(model_path, input_data_path, output_res_path, instance_type)
  File "aws-yolov5.py", line 33, in main
    yolov5_transformer.transform(data=input_data, content_type='application/x-image')
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/transformer.py", line 210, in transform
    self.latest_transform_job = _TransformJob.start_new(
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/transformer.py", line 373, in start_new
    transformer.sagemaker_session.transform(**transform_args)
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/session.py", line 2558, in transform
    self.sagemaker_client.create_transform_job(**transform_request)
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 241, in create_transform_job
    transform_job.start(TransformInput, TransformOutput, TransformResources, **kwargs)
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/local/entities.py", line 308, in start
    self.container.serve(self.primary_container["ModelDataUrl"], environment)
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/local/image.py", line 289, in serve
    _pull_image(self.image)
  File "/home/ubuntu/PoC/venvs/sm-local/lib/python3.8/site-packages/sagemaker/local/image.py", line 1098, in _pull_image
    subprocess.check_output(pull_image_command.split())
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'pull', '763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:1.7-cpu-py3']' returned non-zero exit status 1.

The message "failed to register layer: Error processing tar file(exit status 1): mkdir /usr/lib/x86_64-linux-gnu/perl/5.26.1/auto/Tie: no space left on device" seems quite strange to me, because I do have free space on my disk:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  5.3G  2.5G  68% /
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           795M  848K  795M   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/loop0       27M   27M     0 100% /snap/amazon-ssm-agent/5163
/dev/loop1       56M   56M     0 100% /snap/core18/2253
/dev/loop2       68M   68M     0 100% /snap/lxd/21835
/dev/loop3       68M   68M     0 100% /snap/lxd/22753
/dev/loop4       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop5       56M   56M     0 100% /snap/core18/2344
/dev/loop6       44M   44M     0 100% /snap/snapd/15177
/dev/loop7       62M   62M     0 100% /snap/core20/1242
/dev/loop8       45M   45M     0 100% /snap/snapd/15534
/dev/loop9       62M   62M     0 100% /snap/core20/1405
tmpfs           795M     0  795M   0% /run/user/1000

Does someone have any idea how to solve this problem? I would be grateful for any help.

eitansela commented 2 years ago

Hi,

When you install Docker on your machine, it doesn't necessarily get all of the available disk space for the Docker installation, and older images, which can be big, might still be taking up space.
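To check this before pulling an image, a minimal sketch like the one below reports the free space on the filesystem that holds Docker's data. It assumes the default Linux data directory /var/lib/docker (yours may differ; docker info reports the actual root directory), and the helper names free_gib and has_room are made up for illustration.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def free_gib(path="/var/lib/docker"):
    """Return the free space, in GiB, on the filesystem that holds `path`.

    Assumption: /var/lib/docker is Docker's data directory (the Linux default).
    """
    p = Path(path)
    # If the directory does not exist, fall back to its nearest existing parent.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / GIB


def has_room(required_gib, path="/var/lib/docker"):
    """Return True if at least `required_gib` GiB are free for Docker data."""
    return free_gib(path) >= required_gib
```

For example, has_room(10) returns False on a machine like the one above, where df reports only 2.5G available on /. The threshold to use depends on the extracted size of the image being pulled, which is not stated in this thread.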

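Maybe try running docker system prune to delete stopped containers, dangling images, and other unused data (add -a to also delete all images that are not used by any container).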
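If you prefer to do the cleanup from the same Python script, it could look roughly like this sketch. It only wraps the docker system df and docker system prune CLI commands; the function names prune_command and reclaim_docker_space are hypothetical, and --force is used to skip the interactive confirmation.

```python
import shutil
import subprocess


def prune_command(all_images=False):
    """Build the `docker system prune` command line as a list of arguments."""
    cmd = ["docker", "system", "prune", "--force"]  # --force skips the prompt
    if all_images:
        # --all also removes images not used by any container, not only dangling ones.
        cmd.append("--all")
    return cmd


def reclaim_docker_space(all_images=False):
    """Print Docker's disk usage, then prune unused data."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found on PATH")
    # Show how much space images, containers, volumes and build cache use.
    subprocess.run(["docker", "system", "df"], check=True)
    subprocess.run(prune_command(all_images), check=True)
```

Calling reclaim_docker_space(all_images=True) before yolov5_transformer.transform(...) would free the most space, at the cost of having to pull every image again later.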

Note that Batch Transform will copy all the input images into the Docker container, perform inference, and then copy the results to S3. So for debugging with a few images, that makes sense. If you have a large amount of image data (e.g. 100 GB or more), then it is better to use SageMaker in the cloud and not local mode.
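For debugging with only a few images, one option is to copy a small subset into a separate folder and point the transform job at that folder. The sketch below is one way to do it; the helper name make_debug_subset, the list of image suffixes, and the default limit are assumptions for illustration.

```python
import shutil
from pathlib import Path

# Assumption: the input images use one of these file extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def make_debug_subset(src_dir, dst_dir, limit=5):
    """Copy the first `limit` images (sorted by name) from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    images = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    copied = []
    for image in images[:limit]:
        # copy2 keeps file metadata and returns the destination path.
        copied.append(Path(shutil.copy2(image, dst / image.name)))
    return copied
```

With the paths from the question, make_debug_subset("/home/ubuntu/data/imgs", "/home/ubuntu/data/imgs_debug") could be followed by a transform call with data="file:///home/ubuntu/data/imgs_debug/" (the imgs_debug folder name is hypothetical).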