OOM errors creating an endpoint for LLMs

Describe the bug

It seems like the model serving endpoints don't utilize the NVME drives effectively. When I try to serve a 13B parameter LLM (my model.tar.gz is ~42GB on s3) I get errors that the disk is out of space. The endpoint fails to create.

I think the root of the issue is that the endpoint is trying to put too much stuff into the / disk volume instead of using the NVME which is located on /tmp.

Screenshots or logs

Here's all my log events for the endpoint startup failure.

``` ERROR - Failed to save the model-archive to model-path "/.sagemaker/mms/models". Check the file permissions and retry. ``` ``` Traceback (most recent call last): File "/opt/conda/bin/model-archiver", line 8, in sys.exit(generate_model_archive()) File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging.py", line 63, in generate_model_archive package_model(args, manifest=manifest) File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging.py", line 44, in package_model ModelExportUtils.archive(export_file_path, model_name, model_path, files_to_exclude, manifest, File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging_utils.py", line 262, in archive ModelExportUtils.archive_dir(model_path, mar_path, File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging_utils.py", line 308, in archive_dir shutil.copy(file_path, dst_dir) File "/opt/conda/lib/python3.8/shutil.py", line 418, in copy copyfile(src, dst, follow_symlinks=follow_symlinks) File "/opt/conda/lib/python3.8/shutil.py", line 275, in copyfile _fastcopy_sendfile(fsrc, fdst) File "/opt/conda/lib/python3.8/shutil.py", line 166, in _fastcopy_sendfile raise err from None File "/opt/conda/lib/python3.8/shutil.py", line 152, in _fastcopy_sendfile sent = os.sendfile(outfd, infd, offset, blocksize) ``` ``` OSError: [Errno 28] No space left on device: '/opt/ml/model/pytorch_model-00003-of-00005.bin' -> '/.sagemaker/mms/models/model/pytorch_model-00003-of-00005.bin' ``` ``` Traceback (most recent call last): File "/usr/local/bin/dockerd-entrypoint.py", line 23, in serving.main() File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/serving.py", line 34, in main _start_mms() File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f return Retrying(*dargs, **dkw).call(f, *args, **kw) File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 212, in call raise attempt.get() File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 247, in get six.reraise(self.value[0], self.value[1], self.value[2]) File "/opt/conda/lib/python3.8/site-packages/six.py", line 719, in reraise raise value File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 200, in call attempt = Attempt(fn(*args, **kwargs), attempt_number, False) File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/serving.py", line 30, in _start_mms mms_model_server.start_model_server(handler_service=HANDLER_SERVICE) File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/mms_model_server.py", line 85, in start_model_server _adapt_to_mms_format(handler_service, model_dir) File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/mms_model_server.py", line 138, in _adapt_to_mms_format subprocess.check_call(model_archiver_cmd) File "/opt/conda/lib/python3.8/subprocess.py", line 364, in check_call raise CalledProcessError(retcode, cmd) ``` ``` subprocess.CalledProcessError: Command '['model-archiver', '--model-name', 'model', '--handler', 'sagemaker_huggingface_inference_toolkit.handler_service', '--model-path', '/opt/ml/model', '--export-path', '/.sagemaker/mms/models', '--archive-format', 'no-archive', '--f']' returned non-zero exit status 1. ```

I also injected a df -kh call to see what the disk utilization was and got:

Filesystem      Size  Used Avail Use% Mounted on
overlay          52G   27G   26G  52% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              30G   20K   30G   1% /dev/shm
/dev/nvme1n1    550G  948M  521G   1% /tmp
/dev/nvme0n1p1   52G   27G   26G  52% /etc/hosts
tmpfs            32G   12K   32G   1% /proc/driver/nvidia
devtmpfs         32G     0   32G   0% /dev/nvidia0
tmpfs            32G     0   32G   0% /proc/acpi
tmpfs            32G     0   32G   0% /sys/firmware

So storing things at /.sagemaker/... or at /opt/ml/... are both going to fail. It needs to be on the nvme at /tmp

System information Specifics of my requirements.txt, inference.py, and invocation code in the details.

requirements.txt in model.tar.gz/code ``` accelerate==0.16.0 transformers==4.26.0 bitsandbytes==0.37.0 ``` inference.py in model.tar.gz/code ```python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer import torch def model_fn(model_dir): model = AutoModelForSeq2SeqLM.from_pretrained( model_dir, device_map="auto", load_in_8bit=True, cache_dir="/tmp/model_cache/", ) tokenizer = AutoTokenizer.from_pretrained( model_dir, cache_dir="/tmp/model_cache/", ) return model, tokenizer def predict_fn(data, model_and_tokenizer): model, tokenizer = model_and_tokenizer inputs = data.pop("inputs", data) parameters = data.pop("parameters", None) input_ids = tokenizer(inputs, return_tensors="pt").input_ids if parameters is not None: outputs = model.generate(input_ids, **parameters) else: outputs = model.generate(input_ids) prediction = tokenizer.decode(outputs[0], skip_special_tokens=True) return [{"generated_text": prediction}] ``` My invocation code: ```python import boto3 from sagemaker.huggingface import HuggingFaceModel import sagemaker default_bucket = "my_bucket" boto_session = boto3.Session(profile_name="sagemaker", region_name="us-west-2") sagemaker_session = sagemaker.Session(boto_session=boto_session, default_bucket=default_bucket) huggingface_model = HuggingFaceModel( model_data=f"s3://{default_bucket}/sagemaker/google/flan-t5-xxl/model.tar.gz", role="arn:aws:iam::my_role", env={'HF_TASK':'text-generation'}, sagemaker_session=sagemaker_session, transformers_version="4.17", pytorch_version="1.10", py_version="py38", ) predictor = huggingface_model.deploy( initial_instance_count=1, instance_type="ml.g5.4xlarge", model_data_download_timeout=300, container_startup_health_check_timeout=600, ) data = {"inputs": "What color is a banana?"} predictor.predict(data) ```

Additional details

I've also tried to alter the SAGEMAKER_BASE_DIR env variable to be in /tmp but it just gives an error about an model-dir directory.

aws / sagemaker-inference-toolkit