awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

SageMaker fails to register model archive in .tar.gz format for MMS running behind a multi-model endpoint #928

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi AWS team. I love multi-model-server by the way.

My team wanted to implement MMS as a backend for Multi-Model Endpoints running on SageMaker, while allowing for custom model handlers for each model we deploy on them. SageMaker Inference Toolkit is cool, but doesn't allow for this functionality - it only allows a single default handler to be used for all models on the MME, as far as I've seen. So we were hoping to use model-archiver to archive models, along with handlers (and requirements files) for each model. This works beautifully with MMS on a SageMaker notebook instance, using .mar export format - we managed to accomplish exactly what we wanted. Then we implemented the same in a docker image pushed to ECR and referenced as the multi-model inference image for the multi-model endpoints in SageMaker. I've seen one or two tutorials for how to do this, but nailing the configuration was difficult - and could use a little more explicit documentation ;)

Anyway. We got the endpoint running as a SageMaker MME, and all looked well. Initially we wanted to put the .mar archive files in S3 and use that as the model data location for the MME, but eventually discovered that SageMaker only supports .tar.gz model files.

This is where our problems began. We used model-archiver with the export format set to tgz, producing our model archives in .tar.gz format instead of .mar. We added these to the same S3 model data path, expecting that MMS would be able to identify that each .tar.gz is a model archive and not just a model artifact. Unfortunately it seems SageMaker registers each model with MMS by specifying that the .tar.gz file is the model artifact rather than the archive. We didn't set a default handler in config.properties, since we want the handler to be specified in the model archive. So when invoking the endpoint (which is what registers the model with MMS), we get an error that there is no handler defined for the model, even though there is certainly a handler defined in the .tar.gz archive.

More specifically, this is the cell that is run (referring to the MME previously created, on which MMS is running): [image]

And this is the error message received: [image]

I've read extensively through the Java HTTP frontend of MMS, and I think this can be fixed by amending the downloadModel method in ModelArchiver.java: [image]

The error is raised by the archive.validate() call at the bottom of the snippet below, because archive.getManifest().getModel().getHandler() is still null: the final if-else statement in the snippet doesn't give it a value (it seems the method's handler argument is also null when registerModel is called by SageMaker). [image]

I guess all that is needed is to extract the tarball and check whether there's a MAR-INF directory in it. If there is, the .tar.gz file should be treated as a model archive instead of a model artifact.
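To make the proposed check concrete, here is a minimal sketch in Python rather than Java. It is not the MMS frontend's code; the function name is_model_archive is hypothetical, and it assumes a model archive is identified by a MAR-INF/MANIFEST.json entry at the root of the tarball.

```python
import tarfile


def is_model_archive(tar_path):
    """Return True if the tarball has MAR-INF/MANIFEST.json at its root.

    Assumption: a root-level MAR-INF/MANIFEST.json is what distinguishes a
    model archive from a plain model artifact.
    """
    with tarfile.open(tar_path, "r:*") as tar:
        for name in tar.getnames():
            # Treat "./MAR-INF/MANIFEST.json" the same as "MAR-INF/MANIFEST.json".
            normalised = name[2:] if name.startswith("./") else name
            if normalised == "MAR-INF/MANIFEST.json":
                return True
    return False
```

A tarball whose manifest sits under a wrapping folder (for example my_package/MAR-INF/MANIFEST.json) is deliberately reported as not being an archive here, since the check only looks at the root.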

On a previous attempt, I found that SageMaker registered the model with some arbitrary name, instead of the actual model name specified on the .tar.gz archive. I saw this in the CloudWatch logs: wlm.ModelManager - Model 193ac6...987 loaded.

Also, when running MMS 'locally' on a SageMaker notebook instance, it seems that whenever I put .tar.gz model archive files in the model store, they aren't detected by MMS and that only .mar archives are? Considering the tgz option is available for export in model-archiver I'm surprised by this.

Please advise if I'm missing something obvious, or if my proposed code changes can be implemented. (And yes haha I'm sure that I included a handler in the model archive).

Many thanks in advance, and my sincere apologies for the long essay! Dean

P.S. Can TorchServe serve .mar models packaged by model-archiver? It also seems that in the TorchServe examples they archive the model into a .mar file and then tar that into a .tar.gz.

ghost commented 4 years ago

After further thought, I think first prize would actually be if the SageMaker team allowed .mar files to be referred to as the TargetModel when invoke_endpoint is called. I think this would mean no code changes are required to MMS - although there may still need to be investigation to ensure the .tar.gz archive format is supported.

ckang244 commented 4 years ago

@dean-cpi, I'm trying to do a similar thing and was poking around previous issues. According to #698 it looks like you need to run tar on the .mar file and then upload the result to S3. Curious to know if that works for you.

ghost commented 4 years ago

@ckang244, I did give that a shot - it's also the method used in the tutorial for TorchServe (which is built on MMS) - unfortunately it still doesn't seem to resolve the issue and the same problem persists. I discussed this with AWS Support and they asked to ensure the MAR-INF directory is in the root of the .tar.gz archive. Haven't tried that yet but you could give that a go if you're stuck? Without the .mar support in SageMaker, and because SageMaker Model Monitor doesn't support multi-model endpoints, we steered away from them. Which is disappointing because it has such potential.

ckang244 commented 4 years ago

@dean-cpi Thanks for your input. I had basically reached the same conclusion as you. It's strange to me that the core functionality of MMS isn't supported in SageMaker, so I hope the teams maintaining this will take note of it.

chrisella commented 3 years ago

I'm experiencing similar issues now, specifically that my endpoint returns "Model version is not defined" despite my trying every combination of tar.gz / mar file I can think of. It's infuriating that AWS provide no clear examples of this anywhere either.

Update: I just had slightly more success by manually editing the tar.gz after creation and duplicating/moving the MAR-INF folder up into the root. It then gets past the model version error, but I instead get errors starting the workers because the file layout appears incorrect.

maaquib commented 3 years ago

@chrisella Can you provide the command you ran to create the archive file?

chrisella commented 3 years ago

Hi,

So in the end I've solved this, but with some headaches along the way. It turns out that torch-model-archiver output doesn't work on AWS SageMaker with either the default (MAR) or the tgz archive format. The ONLY way I've been able to get this to work is to use the --archive-format no-archive option, which sets everything up in a directory. I then add some extra files specific to my setup and tar.gz the directory in the normal way:

    tar -zcv -C target_folder/ ./ --transform='s,^\./,,' >| "final_package.tar.gz"

The reason the tar command is more complex than normal is that on Windows I was getting it nesting or adding an unnecessary sibling dot (.) folder.

So ultimately the folder created by the archiver is:

- my_package
  |- MAR-INF
  |  |- MANIFEST.json
  |- my_handler.py
  |- requirements.txt
  |- model.pt

This structure (then tar.gz'ed) is the way I've been able to get a package uploaded to S3 even recognised and running successfully on SageMaker (under TorchServe) on a customised container based off 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-cpu-py36-ubuntu18.04 (customised just to install some extra requirements).

As a note, the reason the --archive-format tgz option doesn't seem to work is that it nests everything within a folder inside the archive, which doesn't play nicely with SageMaker.
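For anyone who would rather avoid the tar flags, the same flat layout can be sketched with Python's tarfile module. This is an illustration under assumptions, not what I actually ran: pack_flat is a hypothetical helper, src_dir stands for the folder produced by --archive-format no-archive, and out_path must live outside src_dir so the archive doesn't include itself.

```python
import os
import tarfile


def pack_flat(src_dir, out_path):
    """Write the contents of src_dir to a .tar.gz with no wrapping parent folder."""
    with tarfile.open(out_path, "w:gz") as tar:
        for entry in sorted(os.listdir(src_dir)):
            # arcname is relative to src_dir, so MAR-INF/, the handler and the
            # model file all land at the archive root (no "./" or "my_package/").
            tar.add(os.path.join(src_dir, entry), arcname=entry)
    return out_path
```

Calling pack_flat("my_package", "final_package.tar.gz") from the parent directory should give an archive whose entries start with MAR-INF/, my_handler.py and so on, rather than my_package/.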

n0thing233 commented 3 years ago

@chrisella Was your use case single-model or multi-model? I.e., did you use a SageMaker multi-model endpoint or not?

I'm facing exactly the same issue as @dean-cpi. What is really challenging is that you don't know how SageMaker makes calls to MMS (with what parameters, some of them probably hard-coded); you can only infer SageMaker's behavior from the error messages. But I like MMS a lot, so my plan now is to customize MMS to make it compatible with SageMaker multi-model endpoints.

n0thing233 commented 3 years ago

Just want to give an update in case it helps other people. I made it work by:

  1. Setting model_store to "/".
  2. Setting preload_model to false.
  3. Being careful with the model archiver: set archive-format to no-archive and compress to .tar.gz yourself. When decompressed, the archive should yield your model files directly, not a parent directory containing them.
  4. Setting default_workers_per_model to 1.
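As a rough sketch, the three config settings from the list above could be rendered into config.properties lines like this. The helper render_config and the MME_SETTINGS dict are hypothetical names for illustration, and any other keys a real deployment needs are left out.

```python
def render_config(settings):
    """Render a dict as key=value lines in Java .properties style."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Values taken from the list above; this is not a complete config.properties.
MME_SETTINGS = {
    "model_store": "/",
    "preload_model": "false",
    "default_workers_per_model": "1",
}

config_text = render_config(MME_SETTINGS)
```

The resulting config_text can then be written to wherever the container expects its config.properties file.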