awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

Batching #732

Closed miguelvr closed 5 years ago

miguelvr commented 5 years ago

Hey guys,

I'm setting up a face detection service and am currently using your SSD example as a baseline.

I've noticed that you have an assert here https://github.com/awslabs/mxnet-model-server/blob/aa8f8a3473e6f06780ffd362fb8722a65affd380/examples/model_service_template/mxnet_model_service.py#L51 checking if the batch size is equal to 1

Does this mean I can not process multiple files at the same time?

More precisely, I would like to run an MTCNN service that detects all the faces in an image and then call a Face Embedding service that computes the embeddings of said faces.

Is it possible to batch the faces detected and compute the embeddings in one go?

I would like to know if it is just a matter of overriding the correct methods, or if MMS does something clever in the background.

Thanks, Miguel

radao commented 5 years ago

Following. It would be nice to have a description of how batching is handled, similar to the one for TensorFlow Serving.

frankfliu commented 5 years ago

@miguelvr MMS supports automatic batching of HTTP requests. You can use the management API to register a model with batch support: https://github.com/awslabs/mxnet-model-server/blob/master/docs/management_api.md#register-a-model

If you enable batch mode for your model, your custom service code will receive batched requests. It is your service code's responsibility to handle the batch and send the batched request to the engine. The example code does not implement batching.

The MTCNN case is different from the batching feature MMS supports. MMS doesn't restrict what you can do in your service code; you can achieve what you need by overriding the handle function.
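For illustration, registering with batching enabled might look something like this (a minimal sketch using Python's requests library; the management port, archive name, and parameter values are placeholders for your own setup):

# Minimal sketch only: register a model with batching enabled via the
# management API (default management port is 8081). The archive name,
# model name, and parameter values below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "resnet.mar",        # model archive present in the model store
        "model_name": "resnet",
        "batch_size": 8,            # maximum number of requests per batch
        "max_batch_delay": 100,     # max time (ms) to wait while filling a batch
        "initial_workers": 1,
    },
)
print(resp.status_code, resp.json())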

miguelvr commented 5 years ago

@frankfliu thanks for the quick response! That's what I suspected.

It would be nice, however, to have this feature highlighted somewhere, as it is common practice with most model servers.

radao commented 5 years ago

@miguelvr MMS supports automatic batching of HTTP requests. You can use the management API to register a model with batch support: https://github.com/awslabs/mxnet-model-server/blob/master/docs/management_api.md#register-a-model

If you enable batch mode for your model, your custom service code will receive batched requests. It is your service code's responsibility to handle the batch and send the batched request to the engine. The example code does not implement batching.

The MTCNN case is different from the batching feature MMS supports. MMS doesn't restrict what you can do in your service code; you can achieve what you need by overriding the handle function.

@frankfliu If we use the model-archiver utility to package our models, how should we set batch_size and max_batch_delay? Via the management API? If so, it seems odd that they cannot be set through the model archiver.

frankfliu commented 5 years ago

@radao In the current implementation, the management API is the only way to set the batch size. Exposing more options in the model-archiver tool is on our roadmap. This feature will come together with the new model archive format spec.

We are working on documentation for batch support and will provide example models that support batching.

lupesko commented 5 years ago

@frankfliu should we add an example showcasing how to use batched requests in your request handler code, and the performance boost it brings? @vdantu as an FYI.

miguelvr commented 5 years ago

@lupesko that would be very helpful

vdantu commented 5 years ago

I think this is definitely needed. I can look into writing an example with batching.

Thanks, Vamshi

ddavydenko commented 5 years ago

We do have a story for working on this in our team's internal backlog, will try to prioritize it for one of the upcoming sprints.

JustinhoCHN commented 5 years ago

@frankfliu hi, may I ask how to change the model batch_size after starting MMS in Docker? When MMS starts in the Docker container, the model is registered automatically. If I post a register request like:

curl -v -X POST "http://localhost:8086/models?url=resnet.mar&model_name=image_retrieval&batch_size=30"

It'll return the message "Model image_retrieval is already registered.". If I DELETE the registered model and register it again with this command:

curl -v -X POST "http://localhost:8086/models?initial_workers=1&synchronous=true&url=resnet.mar"

The model is registered, but I still can't change the batch_size, even if I PUT a request:

curl -v -X PUT "http://localhost:8086/models/resnet?batch_size=30&synchronous=true"

it'll return:

*   Trying ::1...
* Connected to localhost (::1) port 8086 (#0)
> PUT /models/image_retrieval?batch_size=30&synchronous=true HTTP/1.1
> Host: localhost:8086
> User-Agent: curl/7.49.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: 2bf616da-bab8-40d7-86d6-a3893a55c4cd
< content-length: 33
< connection: keep-alive
< 
{
  "status": "Workers scaled"
}
* Connection #0 to host localhost left intact

ddavydenko commented 5 years ago

@JustinhoCHN, could you please share the output of curl "http://localhost:8086/models/resnet" after you tried to update its batch_size with the POST request? I am trying to see what leads you to conclude that the batch size is not updated.

Also, as a side note: you can set batch_size during the register call (POST to the /models endpoint). Subsequent calls to scale workers (PUT to /models/) cannot change the batch size, so you don't need that one.

JustinhoCHN commented 5 years ago

@ddavydenko Thank you for your quick reply. Here is the output after the POST request (I posted the request right after the container restarted, not after DELETEing the registered model).

curl -v -X POST "http://localhost:8086/models?url=resnet.mar&model_name=image_retrieval&batch_size=30"
*   Trying ::1...
* Connected to localhost (::1) port 8086 (#0)
> POST /models?url=resnet.mar&model_name=image_retrieval&batch_size=30 HTTP/1.1
> Host: localhost:8086
> User-Agent: curl/7.49.0
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< content-type: application/json
< x-request-id: 921e4a3e-c882-47e1-907f-caeb1e0e525b
< content-length: 112
< 
{
  "code": 400,
  "type": "BadRequestException",
  "message": "Model image_retrieval is already registered."
}
* Connection #0 to host localhost left intact

Here's curl "http://localhost:8086/models/image_retrieval" output:

{
  "modelName": "image_retrieval",
  "modelUrl": "resnet.mar",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "workers": [
    {
      "id": "9000",
      "startTime": "2019-01-30T01:16:36.394Z",
      "status": "READY",
      "gpu": true,
      "memoryUsage": 1950052352
    }
  ]
}

If I DELETE the registered model and POST again:

curl -v -X POST "http://localhost:8086/models?initial_workers=1&batch_size=30&synchronous=true&url=resnet.mar&model_name=image_retrieval"
*   Trying ::1...
* Connected to localhost (::1) port 8086 (#0)
> POST /models?initial_workers=1&batch_size=30&synchronous=true&url=resnet.mar&model_name=image_retrieval HTTP/1.1
> Host: localhost:8086
> User-Agent: curl/7.49.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< content-type: application/json
< x-request-id: 116d61f2-f362-4006-8a2d-a106e977b2ab
< content-length: 120
< 
{
  "code": 404,
  "type": "ModelNotFoundException",
  "message": "Load failed... Deregistered model image_retrieval"
}
* Connection #0 to host localhost left intact

So when I want to register again, I can only POST curl -v -X POST "http://localhost:8086/models?initial_workers=1&synchronous=true&url=resnet.mar" without batch_size or any other arguments.

{
  "modelName": "resnet",
  "modelUrl": "resnet.mar",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "workers": [
    {
      "id": "9002",
      "startTime": "2019-01-30T02:22:44.951Z",
      "status": "READY",
      "gpu": true,
      "memoryUsage": 1854517248
    }
  ]
}
ddavydenko commented 5 years ago

Ok, I think what's happening when you DELETE the model (unregister it by sending DELETE to /models/) is that MMS actually deletes its model archive (.mar file) as well. You can confirm that by checking what is left in the /opt/ml/model folder (or whatever your model_store setting points to) right after the DELETE call.

So I think that is why your register attempt right after the DELETE call fails. To do it cleanly, I suggest you DELETE the model after the container has bounced and then register it by referring to a URL where the model archive actually exists (maybe on S3), so that MMS has a durable link it can download the model from. During this same POST call you can specify the batch_size param.

I admit this is not the best experience for MMS users, but hopefully this workaround could be sufficient for now. Please let me know if this approach works for you.

ddavydenko commented 5 years ago

After some more digging around and reproducing the issue, I can confirm this is actually a bug in MMS. The call to register a model with the batch_size param works only if no synchronous parameter is specified. So after DELETEing the model, a call like this: curl -X POST "localhost:8081/models?url=resnet50_ssd.mar&model_name=resnet50&batch_size=8&initial_workers=4" works fine, but a call like this: curl -X POST "localhost:8081/models?url=resnet50_ssd.mar&model_name=resnet50&batch_size=8&initial_workers=4&synchronous=true" fails with { "code": 404, "type": "ModelNotFoundException", "message": "Load failed... Deregistered model resnet50" }

We will prioritize this fix in the near future. In the meantime, you can use the async approach to register the model in order to change its batch size.

JustinhoCHN commented 5 years ago

@ddavydenko Great! I'll try your suggestion and will let you know my results later. You guys are so efficient! Thanks again!

JustinhoCHN commented 5 years ago

So I removed the synchronous param from the POST request and the model registered successfully, but the model does not seem to load:

{
  "modelName": "image_retrieval",
  "modelUrl": "resnet_triplet.mar",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 30,
  "maxBatchDelay": 100,
  "workers": [
    {
      "id": "9002",
      "startTime": "2019-01-30T06:08:29.942Z",
      "status": "UNLOADING",
      "gpu": true,
      "memoryUsage": 0
    }
  ]
}

Status: UNLOADING. MMS didn't start the model service and didn't use any GPU memory. How do I start the worker and make it load the model again? @ddavydenko

ddavydenko commented 5 years ago

Hm, this might be an even bigger issue than initially thought. Let us investigate on our side and comment on our findings within a day or two. Sorry for the inconvenience :(

— Thanks, Denis

miguelvr commented 5 years ago

I would also like to understand the behavior of MMS batching, as it is not clear from the brief description provided.

When setting a specific value X for batch size, does it mean the model server will only accept batch sizes of X or is it up to X, with X being an upper bound?

My goal is to have servers accepting a variable number of samples at the same time, depending on what is available to be processed, so an upper bound would be useful, whereas a fixed batch size would not.

vdantu commented 5 years ago

@JustinhoCHN and @ddavydenko : I am unable to reproduce the issue with initial workers and batch size configured. I did the following:

  1. Downloaded the squeezenet_v1.1.mar from MMS model-zoo into /tmp/model-store folder
  2. Started MMS as follows
    mxnet-model-server --model-store /tmp/model-store
  3. Ran the following register command
    curl -X POST localhost:8081/models?url=squeezenet_v1.1.mar\&batch_size=30\&max_batch_delay=40\&initial_workers=1\&synchronous=true

It seems to load fine

curl localhost:8081/models/squeezenet_v1.1

{
  "modelName": "squeezenet_v1.1",
  "modelVersion": "1.0",
  "modelUrl": "squeezenet_v1.1.mar",
  "engine": "MXNet",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 30,
  "maxBatchDelay": 40,
  "workers": [
    {
      "id": "9000",
      "startTime": "2019-01-30T07:45:28.133Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 131043328
    }
  ]
}

@JustinhoCHN : If you are seeing the model in the "UNLOADING" state, that means your model wasn't loaded properly. Look at mms_log.log in the logs folder; it will contain logs showing what the issue is. Or share your log file so that we can also take a look and try to help out.

@miguelvr : We are in the process of writing a document; it will be out shortly. But for now, let me try to answer your questions here.

When setting a specific value X for batch size, does it mean the model server will only accept batch sizes of X or is it up to X, with X being an upper bound?

Short answer: it is up to X messages. You get to choose the batch_size and the max_batch_delay for every model you register. Currently there is no way to configure these two params at startup. The max_batch_delay timer for a particular model starts when the model server receives the first inference request for that model, and it keeps getting rescheduled as long as there are inference requests in the queue. If X messages arrive before max_batch_delay expires, the timer is switched off and the backend worker is given all X messages. If only X - delta messages arrive before the max_batch_delay timeout, the backend worker is given just those X - delta messages.
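In other words, your handler should cope with any batch length from 1 up to X and return exactly one result per request. A rough sketch of that contract (decode_request and run_model_on_batch are hypothetical stand-ins, not the template's actual functions):

# Rough sketch of the batching contract only: MMS hands the service a list
# of requests (anywhere from 1 up to batch_size entries) and expects a list
# of responses of the same length back. The two helpers are hypothetical
# stand-ins for real preprocessing and model code.
def decode_request(req):
    # hypothetical: pull the raw payload out of one request
    return req.get("data") or req.get("body")

def run_model_on_batch(inputs):
    # hypothetical: run one forward pass over the whole batch
    return [{"payload_bytes": len(x) if x is not None else 0} for x in inputs]

def handle(batch, context):
    inputs = [decode_request(req) for req in batch]    # one entry per request
    outputs = run_model_on_batch(inputs)               # single forward pass
    assert len(outputs) == len(batch)                  # one response per request
    return outputs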

JustinhoCHN commented 5 years ago

@vdantu I'm afraid it didn't work; I tried again and got the same result. BTW I'm using Docker, maybe you should try the Docker version. Here's the log: @vdantu @ddavydenko mms_log.log

vdantu commented 5 years ago

I tried with the 1.0.1 CPU Docker container as well, and it works. Could you let us know which container you are using? Also, can you share your log file?

Thanks, Vamshi

JustinhoCHN commented 5 years ago

@vdantu Sorry for the late reply; it is Chinese Lunar New Year these days, happy Chinese New Year! I'm using the 1.0.0 GPU Docker container. Here's my log: mms_log.log

JustinhoCHN commented 5 years ago

Hi @ddavydenko, any progress so far? @vdantu, I tried the 1.0.1 CPU container and still hit the same problem mentioned above. Can you provide your command details? I ran the Docker container like this:

sudo docker run -itd --name mms -p 8087:8087  -p 8088:8088 -v \
/home/huzhihao/projects/image_retrieval_for_supply_chain/models/:/models \
awsdeeplearningteam/mxnet-model-server:1.0.1-mxnet-cpu \
mxnet-model-server --start \
--mms-config /models/config.properties \
--model-store=/models \
--models image_retrieval=resnet_triplet.mar

I'm using a shared volume which contains config.properties and the .mar file. If I register the model with:

curl -v -X POST localhost:8088/models?url=resnet_triplet.mar\&model_name=image_retrieval\&batch_size=30\&max_batch_delay=40\&initial_workers=1\&synchronous=true

"Load failed... Deregistered model image_retrieval" 404 error raised.

If I register the model with:

curl -v -X POST localhost:8088/models?url=resnet_triplet.mar\&model_name=image_retrieval\&batch_size=30\&max_batch_delay=40\&initial_workers=1

"Model image_retrieval is already registered."

If I delete the model and register again with:

curl -v -X POST localhost:8088/models?url=resnet_triplet.mar\&model_name=image_retrieval\&batch_size=30\&max_batch_delay=40\&initial_workers=1

logs will output:

AssertionError: Batch is not supported.
Load model failed: image_retrieval, error: Worker died.
W-9044-image_retrieval State change WORKER_STARTED -> WORKER_STOPPED
Retry worker: 9044 in 8 seconds.

vdantu commented 5 years ago

@JustinhoCHN : I went through the logs you sent. The example models in our model zoo don't support batching by default. I am in the process of writing more tutorials on how to write batching code. But in the meantime, to unblock you:

  1. You are getting this error from your custom service code, here.
  2. You have to modify the preprocess, inference, and postprocess code to handle a list of requests.

MMS sends a list of inputs to preprocess, seen here as the parameter named batch. Your model code should use this list of input requests in preprocess and pass the result to the inference method.
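As a rough illustration (assuming image inputs; the "data"/"body" field names and the 224x224 size are assumptions, not requirements), a batch-aware preprocess could look like this:

# Sketch of a batch-aware preprocess: decode every request in the batch and
# stack the images into a single NDArray of shape (len(batch), 3, 224, 224).
# The "data"/"body" keys and the 224x224 resize are assumptions.
import mxnet as mx

def preprocess(batch):
    imgs = []
    for request in batch:
        payload = request.get("data") or request.get("body")  # raw image bytes
        img = mx.image.imdecode(payload)                       # HWC, uint8
        img = mx.image.imresize(img, 224, 224)
        img = img.astype("float32").transpose((2, 0, 1))       # CHW
        imgs.append(img)
    return mx.nd.stack(*imgs)                                  # NCHW batch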

JustinhoCHN commented 5 years ago

Thank you @vdantu and the MMS team, that's what we want, it really helps! A few days ago I was stuck on how to handle a number of requests smaller than a batch; your padding method solved this problem! Thanks again!

vdantu commented 5 years ago

@JustinhoCHN : That's awesome :) . Really glad that the batching work we did helped. I could think of two options for a variable batch size:

  1. Padding with 0's and ignoring the padded results on the return path, or
  2. Re-bind the network at every inference request.

Since this logic is in the inference path, i.e., the re-bind would have to be done for every batch of requests that comes in, it can quickly become very inefficient :) . So I chose to go with the "padding with 0's" solution.
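Roughly, the padding approach could look like this (a sketch only; it assumes mod is an MXNet Module already bound to a fixed batch_size and that the requests have already been preprocessed into a single NDArray named data):

# Sketch of the "pad with zeros" idea: pad a partial batch up to the fixed
# batch_size the network was bound with, run one forward pass, then keep
# only the outputs that correspond to real requests.
import mxnet as mx

def infer_fixed_batch(mod, data, batch_size):
    n = data.shape[0]                                   # real number of requests
    if n < batch_size:
        pad_shape = (batch_size - n,) + data.shape[1:]
        data = mx.nd.concat(data, mx.nd.zeros(pad_shape), dim=0)
    mod.forward(mx.io.DataBatch([data]), is_train=False)
    out = mod.get_outputs()[0]
    return out[:n]                                      # drop the padded rows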

erandagan commented 5 years ago

@radao In the current implementation, the management API is the only way to set the batch size. Exposing more options in the model-archiver tool is on our roadmap. This feature will come together with the new model archive format spec.

We are working on documentation for batch support and will provide example models that support batching.

@frankfliu Is there a way to contribute to configuring the batch size through the model spec? I'd be happy to help.

ddavydenko commented 5 years ago

@erandagan , awesome to see your activities, keep 'em coming! :)

As for this particular question: could you please clarify what exactly you are trying to achieve? Describing it from a user perspective would probably be the best way.

erandagan commented 5 years ago

@ddavydenko Sure.

We're deploying a set of MMS servers using K8s, and it would be friendlier to operate if we could configure the batch settings directly within the model, rather than POSTing the model along with its settings after deployment.

This makes things a lot easier from an operational standpoint, as POSTing to the model server actually mutates the server's state. So, currently, if the server were to crash and K8s spawned a new server pod, we'd have to reconfigure that pod again, as it would boot into a "wrong" state. In contrast, if the batch size were set in the model archive, the server would boot directly into the desired state.

vdantu commented 5 years ago

@erandagan : That's a good feature to have as well. Could you share a short writeup of the changes you are proposing?

IMO, there is a model section in the model archive (https://github.com/awslabs/mxnet-model-server/blob/master/model-archiver/model_archiver/manifest_components/model.py). This could be a good place for "max_batch_size" and "max_batch_delay". The frontend "Model.java" already has a copy of the ModelArchive, which contains the Manifest with this Model.
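Purely as a sketch of the proposal (the field names below are hypothetical, not part of the current manifest spec), the model section might carry something like:

# Hypothetical sketch only: what the "model" section of a MANIFEST.json could
# carry if batching settings moved into the archive. Field names below are
# proposals, not the current spec.
proposed_model_section = {
    "modelName": "resnet",
    "handler": "mxnet_model_service:handle",
    "maxBatchSize": 8,       # hypothetical: upper bound on requests per batch
    "maxBatchDelay": 100,    # hypothetical: ms to wait while filling a batch
}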

erandagan commented 5 years ago

@vdantu I'll follow up on #770