awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

`com.amazonaws.ml.mms.metrics.MetricCollector - java.io.IOException: Broken pipe` and `error while loading shared libraries: libpython3.7m.so.1.0` #992

Open llorenzo-matterport opened 2 years ago

llorenzo-matterport commented 2 years ago

Hi there!

We're encountering an issue with MMS when deploying MXNet models. We initially thought it was related to the way we're packaging the model, but after some digging, it seems to be related to MMS with MXNet in CPU mode.

The errors we're seeing come from the metrics collector throwing exceptions, on hosts both with and without GPU devices. Steps to reproduce:

1. `docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-cpu-py37-ubuntu16.04`
2. `docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb` (substitute your image ID)
3. `multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar`

And:

% docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb
root@eb4f03280c9c:/# multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
root@eb4f03280c9c:/# 2022-02-04T22:35:40,112 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /usr/local/lib/python3.7/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 1547 M
Python executable: /usr/local/bin/python3.7
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: N/A
Initial Models: squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-02-04T22:35:40,125 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar  preload_model: false
2022-02-04T22:35:41,145 [WARN ] main com.amazonaws.ml.mms.ModelServer - Failed to load model: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
com.amazonaws.ml.mms.archive.DownloadModelException: Failed to download model from: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar , code: 403
    at com.amazonaws.ml.mms.archive.ModelArchive.download(ModelArchive.java:156) ~[model-server.jar:?]
    at com.amazonaws.ml.mms.archive.ModelArchive.downloadModel(ModelArchive.java:72) ~[model-server.jar:?]
    at com.amazonaws.ml.mms.wlm.ModelManager.registerModel(ModelManager.java:99) ~[model-server.jar:?]
    at com.amazonaws.ml.mms.ModelServer.initModelStore(ModelServer.java:212) [model-server.jar:?]
    at com.amazonaws.ml.mms.ModelServer.start(ModelServer.java:315) [model-server.jar:?]
    at com.amazonaws.ml.mms.ModelServer.startAndWait(ModelServer.java:103) [model-server.jar:?]
    at com.amazonaws.ml.mms.ModelServer.main(ModelServer.java:86) [model-server.jar:?]
2022-02-04T22:35:41,160 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-02-04T22:35:41,449 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-02-04T22:35:41,451 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-02-04T22:35:41,459 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2022-02-04T22:35:41,477 [ERROR] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector -
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_292]
    at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_292]
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_292]
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_292]
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_292]
    at com.amazonaws.ml.mms.metrics.MetricCollector.run(MetricCollector.java:76) [model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_292]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

root@eb4f03280c9c:/#
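For context on the first error: the stack trace shows `MetricCollector.run` failing while flushing and closing a `FileOutputStream`, and the later log line shows it relaying stderr from `/usr/local/bin/python3.7` — which suggests the collector pipes metrics into a Python child process that dies at startup. The following is a minimal sketch (not MMS code) of the same OS-level condition, writing into the stdin of a child that has already exited:

```python
import subprocess

# Sketch: reproduce the condition behind the "Broken pipe" stack trace.
# A parent writes metrics into a child process's stdin; the child exits
# immediately at startup (a stand-in for an interpreter that cannot load
# its shared libraries), so the write side of the pipe fails.
child = subprocess.Popen(
    ["python3", "-c", "import sys; sys.exit(1)"],  # crashes right away
    stdin=subprocess.PIPE,
)
child.wait()  # child is gone; the read end of its stdin pipe is closed

try:
    child.stdin.write(b"SystemMetrics: cpu_util 12.5\n")
    child.stdin.flush()  # the actual pipe write happens here
    outcome = "write succeeded"
except BrokenPipeError:
    # The JVM surfaces the same EPIPE as java.io.IOException: Broken pipe
    outcome = "broken pipe"
print(outcome)
```

If this reading is right, the broken pipe is a symptom, not the root cause — the interesting question is why the metrics interpreter fails to start.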

#### After a while (1-2 minutes):
root@eb4f03280c9c:/# 2022-02-04T22:36:41,413 [ERROR] Thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - /usr/local/bin/python3.7: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
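This second error looks like the root cause: the metrics thread launches `/usr/local/bin/python3.7`, and the dynamic loader cannot resolve `libpython3.7m.so.1.0`, so the interpreter never starts. Inside the container, `ldd /usr/local/bin/python3.7` would show which libraries fail to resolve, and `ldconfig -p | grep libpython` whether the loader cache covers the directory holding the library. A small sketch of the same check in Python (library names here are illustrative; on the failing image the one to look for would be `python3.7m`):

```python
import ctypes
import ctypes.util

def can_load(libname: str) -> bool:
    """Return True if the dynamic loader can locate and open `libname`."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return False  # not in the loader's search path / ldconfig cache
    try:
        ctypes.CDLL(path)
        return True
    except OSError:
        return False  # found a name but the loader could not open it

# "c" (libc) should resolve on any Linux system; on the failing image
# the interesting call would be can_load("python3.7m"), which this
# report suggests fails.
print(can_load("c"))
```

If the library file exists but is outside the loader's search path, exporting `LD_LIBRARY_PATH` to include its directory (or running `ldconfig` after adding it) is the usual fix.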


Any help understanding this would be appreciated, thanks!