Hydrospheredata / hydro-serving

MLOps Platform
http://docs.hydrosphere.io
Apache License 2.0

Container stops running with OOM exception when uploading a model #317

Closed archichen closed 4 years ago

archichen commented 4 years ago

The Hydro manager container stops running when I upload a model of about 123 MB. The manager reports the error java.lang.OutOfMemoryError: Java heap space

Command:

hs upload --name test --runtime hydrosphere/serving-runtime-tensorflow-1.13.1:2.1.0

Error details:

[2020-03-25 09:25:39.407][INFO][manager-akka.actor.default-dispatcher-33] i.h.s.m.a.h.c.m.ModelController.$anonfun$uploadModel$3.72 Upload request path=/tmp/payload3272338121557548618filename,2.1.0,None),None,None,None,None,Some(Map()))
Uncaught error from thread [manager-akka.actor.default-dispatcher-33]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[manager]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:121)
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:292)
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readBytes(CodedInputStream.java:939)
        at org.tensorflow.framework.TensorProto.<init>(TensorProto.java:98)
        at org.tensorflow.framework.TensorProto.<init>(TensorProto.java:13)
        at org.tensorflow.framework.TensorProto$1.parsePartialFrom(TensorProto.java:3922)
        at org.tensorflow.framework.TensorProto$1.parsePartialFrom(TensorProto.java:3917)
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923)
        at org.tensorflow.framework.AttrValue.<init>(AttrValue.java:118)
        at org.tensorflow.framework.AttrValue.<init>(AttrValue.java:15)
        at org.tensorflow.framework.AttrValue$1.parsePartialFrom(AttrValue.java:5258)
        at org.tensorflow.framework.AttrValue$1.parsePartialFrom(AttrValue.java:5253)
        at org.tensorflow.framework.AttrValue$Builder.mergeFrom(AttrValue.java:4113)
        at org.tensorflow.framework.AttrValue$Builder.mergeFrom(AttrValue.java:3906)
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:907)
        at com.google.protobuf.MapEntryLite.parseField(MapEntryLite.java:128)
        at com.google.protobuf.MapEntryLite.parseEntry(MapEntryLite.java:184)
        at com.google.protobuf.MapEntry.<init>(MapEntry.java:106)
        at com.google.protobuf.MapEntry.<init>(MapEntry.java:50)
        at com.google.protobuf.MapEntry$Metadata$1.parsePartialFrom(MapEntry.java:70)
        at com.google.protobuf.MapEntry$Metadata$1.parsePartialFrom(MapEntry.java:64)
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923)
        at org.tensorflow.framework.NodeDef.<init>(NodeDef.java:90)
        at org.tensorflow.framework.NodeDef.<init>(NodeDef.java:9)
        at org.tensorflow.framework.NodeDef$1.parsePartialFrom(NodeDef.java:1678)
        at org.tensorflow.framework.NodeDef$1.parsePartialFrom(NodeDef.java:1673)
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923)
        at org.tensorflow.framework.GraphDef.<init>(GraphDef.java:64)
        at org.tensorflow.framework.GraphDef.<init>(GraphDef.java:13)
        at org.tensorflow.framework.GraphDef$1.parsePartialFrom(GraphDef.java:1543)
        at org.tensorflow.framework.GraphDef$1.parsePartialFrom(GraphDef.java:1538)
[2020-03-25 09:25:43.213][INFO][shutdownHook1] i.h.s.m.ManagerBoot$.$anonfun$new$16.95 Stopping all contexts
[2020-03-25 09:25:43.361][ERROR][manager-akka.actor.default-dispatcher-43] a.a.ActorSystemImpl.$anonfun$applyOrElse$1.69 Uncaught error from thread [manager-akka.actor.default-dispatcher-33]:orSystem[manager]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3520) ~[?:1.8.0_151]
        at com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:121) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:292) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readBytes(CodedInputStream.java:939) ~[protobuf-java-3.6.1.jar:?]
        at org.tensorflow.framework.TensorProto.<init>(TensorProto.java:98) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.TensorProto.<init>(TensorProto.java:13) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.TensorProto$1.parsePartialFrom(TensorProto.java:3922) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.TensorProto$1.parsePartialFrom(TensorProto.java:3917) ~[proto-1.10.0.jar:?]
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923) ~[protobuf-java-3.6.1.jar:?]
        at org.tensorflow.framework.AttrValue.<init>(AttrValue.java:118) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.AttrValue.<init>(AttrValue.java:15) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.AttrValue$1.parsePartialFrom(AttrValue.java:5258) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.AttrValue$1.parsePartialFrom(AttrValue.java:5253) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.AttrValue$Builder.mergeFrom(AttrValue.java:4113) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.AttrValue$Builder.mergeFrom(AttrValue.java:3906) ~[proto-1.10.0.jar:?]
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:907) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntryLite.parseField(MapEntryLite.java:128) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntryLite.parseEntry(MapEntryLite.java:184) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntry.<init>(MapEntry.java:106) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntry.<init>(MapEntry.java:50) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntry$Metadata$1.parsePartialFrom(MapEntry.java:70) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.MapEntry$Metadata$1.parsePartialFrom(MapEntry.java:64) ~[protobuf-java-3.6.1.jar:?]
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923) ~[protobuf-java-3.6.1.jar:?]
        at org.tensorflow.framework.NodeDef.<init>(NodeDef.java:90) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.NodeDef.<init>(NodeDef.java:9) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.NodeDef$1.parsePartialFrom(NodeDef.java:1678) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.NodeDef$1.parsePartialFrom(NodeDef.java:1673) ~[proto-1.10.0.jar:?]
        at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:923) ~[protobuf-java-3.6.1.jar:?]
        at org.tensorflow.framework.GraphDef.<init>(GraphDef.java:64) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.GraphDef.<init>(GraphDef.java:13) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.GraphDef$1.parsePartialFrom(GraphDef.java:1543) ~[proto-1.10.0.jar:?]
        at org.tensorflow.framework.GraphDef$1.parsePartialFrom(GraphDef.java:1538) ~[proto-1.10.0.jar:?]

How should I solve this problem?

KineticCookie commented 4 years ago

Hi @archichen. The exception comes from the protobuf parser. When you upload a model without an explicit contract, the manager service tries to parse the SavedModel proto message and extract the signature information. My guess is that the default Java memory settings weren't enough for a model of this size.

You can solve the problem by increasing the Java heap size. Pass appropriate values to the environment variables read by the manager's start script: https://github.com/Hydrospheredata/hydro-serving-manager/blob/5fe69ed83f02dd9465542738e4853dc91b264bf2/src/main/docker/start.sh#L25
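
To confirm that a larger heap setting actually reached the JVM, a small check like the one below may help. It is only a sketch, not part of hydro-serving: the class name HeapCheck, its helper methods, and the headroom factor of 3 are assumptions made for illustration. The factor is a rough guess based on the stack trace, where the parser copies tensor bytes out of an in-memory array, so the heap has to hold more than one copy of the model data at once.

```java
// Minimal sketch: report the heap limit the running JVM actually received
// and compare it with the size of the model being uploaded.
// HeapCheck and its methods are hypothetical names, not part of hydro-serving.
public class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Rough check: is the heap limit at least `factor` times the model size? */
    public static boolean hasHeadroom(long maxHeapBytes, long modelBytes, int factor) {
        return maxHeapBytes >= modelBytes * factor;
    }

    public static void main(String[] args) {
        // maxMemory() returns the maximum heap the JVM will attempt to use,
        // which reflects the effective -Xmx setting.
        long maxHeap = Runtime.getRuntime().maxMemory();

        // Model size taken from the report above (about 123 MB).
        long modelBytes = 123L * 1024L * 1024L;

        // Assumed rule of thumb, not a measured value.
        int factor = 3;

        System.out.println("Max heap: " + toMiB(maxHeap) + " MiB");
        System.out.println("Enough headroom for model: "
                + hasHeadroom(maxHeap, modelBytes, factor));
    }
}
```

Running it with the same JVM options as the manager (for example, java -Xmx1g HeapCheck) shows whether the limit you configured is the one in effect.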

archichen commented 4 years ago

@KineticCookie Thank you!