Closed: mohammedayub44 closed this issue 5 years ago
Hi,
This looks like a TensorFlow Serving version that is older than the TensorFlow version used to produce the saved model. Can you check if this applies?
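One quick way to check is to compare the TensorFlow version recorded in the export with the TensorFlow version on the serving side. A minimal sketch, assuming TF 1.x is installed locally (the export path below is the demo path used elsewhere in this thread):

```python
# Minimal sketch: read the TF version baked into the SavedModel and
# compare it with the local TF version. The export path is an assumption
# taken from the serving command in this thread.
import tensorflow as tf
from tensorflow.core.protobuf import saved_model_pb2

saved_model = saved_model_pb2.SavedModel()
with open("ende/1539080952/saved_model.pb", "rb") as f:
    saved_model.ParseFromString(f.read())

print("exported with TF:", saved_model.meta_graphs[0].meta_info_def.tensorflow_version)
print("local TF version:", tf.__version__)
```

If the exporting version is newer than the serving binary, that would explain the failure.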
On my Elastic Inference (EI) machine, I downgraded my tensorflow and tensorflow-serving-api versions from 1.12.0 to 1.11.0 to match my other machine. I ran the command on both machines and it still gives me the same error message on the EI machine.
-Mohammed Ayub
@guillaumekln
Just to try out other NMT models, I also tried our custom NMT model (apart from the demo model). It seemed to start the server correctly, but I got an error while sending the request from the client.
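For reference, the request was built along these lines (a sketch modeled on the OpenNMT-tf serving example; the tokens/length input names come from that example, and enes is the model name from the command below):

```python
# Sketch of the gRPC client call, modeled on the OpenNMT-tf serving
# example. The "tokens"/"length" input names follow that example's
# exported signature; host, port, and the test sentence are placeholders.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

batch_tokens = [["Hello", "world", "!"]]
batch_lengths = [len(tokens) for tokens in batch_tokens]

request = predict_pb2.PredictRequest()
request.model_spec.name = "enes"
request.inputs["tokens"].CopyFrom(tf.make_tensor_proto(batch_tokens, dtype=tf.string))
request.inputs["length"].CopyFrom(tf.make_tensor_proto(batch_lengths, dtype=tf.int32))

response = stub.Predict(request, 30.0)  # 30 second timeout
print(response.outputs["tokens"])
```

The server output follows: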
```
(amazonei_tensorflow_p36) ubuntu@ip-172-31-13-226:~/OpenNMT-tf/examples/serving$ AmazonEI_TensorFlow_Serving_v1.12_v1 --model_name=enes --model_base_path=/home/ubuntu/OpenNMT-tf/examples/serving/enes/ --port=9000
2019-01-31 16:46:44.462744: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: enes model_base_path: /home/ubuntu/OpenNMT-tf/examples/serving/enes/
2019-01-31 16:46:44.462880: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-01-31 16:46:44.462894: I tensorflow_serving/model_servers/server_core.cc:558] (Re-)adding model: enes
2019-01-31 16:46:44.563135: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: enes version: 1545081532}
2019-01-31 16:46:44.563160: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: enes version: 1545081532}
2019-01-31 16:46:44.563173: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: enes version: 1545081532}
2019-01-31 16:46:44.563192: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /home/ubuntu/OpenNMT-tf/examples/serving/enes/1545081532
2019-01-31 16:46:44.563205: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /home/ubuntu/OpenNMT-tf/examples/serving/enes/1545081532
2019-01-31 16:46:44.582681: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-01-31 16:46:44.608210: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2019-01-31 16:46:44.687842: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2019-01-31 16:46:44.984697: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key saved_model_main_op on SavedModel bundle.
2019-01-31 16:46:45.048203: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: success. Took 484983 microseconds.
Using Amazon Elastic Inference Client Library Version: 1.2.8
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452
Elastic Inference Accelerator Type: eia1.medium
2019-01-31 16:46:46.935920: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at lookup_table_init_op.cc:143 : Not found: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt; No such file or directory
2019-01-31 16:46:46.935928: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at lookup_table_init_op.cc:143 : Not found: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt; No such file or directory
2019-01-31 16:46:46.935920: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at lookup_table_init_op.cc:143 : Not found: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt; No such file or directory
2019-01-31 16:46:46.944061: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:83] No warmup data file found at /home/ubuntu/OpenNMT-tf/examples/serving/enes/1545081532/assets.extra/tf_serving_warmup_requests
2019-01-31 16:46:46.945022: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: enes version: 1545081532}
2019-01-31 16:46:46.946248: I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:9000 ...
2019-01-31 16:49:15.805971: I EI-WARNING: :0]
EI may incur sub-optimal inference latency on this model due to below operators. Please contact amazon-ei-feedback@amazon.com with this message if inference latency does not meet your application requirements
Operators: HashTableV2 InitializeTableFromTextFileV2 LookupTableFindV2 LookupTableSizeV2 StringToHashBucketFast
[Thu Jan 31 16:49:44 2019, 471280us] [Execution Engine][TensorFlow][3] Failed - Last Error: EI Error Code: [12, 5, 13] EI Error Description: Internal error EI Request ID: TF-977D79E1-BB85-4D04-85E2-9B7BA3DBE11A -- EI Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452 EI Client Version: 1.2.8
2019-01-31 16:49:44.473289: W external/org_tensorflow/tensorflow/contrib/ei/kernels/eia_op.cc:166] EI Error: failed at eia_op.cc:166 : Not found: Last Error: EI Error Code: [12, 5, 13] EI Error Description: Internal error EI Request ID: TF-977D79E1-BB85-4D04-85E2-9B7BA3DBE11A -- EI Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452 EI Client Version: 1.2.8
2019-01-31 16:49:44.484051: E external/org_tensorflow/tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Not found: Last Error: EI Error Code: [12, 5, 13] EI Error Description: Internal error EI Request ID: TF-977D79E1-BB85-4D04-85E2-9B7BA3DBE11A -- EI Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452 EI Client Version: 1.2.8
2019-01-31 16:49:51.958703: E external/org_tensorflow/tensorflow/core/common_runtime/executor.cc:623] Executor failed to create kernel. Not found: Last Error: EI Error Code: [12, 5, 13] EI Error Description: Internal error EI Request ID: TF-977D79E1-BB85-4D04-85E2-9B7BA3DBE11A -- EI Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452 EI Client Version: 1.2.8
[[{{node eop_5a7f25ee8db04424}} = EIAOp[InT=[DT_INT32, DT_INT32, DT_INT64, DT_INT64, DT_INT64, DT_INT64, DT_INT64, DT_INT32, DT_INT32, DT_INT32], OutT=[DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT64], input_nodes=["Placeholder_1/_7", "Placeholder_1/_8", "string_to_index_Lookup/hash_bucket/_9", "string_to_index_Lookup/hash_table_Size/_10", "string_to_index_Lookup/hash_table_Lookup/_11", "string_to_index/hash_table/Const/_12", "string_to_index_Lookup/hash_table_Lookup/_13", "Placeholder_1/_14", "Placeholder_1/_15", "Placeholder_1/_16"], output_nodes=["transformer/decoder/while/Exit_6_tmp", "transformer/decoder/Select_1_tmp", "transformer/decoder/sub_tmp", "transformer/Cast"], parent_model_id="f42996d046e728a7", serialized_engine="\n\201\001...n\002:\000", _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_1_0_1, _arg_Placeholder_1_0_1, string_to_index_Lookup/hash_bucket, string_to_index_Lookup/hash_table_Size, string_to_index_Lookup/hash_table_Lookup, string_to_index/hash_table/Const, string_to_index_Lookup/hash_table_Lookup, _arg_Placeholder_1_0_1, _arg_Placeholder_1_0_1, _arg_Placeholder_1_0_1)]]
```
My thinking is that the EI folks have not implemented every operator used in NMT models in their graph conversion script?
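One way to check would be to list every operator type in the exported graph and compare it against the EI warning above (a quick sketch; the export path is from the command above):

```python
# Quick sketch: print the distinct op types used by the exported graph so
# they can be compared with the operators flagged in the EI warning.
from tensorflow.core.protobuf import saved_model_pb2

saved_model = saved_model_pb2.SavedModel()
with open("enes/1545081532/saved_model.pb", "rb") as f:
    saved_model.ParseFromString(f.read())

print(sorted({node.op for node in saved_model.meta_graphs[0].graph_def.node}))
```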
-Mohammed Ayub
2019-01-31 16:46:46.935920: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at lookup_table_init_op.cc:143 : Not found: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt; No such file or directory
The warmup phase is failing on this, so it is likely also failing on the first request.
Is this a path on your training server? If yes, how did you export the model and what was the configuration?
My thinking is that the EI folks have not implemented every operator used in NMT models in their graph conversion script?
Do they have a conversion script? Is this documented?
2019-01-31 16:46:46.935920: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at lookup_table_init_op.cc:143 : Not found: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt; No such file or directory
The warmup phase is failing on this, so it is likely also failing on the first request.
Yes, this was the path on my training server. The model was exported automatically by the train_and_eval command. The config file looked something like this:
```yaml
model_dir: /home/ubuntu/mayub/datasets/in_use/euro/nfpa_full_data_runs/en_es_transformer_e_da

data:
  train_features_file: /home/ubuntu/mayub/datasets/in_use/euro/nfpa_full_data_runs/nfpa_train_tokenized_bpe.en
  train_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/nfpa_full_data_runs/nfpa_train_tokenized_bpe.es
  eval_features_file: /home/ubuntu/mayub/datasets/in_use/euro/nfpa_full_data_runs/nfpa_dev_tokenized_bpe.en
  eval_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/nfpa_full_data_runs/nfpa_dev_tokenized_bpe.es
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt

params:
  replace_unknown_target: true

train:
  save_checkpoints_steps: 1000
  keep_checkpoint_max: 3
  save_summary_steps: 1000
  train_steps: 30000
  batch_size: 3072

eval:
  eval_delay: 1800
  external_evaluators: [BLEU, BLEU-detok]
```
Those two files are present in the 'assets' folder of my exported model:

```
enes/1545081532/assets/src_vocab_50k.txt
enes/1545081532/assets/trg_vocab_50k.txt
```

I'm not sure why it's looking for the training server path. Is there any way I can change it to look in the assets folder instead?
Do they have a conversion script? Is this documented?
I don't think they have open-sourced the script. I have emailed them and am waiting for feedback.
I'm not sure why it's looking for the training server path. Is there any way I can change it to look in the assets folder instead?
The paths should be relative to the assets directory by default. I will investigate why it appears to save absolute paths instead.
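In the meantime, you can check which paths were actually serialized into the export by scanning the graph for string constants. A quick sketch (the export path and vocabulary file names are assumptions taken from your messages above):

```python
# Sketch: look for vocabulary file paths baked into Const nodes of the
# exported graph. Paths and file names are assumptions from this thread.
import tensorflow as tf
from tensorflow.core.protobuf import saved_model_pb2

saved_model = saved_model_pb2.SavedModel()
with open("enes/1545081532/saved_model.pb", "rb") as f:
    saved_model.ParseFromString(f.read())

for node in saved_model.meta_graphs[0].graph_def.node:
    if node.op == "Const" and node.attr["dtype"].type == tf.string.as_datatype_enum:
        for value in node.attr["value"].tensor.string_val:
            if b"vocab" in value:
                print(node.name, value)
```

If those constants hold absolute paths instead of paths relative to the assets directory, that would explain the lookup failure.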
Awesome, thanks!
Did you get any updates?
On my side, I was not able to reproduce the issue with the absolute path.
The Amazon EI team said that they could reproduce the issue and are working to find a solution for it.
On my side, I was not able to reproduce the issue with the absolute path.
Hmm... So when you export the model (with the train_and_eval command), does it give you relative paths to the assets folder?
Let me know if there is any other way I can troubleshoot the path.
-Mohammed Ayub
@guillaumekln Not sure if this is fixed in the latest releases.
I got a response from the Amazon team on their end:
"The root cause of the issue here is a custom function within the tensor2tensor library. We currently don’t support custom operators or functions and so this model fails to run on EI. The tensor2tensor library uses a custom function named convert_gradient_to_tensor. In the case of tensor2tensor, this convert_gradient_to_tensor function is used when the model is trained but is essentially not used when the model is applied for inference. "
My assumption is that there should be a way to not include these functions (or any functions) that are not required during inference in the model export step. If not, can this be another feature request? I'm guessing this would be helpful for others as well (and hopefully shrinks the model size too).
Mohammed Ayub
I'm not sure I understand why Tensor2Tensor, another GitHub project, is coming up here.
The exported models from OpenNMT-tf do not use such custom operations and they work with the official TensorFlow Serving binaries.
Hi @guillaumekln, the OpenNMT-tf saved model export includes convert_gradient_to_tensor, which is a non-standard TF operator.
Not sure, but I found this section in [Layers](http://opennmt.net/OpenNMT-tf/_modules/opennmt/layers/common.html):

```python
@function.Defun(
    python_grad_func=lambda x, dy: tf.convert_to_tensor(dy),
    shape_func=lambda op: [op.inputs[0].get_shape()])
def convert_gradient_to_tensor(x):
  """Wraps :obj:`x` to convert its gradient to a tensor."""
  return x
```
Not sure if there are any other custom functions used.
-Mohammed Ayub
This function is no longer used since this commit: https://github.com/OpenNMT/OpenNMT-tf/commit/aa6c542f184b8d65616da96e7ad6f4515d40bbcd
Can you give more details about when and how the model was exported?
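You can also verify whether your export still embeds this function by listing the function library of the saved graph (a quick sketch; the export path is the one from earlier in this thread):

```python
# Sketch: list the custom functions recorded in the export's function
# library; convert_gradient_to_tensor would show up here if still used.
from tensorflow.core.protobuf import saved_model_pb2

saved_model = saved_model_pb2.SavedModel()
with open("enes/1545081532/saved_model.pb", "rb") as f:
    saved_model.ParseFromString(f.read())

for function in saved_model.meta_graphs[0].graph_def.library.function:
    print(function.signature.name)
```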
You can also try out this serving example to ensure that the exported model works:
https://github.com/OpenNMT/OpenNMT-tf/blob/master/examples/serving/README.md
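Before involving the EI binary, a quick local sanity check of the export can also help (a sketch, assuming TF 1.x; the export path is the demo path from this thread):

```python
# Sketch: load the export locally and print its serving signature to
# confirm the model itself is loadable outside of the EI serving binary.
import tensorflow as tf

export_dir = "ende/1539080952"  # assumed demo export path from this thread
with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(sess, ["serve"], export_dir)
    signature = meta_graph.signature_def["serving_default"]
    print("inputs:", list(signature.inputs))
    print("outputs:", list(signature.outputs))
```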
Since this was a couple of months ago, I was using an older version, and I see a lot of fixes have landed since then. I may need to run the process again with the latest version and see how it goes. As long as the dependency on custom wrapper functions no longer exists, I think the model export and EI should run fine. Will update you in a couple of days.
I tried this with the updated custom binary from the Elastic Inference team and it worked. I still need to try running models built with the latest OpenNMT-tf version. Closing this for now.
Hi @guillaumekln,
I have generated a Transformer model using this repo. I'm running the serving model and it runs perfectly fine. AWS released something called Elastic Inference (EI), which makes inference cost-effective. The EI example also works fine independently on the AMI. I'm trying to see how I can combine the two, i.e. run the OpenNMT model using Elastic Inference. I try to instantiate the server and get the error described below:
```
(amazonei_tensorflow_p36) ubuntu@ip-172-31-13-226:~/OpenNMT-tf/examples/serving$ AmazonEI_TensorFlow_Serving_v1.12_v1 --model_name=ende --model_base_path=/home/ubuntu/OpenNMT-tf/examples/serving/ende --port=9000
```

Error message:

```
2019-01-31 16:37:29.234486: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: ende model_base_path: /home/ubuntu/OpenNMT-tf/examples/serving/ende
2019-01-31 16:37:29.234620: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-01-31 16:37:29.234634: I tensorflow_serving/model_servers/server_core.cc:558] (Re-)adding model: ende
2019-01-31 16:37:29.335951: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: ende version: 1539080952}
2019-01-31 16:37:29.335975: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: ende version: 1539080952}
2019-01-31 16:37:29.335989: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: ende version: 1539080952}
2019-01-31 16:37:29.336007: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /home/ubuntu/OpenNMT-tf/examples/serving/ende/1539080952
2019-01-31 16:37:29.337094: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /home/ubuntu/OpenNMT-tf/examples/serving/ende/1539080952
2019-01-31 16:37:29.357366: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-01-31 16:37:29.383893: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2019-01-31 16:37:29.484275: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2019-01-31 16:37:29.833011: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key legacy_init_op on SavedModel bundle.
2019-01-31 16:37:29.899536: I external/org_tensorflow/tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /home/ubuntu/OpenNMT-tf/examples/serving/ende/1539080952/assets/wmtende.vocab is already initialized.
2019-01-31 16:37:29.899685: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: success. Took 562581 microseconds.
Using Amazon Elastic Inference Client Library Version: 1.2.8
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-fc41d6899c3a4fee8648a6be60c8e452
Elastic Inference Accelerator Type: eia1.medium
2019-01-31 16:37:31.567585: F external/org_tensorflow/tensorflow/contrib/ei/convert/convert_graph.cc:792] Non-OK-status: tensorflow::ConvertGraphDefToGraph( tensorflow::GraphConstructorOptions(), in_graph_def, &graph) status: Not found: No attr named 'out_type' in NodeDef: [[{{node transformer/encoder_1/Shape}} = Shape[T=DT_FLOAT, _output_shapes=[[3]]]] [[{{node transformer/encoder_1/Shape}} = Shape[T=DT_FLOAT, _output_shapes=[[3]]]]
Aborted at 1548952651 (unix time) try "date -d @1548952651" if you are using GNU date
PC: @ 0x0 (unknown)
Aborted (core dumped)
```
Not sure what I'm doing wrong. Any help is appreciated.
Thanks !
Mohammed Ayub