GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0
1.52k stars, 855 forks

No op named HashTableV2 in defined operations #104

Closed luckyapplehead closed 6 years ago

luckyapplehead commented 7 years ago

For problems running the sample code please provide the following information.

System information

Describe the problem

I ran the code with the same command as in the guide to train the DNN model on the server (Google Cloud ML Engine), but got the following error: "No op named HashTableV2 in defined operations". I also tried 'tensorflow-transform==0.1.10' and 'tensorflow==1.2.0', as described in requirement.txt and setup.py, but the same error occurred. It seems the environment requirements listed for this example are not correct. @elmer-garduno

Source code / logs

The full error log is below:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 813, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 809, in main
    output_dir=output_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 465, in train_and_evaluate
    export_results = self._maybe_export(eval_result)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 484, in _maybe_export
    compat.as_bytes(strategy.name))))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/export_strategy.py", line 32, in export
    return self.export_fn(estimator, export_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py", line 283, in export_fn
    exports_to_keep=exports_to_keep)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/framework/experimental.py", line 64, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1258, in export_savedmodel
    input_ops = input_fn()
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/input_fn_maker.py", line 46, in serving_input_fn
    receiver = receiver_fn()
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/input_fn_maker.py", line 375, in parsing_transforming_serving_input_receiver_fn
    transform_savedmodel_dir, raw_features))
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/saved_transform_io.py", line 248, in partially_apply_saved_transform
    saved_model_dir, logical_input_map, tensor_replacement_map)
  File "/root/.local/lib/python2.7/site-packages/tensorflow_transform/saved/saved_transform_io.py", line 142, in _partially_apply_saved_transform_impl
    input_map=input_map)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1566, in import_meta_graph
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 498, in import_scoped_meta_graph
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 260, in import_graph_def
    raise ValueError('No op named %s in defined operations.' % node.op)
ValueError: No op named HashTableV2 in defined operations.

luckyapplehead commented 7 years ago

I fixed the issue by changing the requirement.txt to:

tensorflow==1.3.0
tensorflow-transform==0.3.1
protobuf==3.4.0

and adding the install requirements to setup.py as below:

TENSORFLOW = 'tensorflow==1.3.0'
TENSORFLOW_TRANSFORM = 'tensorflow-transform==0.3.1'
PROTOBUF = 'protobuf==3.4.0'
...
install_requires=[TENSORFLOW, TENSORFLOW_TRANSFORM, PROTOBUF])

With these changes the training process runs successfully. But when I go to the next step, creating the model version, the same error occurs again.

command I used: gcloud ml-engine versions create "v1" --model "movielens" --origin "${MODEL_SOURCE}"

Error message: ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Error loading the model: Could not load model: Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'HashTableV2'\n\n"

Is there anything I can do to fix this problem?

puneith commented 7 years ago

Can you please try the updated setup.py? Also, looking at the command above, I am assuming this is movielens you are running?

luckyapplehead commented 7 years ago

@puneith Hi puneith, could you tell me what changes I should make to setup.py? Yes, I'm running movielens :)

luckyapplehead commented 7 years ago

I updated setup.py to the newest version, and the same error occurs when I run the following command: gcloud ml-engine versions create "v4" --model "movielens" --origin "${MODEL_SOURCE}"

error message: ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Error loading the model: Could not load model: Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'HashTableV2'\n\n"

puneith commented 7 years ago

@luckyapplehead Since Cloud ML Engine is still on TF 1.2, training with TF 1.3 causes a discrepancy between training and prediction. We will have TF 1.4 available on Cloud ML Engine very soon, so this should go away. In the meantime, can you please try training with TF 1.2?
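
The training/prediction mismatch described here can be sketched as a simple version comparison. This is only an illustration, not part of the sample: the helper names `major_minor` and `runtime_can_serve` are hypothetical, and the rule it encodes (a serving runtime at least as new as the training TensorFlow registers every op in the exported graph) is an assumption drawn from this thread, not a documented guarantee.

```python
def major_minor(version):
    """Parse a version string such as '1.3.0' or '1.4' into (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def runtime_can_serve(training_version, runtime_version):
    """Heuristic check for op availability at serving time.

    Assumption (not a documented guarantee): a serving runtime that is at
    least as new as the TensorFlow used for training registers every op the
    exported graph references, e.g. HashTableV2.
    """
    return major_minor(training_version) <= major_minor(runtime_version)


# Trained with TF 1.3, served on a TF 1.2 runtime: likely to hit the error.
print(runtime_can_serve("1.3.0", "1.2"))  # False
# Trained and served on TF 1.2.
print(runtime_can_serve("1.2.0", "1.2"))  # True
# Trained with TF 1.3, served on a TF 1.4 runtime.
print(runtime_can_serve("1.3.0", "1.4"))  # True
```

A check like this could run before deployment to flag a model trained on a newer TensorFlow than the runtime that will serve it.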

puneetjindal commented 6 years ago

@puneith I am getting this error with TF 1.2. I also raised a support ticket, but it did not help.

puneith commented 6 years ago

@luckyapplehead Can you please try training with TF 1.2 and then create the model to see if the error persists? That means you will need to make sure the TF 1.3 requirement in setup.py is commented out.

puneith commented 6 years ago

@puneetjindal Are you getting the error with TF 1.2 for training or for prediction? If it's for prediction, can you please confirm you trained using TF 1.2?

dsdelhi commented 6 years ago

@puneith My training with TF 1.2 works fine. My model gets exported to GCS along with the variables folder. The issue comes when I create a version of the model in GCP ML Engine. I trained using TF 1.2 only.

puneith commented 6 years ago

@dsdelhi Can I please get your GCP project_id?

puneetjindal commented 6 years ago

@puneith I hope you have noted my project id

puneith commented 6 years ago

@puneetjindal @dsdelhi Sorry for the delay. Are we still seeing this issue on TF 1.4 Cloud ML Engine? Can you send me your project_id if you are still seeing the issue? @puneetjindal I don't see your project id in this thread.

dsdelhi commented 6 years ago

I shared the project id, but the issue still exists.


puneith commented 6 years ago

@dsdelhi Where did you share it?

girijaravishankar commented 6 years ago

I have the same problem, although I see that GC MLE now uses TF 1.4

girijaravishankar commented 6 years ago

I solved this by deploying with an explicit --runtime-version 1.4 argument.

puneith commented 6 years ago

The default runtime_version for CMLE is still 1.0, so you need to specify --runtime-version as @girijaravishankar mentions in the comment above. If anyone is still facing the issue, please reopen this.
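
For reference, the fix amounts to adding `--runtime-version` to the `versions create` command used earlier in this thread. Below is a minimal sketch that assembles that command in Python. The helper name `versions_create_command` and the `gs://your-bucket/path/to/export` origin are placeholders for illustration; substitute your own export path.

```python
import shlex


def versions_create_command(version, model, origin, runtime_version):
    """Build the gcloud argument list for creating a model version.

    The runtime version is passed explicitly instead of relying on the
    service default.
    """
    return [
        "gcloud", "ml-engine", "versions", "create", version,
        "--model", model,
        "--origin", origin,
        "--runtime-version", runtime_version,
    ]


# Placeholder origin: replace with the GCS path of your exported model.
cmd = versions_create_command(
    "v1", "movielens", "gs://your-bucket/path/to/export", "1.4")

# Print a shell-safe command line that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
```

Pick a runtime version that is at least as new as the TensorFlow version used for training.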