GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0
1.52k stars 857 forks source link

criteo_ftf does not run on Google cloud #131

Closed behnamprime closed 6 years ago

behnamprime commented 6 years ago

For problems running the sample code please provide the following information.

System information

README.txt

requirements.txt

To obtain the Tensorflow and Tensorflow Transform environment do

pip freeze |grep tensorflow
pip freeze |grep apache-beam

Describe the problem

I'm just following the criteo_tft example and I want to run it on Google Cloud. It will run locally but it does not run on the cloud. it runs for like an hour and then it fails. I will attach the error. But when I checked the dataflow, the writetransform_fn does not get activated at all.

Source code / logs

Collecting apache-beam==2.2.0 Using cached apache-beam-2.2.0.zip Saved /tmp/tmp0Xyyoq/apache-beam-2.2.0.zip Successfully downloaded apache-beam Traceback (most recent call last): File "preprocess.py", line 242, in main() File "preprocess.py", line 238, in ma

in delimiter=args.delimiter) File "/home/behnam/Desktop/cloudml-samples/criteo_tft/gc/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 346, in exit self.run().wait_until_finish() File "/home/behnam/Desktop/cloudml-samples/criteo_tft/gc/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 966, in wait_until_finish (self.state, getattr(self._runner, 'last_error_msg', None)), self) apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: (3475356fbe77dee7): Workflow failed. Causes: (3475356fbe77d0ae): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.

puneith commented 6 years ago

Can you please send me / look at the information from the Dataflow logs. It looks like a potential VM quota issue, but logs will have more information.

puneith commented 6 years ago

@davidcavazos does your commit have the fix?

davidcavazos commented 6 years ago

I don't think so, I gave this a quick check, but I don't know what's going on there.

puneith commented 6 years ago

@behnamprime can you confirm if this is a VM quota issue from Dataflow console logs.

puneith commented 6 years ago

@behnamprime closing due to inactivity. Feel free to reopen or file new issue.