GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

Hyper parameter tuner cannot find objective value in eager mode #408

Closed katotetsuro closed 5 years ago

katotetsuro commented 5 years ago

Describe the bug When a training job runs in eager mode, the hyperparameter tuner cannot find the objective value. This happens on the server (AI Platform). (Screenshot attached: Screen Shot 2019-05-08 at 18 06 46)

What sample is this bug related to? census/tf-keras

Source code / logs I added this one line of code

tf.enable_eager_execution()

at the beginning of training.

https://github.com/GoogleCloudPlatform/cloudml-samples/blob/83d45fb189cf69d1716ee2d99c39a12580982827/census/tf-keras/trainer/task.py#L128

To Reproduce Steps to reproduce the behavior:

  1. Add tf.enable_eager_execution().
  2. Run this command from the census/tf-keras directory: gcloud ai-platform jobs submit training [jobname] --module-name trainer.task --package-path trainer/ --runtime-version 1.13 --python-version 3.5 --job-dir [jobdir] --config hptuning_config.yaml
  3. After the job has finished, run gcloud ai-platform jobs describe [jobname]. The output contains no finalMetric entries.
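
The check in step 3 can also be done programmatically. The following is a minimal sketch, assuming the job description has been saved as JSON (for example with gcloud's general --format=json flag) and that trials are listed under trainingOutput.trials, each with an optional finalMetric entry. The helper name trials_missing_final_metric and the sample data are hypothetical and written by hand for illustration; they are not output from a real job.

```python
import json


def trials_missing_final_metric(job_description):
    """Return the trialId of every trial that has no finalMetric entry.

    `job_description` is assumed to be the parsed JSON of a job description,
    with trials listed under trainingOutput.trials.
    """
    training_output = job_description.get("trainingOutput", {})
    trials = training_output.get("trials", [])
    return [trial.get("trialId") for trial in trials if "finalMetric" not in trial]


# Hand-written sample data for illustration only (not real job output).
sample = json.loads("""
{
  "jobId": "example_job",
  "trainingOutput": {
    "isHyperparameterTuningJob": true,
    "trials": [
      {
        "trialId": "1",
        "hyperparameters": {"learning-rate": "0.01"},
        "finalMetric": {"trainingStep": "1000", "objectiveValue": 0.85}
      },
      {
        "trialId": "2",
        "hyperparameters": {"learning-rate": "0.1"}
      }
    ]
  }
}
""")

missing = trials_missing_final_metric(sample)
print("Trials without finalMetric:", missing)
```

With the bug described here, every trial of the job would be expected to show up in the returned list.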

Expected behavior The hyperparameter tuner finds the objective value.

System Information

Below is the AI Platform runtime information.

Additional context I know eager mode isn't necessary for the census/tf-keras example, but I want to use eager mode for my private project. I guess this problem is related to the difference between TensorBoard's summary writers in eager and graph mode.

gogasca commented 5 years ago

Will try to reproduce this issue and get back to you this week, thanks

gogasca commented 5 years ago

@katotetsuro I was able to reproduce the problem and will get back to you. We are using the TensorBoard callback to write stats: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/callbacks_v1.py#L133 As the warning in the log shows, weight and gradient histograms are not supported for eager execution. I will see if there is a workaround to write stats without the TB callback.

WARNING 2019-06-19 10:03:09 -0700   master-replica-0    4   Weight and gradient histograms not supported for eagerexecution, setting `histogram_freq` to `0`.

gogasca commented 5 years ago

https://www.tensorflow.org/api_docs/python/tf/summary/FileWriter FileWriter is not compatible with eager execution. To write TensorBoard summaries under eager execution, use tf.contrib.summary instead.

gogasca commented 5 years ago

Take a look at this example: https://medium.com/@konpat/tensorflow-summary-api-v2-5fa760d04680

andrewferlitsch commented 5 years ago

Closed as answered.