captain-pool / GSOC

Repository for Google Summer of Code 2019 https://summerofcode.withgoogle.com/projects/#4662790671826944
MIT License

TPU Estimator Crashing #12

Open captain-pool opened 5 years ago

captain-pool commented 5 years ago

TensorFlow version: tensorflow==2.0.0b0
TensorFlow Datasets version: tfds-nightly==1.0.2.dev201906090105
TensorFlow Hub version: tf-hub-nightly==0.5.0.dev201905270046

Issue

The code raises the following error for every value of max_steps passed to TPUEstimator.train(...):

End of sequence [[node input_pipeline_task0/while/IteratorGetNext (defined at image_retraining_tpu.py:139) ]]

Reproduce the issue

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=8

The same error is raised for each of the following:

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=4
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=100
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=500
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=1000

Line 139

https://github.com/captain-pool/GSOC/blob/513a0ec2a34094c702eb3e3c7197bff5037c9610/E1_TPU_Sample/image_retraining_tpu.py#L135-L139

Log file

The error starts at line 230 of output.log.

CC: @srjoglekar246 @vbardiovskyg

srjoglekar246 commented 5 years ago

This looks like a bug in TPUEstimator. As far as I understand this part of the docs, the Estimator API handles the OutOfRange error raised by the input function by stopping iteration, without raising an exception. TPUEstimator doesn't seem to behave that way yet. Can you open an issue on TF to cross-check? Also, does the script work with the try...except block?
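
The difference described above can be modeled without TensorFlow. What follows is only a plain-Python sketch, not TPUEstimator code: `train`, `stop_on_exhaustion`, and the three-batch input are hypothetical names chosen for illustration, and `itertools.cycle` stands in for repeating the dataset (for example with `tf.data.Dataset.repeat()`). Whether repeating the dataset avoids the crash in this particular script has not been verified here.

```python
import itertools


def train(batches, max_steps, stop_on_exhaustion):
    """Consume up to max_steps batches; return the number of steps completed."""
    iterator = iter(batches)
    steps = 0
    while steps < max_steps:
        try:
            next(iterator)
        except StopIteration:
            if stop_on_exhaustion:
                break  # treat the end of input as a normal end of training
            # otherwise surface the exhaustion as an error
            raise RuntimeError("End of sequence") from None
        steps += 1
    return steps


finite = [0, 1, 2]  # hypothetical input holding only three batches

# Exhaustion handled: training stops early and no exception escapes.
steps_graceful = train(finite, max_steps=8, stop_on_exhaustion=True)

# Exhaustion not handled: the same input raises before max_steps is reached.
try:
    train(finite, max_steps=8, stop_on_exhaustion=False)
    crashed = False
except RuntimeError:
    crashed = True

# Repeating the input indefinitely means it never runs out.
steps_repeated = train(itertools.cycle(finite), max_steps=8, stop_on_exhaustion=False)
```

In this model the first call mirrors the behavior the docs describe for Estimator, the second mirrors the crash reported in this issue, and the third shows why an endlessly repeating input sidesteps the question altogether.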

captain-pool commented 5 years ago

No, it doesn't. Oddly enough, the code doesn't stop running at all. It keeps reporting that the TPU is healthy and tries to refresh the token, and it never exits even though there is no more code to execute.