Train model on Google colab with TPU

eneshb commented 4 years ago

Thanks for the tutorials. I've managed to get training working on Google Colab in a runtime set to GPU. The object_detection folder also contains a script for training on Google's TPU (model_tpu_main.py). When I start this script with the same flags you used for model_main.py, it surprisingly detects the TPU, but then it crashes because of a mismatch:

INFO:tensorflow:TPU job name tpu_worker
I1122 02:07:09.691519 139921392695168 tpu_estimator.py:506] TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
I1122 02:07:13.428853 139921392695168 monitored_session.py:240] Graph was finalized.
ERROR:tensorflow:Error recorded from training_loop: From /job:tpu_worker/replica:0/task:0: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /root/datalab/pretrained_model/model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: '/root/datalab/pretrained_model/model.ckpt')

Do you know why this happens? Do I have to use different flags? Or do you know another way to train on Google Colab with a TPU?

TannerGilbert commented 4 years ago

I haven't tried it myself, but it seems like you need to pass it a tpu_zone flag. I will try it out when I find the time.

It would also be good to ask the question on the official TensorFlow models repository so more people can see it.

limsijie93 commented 4 years ago

@eneshb Could you share how you got the TensorFlow Object Detection API running with the GPU setting? It seems the readme here doesn't include instructions for setting up the GPU.

TannerGilbert commented 4 years ago

To enable GPU in Google Colab you need to go to Runtime > Change runtime type and select GPU.

Regarding the TensorFlow Object Detection API, it will automatically use the GPU if TensorFlow detects a compatible GPU.
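A quick way to confirm that TensorFlow actually sees the GPU after switching the runtime is to list the physical devices. This is a minimal sketch, assuming TensorFlow 2.1 or later; the helper name `visible_gpus` is made up for illustration, and it returns `None` when TensorFlow is not installed.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # list_physical_devices returns an empty list on CPU-only runtimes.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU detected; check Runtime > Change runtime type.")
```

If the list comes back empty in Colab, the runtime type is most likely still set to CPU.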

TannerGilbert commented 4 years ago

Found the following stackoverflow question on the topic: https://stackoverflow.com/questions/51965950/tensorflow-object-detection-api-w-tpu-training-display-more-granular-tensorbo

I will close this issue for now. If you find any further information feel free to add it here.

ngoanpv commented 4 years ago

To train with a TPU, you should put the checkpoint folder in Google Cloud Storage and reference it with a gs:// path.
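This seems to fit the error in the first comment: the TPU worker runs on a separate host and cannot read files on the notebook VM's local disk, so the paths the training job touches need to be gs:// URLs. A small pre-flight check can catch local paths before the job starts. This is only a sketch; `find_local_paths` and the bucket name `my-bucket` are hypothetical.

```python
def find_local_paths(paths):
    """Return the entries of `paths` (a name -> path dict) that are not gs:// URLs."""
    return {name: path for name, path in paths.items()
            if not path.startswith("gs://")}


if __name__ == "__main__":
    # The first path comes from the error log above; the bucket is a placeholder.
    paths = {
        "fine_tune_checkpoint": "/root/datalab/pretrained_model/model.ckpt",
        "model_dir": "gs://my-bucket/training",
    }
    for name, path in find_local_paths(paths).items():
        print(f"{name} is on the local disk and must be moved to GCS: {path}")
```

The check only inspects the path prefix; it does not verify that the bucket exists or that the notebook is authorized to read it.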

TannerGilbert commented 4 years ago

Also take a look at the new Training and Evaluation with TensorFlow 2 documentation, which includes a training with TPU section.

satya400 commented 4 years ago

Hi Gilbert, I just started exploring the TF2 Object Detection API, so I wanted to check whether there is a tutorial or step-by-step guide for retraining a pretrained model from the model zoo using a TPU on Google Colab. Everything I could find on Google refers to a TPU setup with GCS. I have successfully retrained on Colab using CPU/GPU but could not make progress with the TPU. Any pointers would be a great help.

Thanks Satya

TannerGilbert commented 4 years ago

Hey Satya,

I have also only seen TPU scripts that use Google Cloud, and I personally haven't tried using a TPU. Maybe it's best to create an issue in the official repository so more people can help you.

satya400 commented 4 years ago

Thanks, Gilbert, for the quick reply. I ran some trials and found that the following works for training on a TPU in Google Colab. However, it needs the training data as well as the checkpoints to be stored on GCS, which is not very convenient.

!python ../models/research/object_detection/model_main_tf2.py \
  --use_tpu=true --pipeline_config_path=XXXXXXXXXX \
  --model_dir=XXXXXXXXXXXXXXXXX --num_train_steps=XXXXXX --num_eval_steps=XXX --training=true

We just need to change the Runtime to TPU in Colab and then the model_main_tf2.py has some initialization code which automatically recognizes the TPU name etc., which is very convenient.
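For reference, the usual TensorFlow 2 sequence for attaching to a TPU looks roughly like the following. This is a sketch of the standard `tf.distribute` setup, not a copy of what model_main_tf2.py does; the function name `connect_to_tpu` is made up, and it assumes TensorFlow 2.3 or later running in a Colab TPU runtime.

```python
def connect_to_tpu(tpu_name=""):
    """Connect to a TPU and return a TPUStrategy (requires a TPU runtime)."""
    import tensorflow as tf

    # An empty name lets the resolver pick up the TPU address from the environment.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
    # Make the remote TPU worker visible to this Python process.
    tf.config.experimental_connect_to_cluster(resolver)
    # Reset and initialize the TPU cores before building any model.
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)
```

In a TPU runtime, `strategy = connect_to_tpu()` would be called once at startup, and model creation would then go inside `with strategy.scope():`.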

One more observation - if we call model_tpu_main.py directly, it throws some exceptions. Hence model_main_tf2.py is the best way.

With the above command I got further, but I did not get to actual training because I do not have access to GCS.

It would be better if there were some way to use the Colab file system itself for self-learning projects. I will check the official repository for any clues.

Thanks Satya

TannerGilbert / Tensorflow-Object-Detection-API-Train-Model

Train model on Google colab with TPU #1