aws-samples / amazon-sagemaker-tensorflow-object-detection-api

Train and deploy models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
MIT No Attribution

GPU is not utilized when training. #8

Closed t-T-s closed 3 years ago

t-T-s commented 3 years ago

I am using SSD MobileNet V2 FPNLite 320x320 from the TensorFlow detection model zoo, which is supported for GPU training. I made several attempts to get some GPUUtilization to show up in CloudWatch, but no luck! The following is what I have tried.

I tried both ml.p2.xlarge and ml.p2.8xlarge (on two separate occasions). I used the larger instance because I initially thought the instance size was the issue.

I made the modifications to the file that is downloaded to _sourcedir by the notebook 2_train_model/train_model.ipynb.

Since the TPU name cannot be obtained from SageMaker (I think; I don't actually know for sure), use_tpu=True cannot be used in SageMaker. There are also two other reasons why I think I cannot use the TPU in SageMaker. First, it actually has GPUs and not TPUs. Second, according to the TensorFlow documentation, the resolver used in the line below is only supported on Google Cloud.

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(FLAGS.tpu_name)

TPUClusterResolver supports the following distinct environments: Google Compute Engine, Google Kubernetes Engine, and Google internal.

So the workarounds I tried fall into two categories.

  1. The following modification is for the p2.xlarge. Since it has only one GPU, I left use_tpu=False and modified the code to the following:
    strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0"])

    But the error remained unchanged. (The error is shown at the end.)

  2. The following modification is for the p2.8xlarge. Since it has 8 GPUs, I left use_tpu=False and modified the code to the following:
    strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1","/gpu:2","/gpu:3","/gpu:4","/gpu:5","/gpu:6","/gpu:7"])

    Despite these changes, I still didn't get a single pulse of GPUUtilization.
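Rather than hard-coding device names for each instance type, the device list can be built from whatever GPUs TensorFlow reports as visible. The following is a minimal sketch, assuming TensorFlow 2.x; `gpu_device_names` and `build_strategy` are hypothetical helper names and are not part of the repository or the Object Detection API.

```python
def gpu_device_names(num_visible_gpus):
    """Return device strings such as '/gpu:0', one per visible GPU."""
    return [f"/gpu:{i}" for i in range(num_visible_gpus)]


def build_strategy():
    """Pick a distribution strategy based on the GPUs TensorFlow can see.

    Hypothetical helper: this is an illustration, not the repository's code.
    """
    # Imported lazily so gpu_device_names() can be used without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        # No GPU is visible to TensorFlow, so requesting '/gpu:N' devices
        # would only produce the "not visible to TensorFlow" warning.
        print("No GPU visible to TensorFlow; falling back to CPU.")
        return tf.distribute.OneDeviceStrategy("/cpu:0")

    devices = gpu_device_names(len(gpus))
    print(f"Using MirroredStrategy on {devices}")
    return tf.distribute.MirroredStrategy(devices=devices)
```

If `build_strategy()` takes the CPU branch on a GPU instance, the problem lies in the container or TensorFlow build rather than in the strategy arguments.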


Then I tried to use the parameter server option in the estimator arguments, as follows. Still no luck!

distributions = {'parameter_server': {
                    'enabled': True}}

Error log:

WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
W1228 02:54:48.062565 139853066053440] Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1228 02:54:48.065709 139853066053440] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1228 02:54:48.066459 139853066053440] Cluster Resolver: None
INFO:tensorflow:Maybe overwriting train_steps: 3000
I1228 02:54:48.072855 139853066053440] Maybe overwriting train_steps: 3000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1228 02:54:48.073027 139853066053440] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/ StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1228 02:54:48.147808 139853066053440] From /usr/local/lib/python3.6/dist-packages/object_detection/ StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['/opt/ml/input/data/train/train.records-?????-of-00005']
I1228 02:54:48.157841 139853066053440] Reading unweighted datasets: ['/opt/ml/input/data/train/train.records-?????-of-00005']
INFO:tensorflow:Reading record datasets for input file: ['/opt/ml/input/data/train/train.records-?????-of-00005']
I1228 02:54:48.159540 139853066053440] Reading record datasets for input file: ['/opt/ml/input/data/train/train.records-?????-of-00005']
INFO:tensorflow:Number of filenames to read: 5
I1228 02:54:48.159699 139853066053440] Number of filenames to read: 5
WARNING:tensorflow:num_readers has been reduced to 5 to match input file shards.
W1228 02:54:48.159873 139853066053440] num_readers has been reduced to 5 to match input file shards.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/ parallel_interleave (from is deprecated and will be removed in a future version.
Instructions for updating:
Use `, cycle_length, block_length,` instead. If sloppy execution is desired, use ``.
W1228 02:54:48.166253 139853066053440] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/ parallel_interleave (from is deprecated and will be removed in a future version.
Instructions for updating:
Use `, cycle_length, block_length,` instead. If sloppy execution is desired, use ``.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/ DatasetV1.map_with_legacy_function (from is deprecated and will be removed in a future version.
Instructions for updating:
Use `
W1228 02:54:48.203318 139853066053440] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/ DatasetV1.map_with_legacy_function (from is deprecated and will be removed in a future version.
Instructions for updating:
Use `

For the p2.8xlarge, the only difference was that all 8 GPUs were invisible to TensorFlow. What caught my eye was the first several lines: Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
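A "not visible to TensorFlow" warning generally means that TensorFlow itself cannot find the GPU, independently of which strategy is requested. One way to narrow this down is to print what the TensorFlow build inside the training container reports. This is a minimal diagnostic sketch, assuming it is run inside the same container as the training script; `gpu_report` is a hypothetical helper name.

```python
def gpu_report():
    """Collect basic facts about the TensorFlow build and its visible GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow_installed": False}

    return {
        "tensorflow_installed": True,
        "version": tf.__version__,
        # False indicates a CPU-only TensorFlow build.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # An empty list means no GPU is visible to this process.
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is False, or `visible_gpus` is empty on a GPU instance, the likely cause is the image's TensorFlow and CUDA setup rather than the `MirroredStrategy` arguments.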

It's a really great repository and I would love to use it. But this issue keeps holding up my research, which is time-sensitive.

Can you please give me a workaround or even a fix?

@SofianHamiti @Othmane796

sofianhamiti commented 3 years ago

@t-T-s have you tried pulling the latest version of the repo? This has been fixed by adjusting the TF and TF OD API versions.

surya1011 commented 3 years ago

@SofianHamiti I built the Docker image on 20 April 2021 and still had no luck; I have the same issue. I tried setting FLAGS.use_tpu to False, but I still get the same error.

rcruzgar commented 3 years ago

Hi @t-T-s @SofianHamiti @surya1011 ,

In my case I am able to use the GPU memory with the current version of the repo on ml.p3.8xlarge. I ran 300 steps with the generated sample data, and it took 809 seconds of training time:

Step 100 per-step time 3.868s
Step 200 per-step time 0.260s
Step 300 per-step time 0.262s

These are the utilization graphs:


I see 2 problems here: 1) For the 4 GPUs, the GPU utilization is null. Is that right? 2) 809 seconds of training time for this small data sample is a considerable amount of time. Imagine training for 200k steps. Do you consider this a normal training time, or do you think it should run faster?

I also tried another instance, ml.p3.2xlarge, with 1 GPU, for the same steps and data:


2 things to highlight: 1) GPU utilization is higher. 2) Training time is shorter: 677 seconds. That is still a significant amount of time, in my opinion.

The timings for the 300 steps:

Step 100 per-step time 1.930s
Step 200 per-step time 0.970s
Step 300 per-step time 0.966s
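The two sets of timings can be compared with some quick arithmetic. This sketch assumes that each logged per-step time is the average over the preceding 100 steps and that the last logged rate would hold for a longer run; both are assumptions, and `summarize` is a hypothetical helper name.

```python
def summarize(total_seconds, per_step_times, interval=100, target_steps=200_000):
    """Split total training time into logged step time and everything else.

    Assumes each entry in per_step_times is the average per-step time over
    the preceding `interval` steps (an assumption about the log format).
    """
    logged = sum(t * interval for t in per_step_times)
    return {
        "logged_step_seconds": logged,
        "other_seconds": total_seconds - logged,
        # Extrapolation at the last logged rate; ignores startup overhead.
        "hours_for_target_steps": per_step_times[-1] * target_steps / 3600,
    }


# Figures taken from the two runs reported in this comment.
p3_8xlarge = summarize(809, [3.868, 0.260, 0.262])
p3_2xlarge = summarize(677, [1.930, 0.970, 0.966])

for name, stats in [("ml.p3.8xlarge", p3_8xlarge), ("ml.p3.2xlarge", p3_2xlarge)]:
    print(name)
    for key, value in stats.items():
        print(f"  {key}: {value:.1f}")
```

Under these assumptions, logged step time accounts for about 439 of the 809 seconds on ml.p3.8xlarge and about 387 of the 677 seconds on ml.p3.2xlarge, so a large share of both runs would be something other than training steps, and the first 100 steps are much slower than the rest. At the last logged rates, 200k steps would take roughly 14.6 hours on ml.p3.8xlarge and roughly 53.7 hours on ml.p3.2xlarge.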

Do you have a reference for the typical training time? Any suggestion about what I could be doing wrong would also be appreciated.

If you think this belongs in a separate issue, please don't hesitate to tell me.

Best regards, Rubén.