aws-samples / amazon-sagemaker-tensorflow-object-detection-api

Train and deploy models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
MIT No Attribution

GPU is not utilized when training. #8

Closed t-T-s closed 3 years ago

t-T-s commented 3 years ago

I am using SSD MobileNet V2 FPNLite 320x320 from the TensorFlow 2 detection model zoo, which is supported for GPU training. I have made several attempts to get it to show any GPUUtilization in CloudWatch, but no luck! Here is what I have tried.

I tried both ml.p2.xlarge and ml.p2.8xlarge (on two separate occasions). I initially switched to the larger instance because I thought instance size was the issue.

I made the modifications to the model_main_tf2.py file that is downloaded to _sourcedir by the notebook 2_train_model/train_model.ipynb.

Since the TPU name cannot be obtained from SageMaker (as far as I can tell; I don't know for sure), use_tpu=True cannot be used in SageMaker. There are also two other reasons why I think I cannot use a TPU in SageMaker. First, the instance actually has GPUs, not a TPU. Second, the cluster resolver used in model_main_tf2.py is only supported on Google Cloud according to the TensorFlow documentation:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(FLAGS.tpu_name)

TPUClusterResolver supports the following distinct environments: Google Compute Engine, Google Kubernetes Engine, and Google internal.
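For context, the strategy selection in model_main_tf2.py looks roughly like this (paraphrased from the TF Object Detection API; the exact code may differ between versions), so with use_tpu=False and a single worker the script should already default to a MirroredStrategy over every GPU TensorFlow can see:

if FLAGS.use_tpu:
  # Only resolvable in Google Cloud environments, not on SageMaker.
  resolver = tf.distribute.cluster_resolver.TPUClusterResolver(FLAGS.tpu_name)
  tf.config.experimental_connect_to_cluster(resolver)
  tf.tpu.experimental.initialize_tpu_system(resolver)
  strategy = tf.distribute.experimental.TPUStrategy(resolver)
elif FLAGS.num_workers > 1:
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
else:
  # Default on a single SageMaker GPU instance: mirror across all visible GPUs.
  strategy = tf.compat.v2.distribute.MirroredStrategy()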

So the workarounds I tried fall into two categories.

  1. The following modification is for the p2.xlarge. Since it has only one GPU, I left use_tpu=False and modified the code to the following:
strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0"])

But the error remained unchanged. (The error log is included at the end.)

  2. The following modification is for the p2.8xlarge. Since it has 8 GPUs, I left use_tpu=False and modified the code to the following:
    strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1","/gpu:2","/gpu:3","/gpu:4","/gpu:5","/gpu:6","/gpu:7"])

    Despite these changes, I still didn't get a single pulse of GPUUtilization.

[screenshot: CloudWatch GPUUtilization graph]

Then I tried enabling the parameter server through the estimator arguments, as follows. Still no luck!

distributions = {'parameter_server': {'enabled': True}}
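For reference, this is roughly how that dictionary is passed to the SageMaker TensorFlow estimator (a sketch only; the entry point, role, versions, and instance settings below are placeholders rather than the repo's actual values, and newer releases of the SageMaker Python SDK name the argument distribution instead of distributions):

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='model_main_tf2.py',     # placeholder entry point
    role='<execution-role-arn>',         # placeholder IAM role
    instance_count=1,
    instance_type='ml.p2.8xlarge',
    framework_version='2.2',             # placeholder framework version
    py_version='py37',
    distribution={'parameter_server': {'enabled': True}},
)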

Error log:

===TRAINING THE MODEL==
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
W1228 02:54:48.062565 139853066053440 cross_device_ops.py:1318] Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1228 02:54:48.065709 139853066053440 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1228 02:54:48.066459 139853066053440 model_main_tf2.py:105] Cluster Resolver: None
INFO:tensorflow:Maybe overwriting train_steps: 3000
I1228 02:54:48.072855 139853066053440 config_util.py:552] Maybe overwriting train_steps: 3000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1228 02:54:48.073027 139853066053440 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1228 02:54:48.147808 139853066053440 deprecation.py:339] From /usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['/opt/ml/input/data/train/train.records-?????-of-00005']
I1228 02:54:48.157841 139853066053440 dataset_builder.py:148] Reading unweighted datasets: ['/opt/ml/input/data/train/train.records-?????-of-00005']
INFO:tensorflow:Reading record datasets for input file: ['/opt/ml/input/data/train/train.records-?????-of-00005']
I1228 02:54:48.159540 139853066053440 dataset_builder.py:77] Reading record datasets for input file: ['/opt/ml/input/data/train/train.records-?????-of-00005']
INFO:tensorflow:Number of filenames to read: 5
I1228 02:54:48.159699 139853066053440 dataset_builder.py:78] Number of filenames to read: 5
WARNING:tensorflow:num_readers has been reduced to 5 to match input file shards.
W1228 02:54:48.159873 139853066053440 dataset_builder.py:86] num_readers has been reduced to 5 to match input file shards.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W1228 02:54:48.166253 139853066053440 deprecation.py:339] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W1228 02:54:48.203318 139853066053440 deprecation.py:339] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()

For the p2.8xlarge, the only difference was that all 8 GPUs were reported as not visible to TensorFlow. What caught my eye were the first few lines: Some requested devices in tf.distribute.Strategy are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0
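To separate a driver/CUDA problem from a strategy-configuration problem, this is a quick check that could be run at the top of the training script (a minimal sketch, not part of the repo). If it prints an empty list, TensorFlow's CUDA build cannot see the GPUs at all, and no MirroredStrategy device list will help:

import tensorflow as tf

# If no GPUs are listed here, the failure is below TensorFlow (driver/CUDA/image),
# not in the distribution strategy configuration.
print('TF version:', tf.__version__)
print('Visible GPUs:', tf.config.list_physical_devices('GPU'))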

It's really a great repository and I would love to use it, but this issue keeps blocking my research, which is time sensitive for me.

Can you please suggest a workaround, or even a fix?

@SofianHamiti @Othmane796

sofianhamiti commented 3 years ago

@t-T-s have you tried pulling the latest version of the repo? This was fixed by adjusting the TF and TF OD API versions.

surya1011 commented 3 years ago

@SofianHamiti I tried building the Docker image on 20 April 2021, but still no luck; I have the same issue. I have also tried setting FLAGS.use_tpu to False, but I am still getting the same error.
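In case it helps the diagnosis, here is a sketch of a check that can be run inside the container to see which CUDA/cuDNN versions the installed TensorFlow wheel was built against (assumes TF 2.3 or later, where tf.sysconfig.get_build_info() is available); a mismatch with the CUDA runtime in the image is one common reason the GPUs end up invisible:

import tensorflow as tf

build_info = tf.sysconfig.get_build_info()
print('Built with CUDA:', build_info.get('cuda_version'))
print('Built with cuDNN:', build_info.get('cudnn_version'))
print('CUDA-enabled build:', build_info.get('is_cuda_build'))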

rcruzgar commented 3 years ago

Hi @t-T-s @SofianHamiti @surya1011 ,

In my case, I am able to use GPU memory with the current version of the repo on ml.p3.8xlarge. I am running 300 steps with the generated sample data, and it takes 809 training seconds:

Step 100 per-step time 3.868s
Step 200 per-step time 0.260s
Step 300 per-step time 0.262s
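As a rough sanity check (assuming each figure above is the average per-step time over the preceding 100 steps, which is an assumption on my part), the training loop itself accounts for only part of the 809 seconds; the rest would be container startup, data download, graph building, and checkpoint export:

# Rough breakdown under the assumption above; values taken from the log.
loop_seconds = 100 * 3.868 + 100 * 0.260 + 100 * 0.262   # ~439 s inside the training loop
other_seconds = 809 - loop_seconds                        # ~370 s of startup and other overhead
print(round(loop_seconds), round(other_seconds))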

These are the utilization graphs:

[screenshot: CloudWatch utilization graphs for ml.p3.8xlarge]

I see two problems here: 1) With 4 GPUs, the GPU utilization is zero. Is that right? 2) 809 seconds of training time for this small data sample is a considerable amount of time. Imagine training for 200k steps. Do you consider this a normal training time, or do you think it should run faster?

I also tried other instances, in this case ml.p3.2xlarge with 1 GPU, for the same steps and data:

[screenshot: CloudWatch utilization graphs for ml.p3.2xlarge]

Two things to highlight: 1) GPU utilization is higher. 2) Training time is shorter: 677 seconds. Still a significant amount of time, in my opinion.

The timings for the 300 steps:

Step 100 per-step time 1.930s
Step 200 per-step time 0.970s
Step 300 per-step time 0.966s
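One factor that may partly explain the lower per-GPU utilization on the 4-GPU instance (my assumption, not verified against this repo): with MirroredStrategy, the batch_size in the train_config of pipeline.config is the global batch, which is split across replicas, so each GPU processes only a fraction of it per step.

# Illustration only; the numbers are hypothetical, not taken from the repo's config.
global_batch_size = 8                            # train_config.batch_size in pipeline.config
num_gpus = 4                                     # ml.p3.8xlarge
per_gpu_batch = global_batch_size // num_gpus    # each GPU sees only 2 examples per step
print(per_gpu_batch)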

Do you have a reference for the typical training time? Also, any suggestion about what I could be doing wrong would be appreciated.

If you think this should go in a separate issue, please don't hesitate to tell me.

Best regards, Rubén.