OlafenwaMoses / ImageAI

A python library built to empower developers to build applications and systems with self-contained Computer Vision capabilities
https://www.genxr.co/#products
MIT License
8.65k stars 2.19k forks

Training with 4 GPUs #624

Open Overdoze47 opened 3 years ago

Overdoze47 commented 3 years ago

First of all thanks for the update to Tensorflow 2.4.0. You are doing a great job!

Now to my problem:

I would like to train a YOLOv3 model with 4 GPUs. The problem is that data is loaded onto all four GPUs, but only the first one is actually used effectively during training.

[image: GPU utilization screenshot]

If I use only one Nvidia T4, one epoch currently takes about 1 hour 40 minutes. With all four GPUs, one epoch takes about 3 hours 30 minutes.

Environment: a Google Cloud instance with 4 x Nvidia T4, 16 vCPUs, 60 GB RAM, and an SSD.

Pip list:

```
Package Version
absl-py 0.11.0
asn1crypto 0.24.0
astor 0.8.1
astunparse 1.6.3
cachetools 4.2.0
certifi 2020.12.5
chardet 4.0.0
cryptography 2.1.4
cycler 0.10.0
flatbuffers 1.12
gast 0.3.3
google-auth 1.24.0
google-auth-oauthlib 0.4.2
google-pasta 0.2.0
grpcio 1.32.0
h5py 2.10.0
idna 2.10
imageai 2.1.6
importlib-metadata 2.1.1
Keras 2.4.3
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
keras-resnet 0.2.0
keyring 10.6.0
keyrings.alt 3.0
kiwisolver 1.1.0
Markdown 3.2.2
matplotlib 3.3.2
mock 3.0.5
numpy 1.19.3
oauthlib 3.1.0
opencv-python 4.2.0.32
opt-einsum 3.3.0
Pillow 7.0.0
pip 20.3.3
protobuf 3.14.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycrypto 2.6.1
pygobject 3.26.1
pyparsing 2.4.7
python-dateutil 2.8.1
pyxdg 0.25
PyYAML 5.3.1
requests 2.25.1
requests-oauthlib 1.3.0
rsa 4.6
scipy 1.4.1
SecretStorage 2.3.1
setuptools 51.1.1
six 1.15.0
tensorboard 2.4.0
tensorboard-plugin-wit 1.7.0
tensorflow 2.4.0
tensorflow-estimator 2.4.0
termcolor 1.1.0
typing-extensions 3.7.4.3
urllib3 1.26.2
Werkzeug 1.0.1
wheel 0.36.2
wrapt 1.12.1
zipp 1.2.0
```

Training script:

```python
from imageai.Detection.Custom import DetectionModelTrainer

trainer = DetectionModelTrainer()
trainer.setModelTypeAsYOLOv3()
trainer.setDataDirectory(data_directory="/opt/ki/")
trainer.setTrainConfig(
    object_names_array=["Apfel_Jonagold", "Apfel_Kanzi", "Clementine_lose", "Orange_lose",
                        "Zwiebel_lose", "Zitrone_lose", "Limetten_lose", "Schalotten_lose"],
    batch_size=16,
    num_experiments=100,
    train_from_pretrained_model="./models/pretrained.h5")
trainer.setGpuUsage([0, 1, 2, 3])
trainer.trainModel()
```
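As a quick sanity check (a minimal sketch, assuming TensorFlow 2.4; it is not part of the training script above), the following confirms that all four T4s are actually visible to the Python process before training starts:

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; if fewer than four T4s show up here,
# training cannot be distributed across them regardless of what the trainer is told.
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible")
for gpu in gpus:
    print(" ", gpu.name)
```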

Training output:

```
2021-01-09 18:37:23.734317: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Generating anchor boxes for training images and annotation...
no element found: line 1, column 0
Ignore this bad annotation: /opt/ki/train/annotations/UwJf7aXd9FhczV7DeiUbh5EK1y76cT.xml
Average IOU for 9 anchors: 0.89
Anchor Boxes generated.
Detection configuration saved in /opt/ki/json/detection_config.json
Evaluating over 3696 samples taken from /opt/ki/validation
Training over 10712 samples given at /opt/ki/train
Training on: ['Apfel_Jonagold', 'Apfel_Kanzi', 'Clementine_lose', 'Limetten_lose', 'Orange_lose', 'Schalotten_lose', 'Zitrone_lose', 'Zwiebel_lose']
Training with Batch Size: 16
Number of Training Samples: 10712
Number of Validation Samples: 3696
Number of Experiments: 100
2021-01-09 18:37:57.548894: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-09 18:37:57.549874: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-09 18:37:57.742205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.743185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:57.743301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.744248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:57.744336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.745256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:57.745350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.746301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:57.746363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-09 18:37:57.748970: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 18:37:57.749054: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 18:37:57.750273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-09 18:37:57.750577: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-09 18:37:57.753391: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-09 18:37:57.754082: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-09 18:37:57.754244: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-09 18:37:57.754354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.755377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.756389: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.757406: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.758433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.759400: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.760408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.761386: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:57.762274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2021-01-09 18:37:57.764176: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-09 18:37:58.264258: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.265257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:58.265433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.266351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:58.266484: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.267466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:58.267601: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.268569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-09 18:37:58.268624: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-09 18:37:58.268654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 18:37:58.268669: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 18:37:58.268680: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-09 18:37:58.268702: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-09 18:37:58.268723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-09 18:37:58.268747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-09 18:37:58.268769: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-09 18:37:58.268897: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.269923: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.271152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.272194: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.273238: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.274258: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.275314: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.276326: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:37:58.277314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2021-01-09 18:37:58.277378: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-09 18:38:00.353305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-09 18:38:00.353344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1 2 3
2021-01-09 18:38:00.353351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y N N
2021-01-09 18:38:00.353355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N N N
2021-01-09 18:38:00.353358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 2:   N N N Y
2021-01-09 18:38:00.353362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 3:   N N Y N
2021-01-09 18:38:00.353725: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.354758: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.355774: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.356825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.357857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.358833: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.359715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13909 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2021-01-09 18:38:00.360340: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.361368: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.362397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13968 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
2021-01-09 18:38:00.362901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.363865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.364764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 13968 MB memory) -> physical GPU (device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5)
2021-01-09 18:38:00.365196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.366213: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-09 18:38:00.367109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 13968 MB memory) -> physical GPU (device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
Training with transfer learning from pretrained Model
WARNING:tensorflow:period argument is deprecated. Please use save_freq to specify the frequency in number of batches seen.
WARNING:tensorflow:epsilon argument is deprecated and will be removed, use min_delta instead.
2021-01-09 18:38:09.320280: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2021-01-09 18:38:09.320322: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2021-01-09 18:38:09.320355: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1365] Profiler found 4 GPUs
2021-01-09 18:38:09.321404: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2021-01-09 18:38:10.110443: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2021-01-09 18:38:10.110682: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1844: UserWarning: Model.fit_generator is deprecated and will be removed in a future version. Please use Model.fit, which supports generators.
  warnings.warn('Model.fit_generator is deprecated and '
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:3504: UserWarning: Even though the tf.config.experimental_run_functions_eagerly option is set, this option does not apply to tf.data functions. tf.data functions are still traced and executed as graphs.
  "Even though the tf.config.experimental_run_functions_eagerly "
WARNING:tensorflow:Model failed to serialize as JSON. Ignoring... Layer YoloLayer has arguments in __init__ and therefore must override get_config.
2021-01-09 18:38:10.929492: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
```

How can I effectively use all four T4s to significantly speed up my training? I've been trying to get training running on four GPUs for a while now, but I can't get any further.

Thanks in advance

Edwin-Aguirre92 commented 2 years ago

Hello Overdoze47,

I have the same issue. I'm wondering if you have solved it? I actually managed to get all 4 GPUs to train, but I had to modify the source code significantly to understand what it was doing. Furthermore, when I use the data-parallelism approach with MirroredStrategy, the GPUs do not all run at full capacity, as can be seen here:

[image: GPU utilization screenshot]
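For reference, the standard TensorFlow 2.x data-parallel pattern with MirroredStrategy looks roughly like the sketch below. This is a generic, minimal example and not ImageAI's internal implementation: the toy model and random dataset are placeholders standing in for the real YOLOv3 graph and detection pipeline. The key points are that the model must be built and compiled inside strategy.scope(), and that the global batch size should be scaled with the number of replicas so each GPU receives a full per-device batch.

```python
import tensorflow as tf

# Generic TF 2.x data-parallel training sketch (not ImageAI's internal code).
# Variables are mirrored on every listed GPU and gradients are all-reduced each step.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"])
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Placeholder dataset standing in for the real detection data pipeline.
features = tf.random.normal((1024, 32))
labels = tf.random.normal((1024, 1))
global_batch = 16 * strategy.num_replicas_in_sync  # scale the batch size with the GPU count
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(global_batch)

with strategy.scope():
    # Placeholder model; the real YOLOv3 graph would be built and compiled here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(dataset, epochs=2)
```

If the global batch size is left at the single-GPU value, each replica receives only a quarter of it, which keeps the extra GPUs underutilized and can make multi-GPU training slower than single-GPU training.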