apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License
11.2k stars · 1.14k forks

tc.config.set_num_gpus fails on Ubuntu 16.04 with multiple GPUs #734

Closed vade closed 6 years ago

vade commented 6 years ago

Hello

We're successfully running Turi Create 5.0b1 on Ubuntu with GPU training and outputting Core ML models. Very cool.

Our system has 5 GPUs: 2x GTX 1080 and 3x GTX 1070. Our Turi Create script sets tc.config.set_num_gpus(-1), but we never see any GPU other than the first in use.
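For reference, our training script does roughly the following (a simplified sketch; the dataset path and label-extraction step are placeholders, not our exact code):

import turicreate as tc

# Use all available GPUs (-1 means "all").
tc.config.set_num_gpus(-1)

# Placeholder path; our real pipeline loads a labeled SFrame of images.
data = tc.image_analysis.load_images('/path/to/images', with_path=True)
data['label'] = data['path'].apply(lambda p: p.split('/')[-2])

train, test = data.random_split(0.8)
model = tc.image_classifier.create(train, target='label')
model.export_coreml('MyClassifier.mlmodel')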

Is this a known issue? Is it dependent on hardware configuration?

System: Ubuntu 16.04

NVIDIA driver 390.48, CUDA 9.0

Topology:

        GPU0    GPU1    GPU2    GPU3    GPU4    CPU Affinity
GPU0     X  PHB PHB PHB PHB 0-3
GPU1    PHB  X  PHB PHB PHB 0-3
GPU2    PHB PHB  X  PHB PHB 0-3
GPU3    PHB PHB PHB  X  PHB 0-3
GPU4    PHB PHB PHB PHB  X  0-3

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|100%   39C    P2    58W / 180W |   1067MiB /  8116MiB |     37%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
|100%   36C    P8    16W / 180W |    786MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:04:00.0 Off |                  N/A |
|100%   30C    P8    19W / 230W |    734MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    Off  | 00000000:05:00.0 Off |                  N/A |
|100%   32C    P8    20W / 230W |    640MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    Off  | 00000000:09:00.0 Off |                  N/A |
|100%   29C    P8    20W / 230W |    546MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

srikris commented 6 years ago

This is not a known issue. It's unexpected. We will look into it.

vade commented 6 years ago

@srikris Thanks for the very prompt reply. Please let me know if you need any diagnostics. Happy to help.

skercher commented 6 years ago

Hi. We are seeing the same issue:

[screenshot attached: screen shot 2018-06-21 at 2 49 58 pm]

srikris commented 6 years ago

Thanks @vade and @skercher. We'll keep you posted on this.

srikris commented 6 years ago

@vade Are you using Object detection, Image Classification, or Style transfer?

vade commented 6 years ago

We are using Image Classifier. :)

TobyRoseman commented 6 years ago

@vade @skercher - I suspect this is related to batch size. In the latest release we decreased the batch size from 512 to 64. Please try increasing the batch_size parameter of image_classifier.create(...) to 512 and let us know if this solves the issue.
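For example, a minimal sketch (keep the rest of your arguments as they already are; train_data and 'label' here are placeholders):

import turicreate as tc

model = tc.image_classifier.create(train_data, target='label', batch_size=512)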

vade commented 6 years ago

Thanks @TobyRoseman - will give this a shot. I've mentioned this to our team and will let you know.

genp commented 6 years ago

A batch size of 512 didn't have any effect. Is the idea that the create function will split batches of > 512 across GPUs? Are weight updates shared across GPUs, or are the weights also divided per GPU and updated separately, a la AlexNet?
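In other words, is it the standard MXNet data-parallel pattern, roughly like the sketch below (purely illustrative on my part, not a claim about Turi Create's internals), where each batch is split across GPU contexts and a single shared set of weights is updated from the aggregated gradients?

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.utils import split_and_load

ctx_list = [mx.gpu(i) for i in range(5)]                 # our 5 GPUs
net = gluon.model_zoo.vision.resnet50_v1(classes=10)     # stand-in network
net.initialize(ctx=ctx_list)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(512, 3, 224, 224))    # one 512-image batch
label = mx.nd.random.randint(0, 10, shape=(512,))

# Split the batch across all GPUs (512 is not divisible by 5, hence even_split=False).
data_parts = split_and_load(data, ctx_list, even_split=False)
label_parts = split_and_load(label, ctx_list, even_split=False)
with autograd.record():
    losses = [loss_fn(net(x), y) for x, y in zip(data_parts, label_parts)]
for l in losses:
    l.backward()
trainer.step(512)   # one shared weight update, normalized by the full batch size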

skercher commented 6 years ago

Hi @TobyRoseman. Increasing the batch size seemed to make training a bit faster, but it still didn't utilize all our GPUs.

TobyRoseman commented 6 years ago

I am not able to reproduce this problem on Ubuntu/CUDA. All my GPUs are getting used to some degree.

Could someone who is having this issue please let me know two things:

1. How many GPUs you have.
2. The output of the following code:

import turicreate as tc
print(tc.toolkits._mxnet_utils.get_mxnet_context())

vade commented 6 years ago

oscar@MayallsObject ~/Oscar % python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import turicreate as tc
>>> print(tc.toolkits._mxnet_utils.get_mxnet_context())
[gpu(0), gpu(1), gpu(2), gpu(3), gpu(4)]
>>> 

TobyRoseman commented 6 years ago

@vade - and you have five GPUs?

vade commented 6 years ago

@TobyRoseman - thanks for the test. In your successful environment, can you please note whether you are using the Turi Create 5.0b1 release, and which CUDA version, graphics driver version, and MXNet version you are using, just for comparison's sake?

vade commented 6 years ago

@tbartelmess Correct: 2x GTX 1080, 3x GTX 1070.

vade commented 6 years ago

My nvidia-smi output from the machine where I ran the print(tc.toolkits._mxnet_utils.get_mxnet_context()) code:

oscar@MayallsObject ~/Oscar % nvidia-smi 
Wed Jun 27 16:47:14 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|100%   40C    P2    58W / 180W |   6844MiB /  8116MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
|100%   37C    P8    17W / 180W |   7230MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:04:00.0 Off |                  N/A |
|100%   33C    P8    20W / 230W |   4768MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    Off  | 00000000:05:00.0 Off |                  N/A |
|100%   34C    P8    20W / 230W |   4768MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    Off  | 00000000:09:00.0 Off |                  N/A |
|100%   32C    P8    20W / 230W |   7260MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

vade commented 6 years ago

Additional info if pertinent:

oscar@MayallsObject ~/Oscar % nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    CPU Affinity
GPU0     X  PHB PHB PHB PHB 0-3
GPU1    PHB  X  PHB PHB PHB 0-3
GPU2    PHB PHB  X  PHB PHB 0-3
GPU3    PHB PHB PHB  X  PHB 0-3
GPU4    PHB PHB PHB PHB  X  0-3

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

TobyRoseman commented 6 years ago

@vade - my version info: turicreate 5.0b1, MXNet mxnet-cu80==1.1.0, Ubuntu 16.04, CUDA release 8.0 (V8.0.44), driver version 375.66.

How many examples are in your dataset?

For the GPU which is doing the work, what is the typical utilization? For the other GPUs, is there ever any non-zero utilization?
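If it helps, one common way to sample per-GPU utilization while a job runs is (standard nvidia-smi query flags; adjust the interval to taste):

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 5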

vade commented 6 years ago

Thanks. Perhaps the issue is that we are running a different build of MXNet 1.1.0 against CUDA 9.0, with a newer driver. We spent a lot of time trying to get a variety of toolchains to 'agree' on a driver / CUDA version, since Turi Create isn't the only ML framework we use.

Examples in our datasets range from thousands to hundreds of thousands.

The typical GPU load for a single run with batch size 512 is roughly 20-30%. Running multiple concurrent training jobs pushes it to at most about 60% from what I've seen while monitoring, with essentially no utilization on any GPU other than the zeroth.

skercher commented 6 years ago

@TobyRoseman Let me know if you need anything else.

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import turicreate as tc
>>> print(tc.toolkits._mxnet_utils.get_mxnet_context())
[gpu(0), gpu(1), gpu(2), gpu(3)]

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X  NV1 NV1 NV2 0-39
GPU1    NV1  X  NV2 NV1 0-39
GPU2    NV1 NV2  X  NV1 0-39
GPU3    NV2 NV1 NV1  X  0-39

afranklin commented 6 years ago

Dupe of #741

vade commented 6 years ago

Can someone explain why this issue was closed? @afranklin ?

srikris commented 6 years ago

@vade This is a dupe of #741. I closed the more recent filing instead of the earlier one.

srikris commented 6 years ago

@afranklin: On closer inspection, these are not the same issue. @vade is not able to see multiple GPUs being used, while #741 is about saturating multiple GPUs.

vade commented 6 years ago

Would a Turi Create engineer kindly comment on whether multi-GPU training is expected to work with newer CUDA versions and drivers, and whether alternate builds of MXNet 1.1.0 are expected to work?

Thank you.

TobyRoseman commented 6 years ago

@vade - I would expect it to work with newer versions of CUDA and newer drivers. However, that has not been tested. Please clarify your second question: what do you mean by "alternate builds of MXNet 1.1.0"? Are you talking about a different release version of MXNet? Or something other than mxnet-cu80?

vade commented 6 years ago

@TobyRoseman For example, mxnet-cu90==1.1.0: the same version number, but built against CUDA 9.
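i.e. installed with something along the lines of (assuming the usual PyPI package name for the CUDA 9.0 build):

pip install mxnet-cu90==1.1.0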

TobyRoseman commented 6 years ago

@vade - thanks for clarifying. I would also expect that to work, but we have not tested it.

Could everyone else having this problem please let us know which version of CUDA you are using?

TobyRoseman commented 6 years ago

I finally got a chance to test this on a two-GPU system with CUDA 9. Using the most recent version of TuriCreate (5.0b3), I was not able to reproduce this issue; both GPUs were getting used about the same amount. I was also using mxnet-cu90==1.1.0.

Feel free to reopen if this is still an issue with the most recent version of TuriCreate.

vade commented 6 years ago

Hi Toby - thanks for testing. Can you please document the specific NVIDIA driver version and CUDA version (which CUDA 9 point release)?

Thanks! Happy to verify with your versions shortly!

cc @genp ;)

TobyRoseman commented 6 years ago

@vade - CUDA: release 9.0, V9.0.176. NVIDIA driver version: 390.30.

jidebingfeng commented 5 years ago

@TobyRoseman

@vade - my version info: turicreate 5.0b1, MXNet mxnet-cu80==1.1.0, Ubuntu 16.04, CUDA release 8.0 (V8.0.44), driver version 375.66.

How many examples are in your dataset?

For the GPU which is doing the work, what is the typical utilization? For the other GPUs, is there ever any non-zero utilization?

Will the size of the dataset affect GPU utilization?