arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.
https://arrayfire.com
BSD 3-Clause "New" or "Revised" License

Documentation: Multiple GPUs #165

Open georgh opened 6 years ago

georgh commented 6 years ago

I think it would be great to have an example for using multiple GPUs.

Here is what I tried. If that's the right way to do it, then you may want to add it as an example. It seems to scale fine (tested with up to 7 GPUs), and nvidia-smi reports 96% utilization.

import time
import numpy as np
import arrayfire as af
import argparse
af.set_backend('cuda')

if __name__ == '__main__':
      parser = argparse.ArgumentParser()
      parser.add_argument('gpus', type=int)
      parser.add_argument('-runs', type=int, default=100)
      args = parser.parse_args()

      GPUS = args.gpus
      N = 5000
      runs = args.runs

      # The simple task we want to solve:
      # we have a huge list of vectors X and want to calculate the distance between all of them
      # this will result in a huge distance matrix M
      # the resulting matrix should be multiplied by a vector alpha
      X = np.random.rand(100, N)
      Alpha = np.random.rand(N,1)

      # copy the input data to every GPU once, outside the timing loop:
      xGPU = []
      alphaGPU = []
      for i in range(GPUS):
            af.set_device(i)
            x = af.to_array(X)
            xGPU.append(x)
            alpha = af.to_array(Alpha)
            alphaGPU.append(alpha)

      # elementwise subtraction, wrapped so af.broadcast can apply it across the singleton dims
      sub = lambda a,b: a - b
      print("init finished")
      for _ in range(runs):
            startTime = time.time()
            splitSize = int(np.ceil(N / GPUS))
            #print("Temp data will ocupy at least {:.2f} MB on the gpu.".format((X.shape[0] * splitSize * X.shape[1]) *8 /1024/1024))

            result = []
            for i in range(GPUS):
                  af.set_device(i)
                  x = xGPU[i]
                  alpha = alphaGPU[i]

                  # each GPU processes its own slice of the columns of X
                  start = i*splitSize
                  end = min((i+1)*splitSize, N)

                  # pairwise Euclidean distances between the slice and all columns of X
                  diff = af.broadcast(sub, af.tile(x[:,start:end],1,1,x.shape[1]), af.moddims(x,x.shape[0],1,x.shape[1]))
                  diff = af.sqrt(af.sum(af.pow(diff,2),0))
                  # multiply this block of the distance matrix by alpha
                  r = af.matmul(af.moddims(diff, diff.shape[1], diff.shape[2]), alpha)
                  result.append(r)

            total = 0
            for i in range(GPUS):
                  af.set_device(i)
                  total += af.sum(result[i])

            print("Took {} sec".format(time.time() - startTime ))
georgh commented 6 years ago

The main remaining question is how to use the CPU during the GPU computation. Do you split the work with multiprocessing, or is there an easier way?

pavanky commented 6 years ago

@georgh can you send this as a PR ?

The main remaining question is how to use the CPU during the GPU computation.

The GPU ops are asynchronous. You can do other work on the CPU as long as you don't run any synchronizing functions (af.sync or any function that copies memory back to the CPU).
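For reference, here is a minimal sketch of that overlap pattern (not from the original thread; array sizes are arbitrary). It assumes the CUDA backend and uses only standard ArrayFire calls (af.randu, af.matmul, af.sum):

import arrayfire as af

af.set_backend('cuda')

# Enqueue GPU work; these calls return almost immediately because
# ArrayFire launches its kernels asynchronously.
a = af.randu(4096, 4096)
b = af.matmul(a, a)

# The CPU is free here; do unrelated host-side work while the GPU is busy.
cpu_total = sum(i * i for i in range(1000000))

# af.sum() with no dim copies a scalar back to the host, so it blocks until
# the queued GPU work has finished (af.sync() would also force a wait).
print(af.sum(b), cpu_total)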