deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Add (experimental) multi-gpu to Kur. #33

Closed antho-rousseau closed 7 years ago

antho-rousseau commented 7 years ago

The "parallel" option in backend specification of the kurfile is either 1 (single GPU) or > 1 (multi GPU).

Like:

backend: &backend
  name: keras
  backend: tensorflow
  parallel: 2

This will only work with the keras backend running on tensorflow.
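
For context, here is a minimal sketch of the data-parallel pattern this option enables (illustrative only, not Kur's exact implementation; make_parallel_sketch is a made-up name, and a single-output model is assumed): each GPU runs a replica of the model on a slice of the batch, and the per-GPU outputs are concatenated back together on the CPU.

import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model

def slice_batch(x, n_gpus, part):
    # Take the `part`-th of `n_gpus` equal slices along the batch axis.
    size = tf.shape(x)[0] // n_gpus
    return x[part * size:(part + 1) * size]

def make_parallel_sketch(model, gpu_count):
    # Build one "tower" per GPU, each seeing 1/gpu_count of the batch.
    towers = []
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            slices = []
            for x in model.inputs:
                shape = (None,) + tuple(x.get_shape().as_list())[1:]
                slices.append(Lambda(
                    slice_batch,
                    output_shape=lambda s, shape=shape: shape,
                    arguments={'n_gpus': gpu_count, 'part': i})(x))
            towers.append(model(slices))
    # Merge the per-GPU results back into a single batch on the CPU.
    with tf.device('/cpu:0'):
        merged = concatenate(towers, axis=0)
    return Model(inputs=model.inputs, outputs=merged)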

ajsyp commented 7 years ago

I am so excited to try this out. Let me test a few things on our cluster. This is an awesome inclusion, @antho-rousseau!

ajsyp commented 7 years ago

I can confirm that this works for some simple models, but it outright breaks for others. It's odd, frankly: it's as if Keras does not handle TimeDistributed wrappers correctly when the model is being parallelized, mapping them "too deep" when outputs = model(inputs) is called in make_parallel().

Here is code to recreate the problem.

This is the output.

I modified Keras Dense.get_output_shape_for() in keras/layers/core.py to print out extra debugging information, and this is what I see.

Interestingly, it looks like the Dense layer inside the TimeDistributed wrapper is being mapped onto the last dimension, rather than one dimension forward. This may be a bug in Keras, and we may need to file an issue over there.
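
For reference, a toy example (hypothetical, not taken from the gist) of the shape contract TimeDistributed is supposed to honor:

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# Dense(10) should be applied independently at each of the 32 timesteps:
# (batch, 32, 32) -> (batch, 32, 10).
model = Sequential()
model.add(TimeDistributed(Dense(10), input_shape=(32, 32)))
print(model.output_shape)  # expected: (None, 32, 10)

Under parallelization, the mapping instead appears to land one level deeper, as if the per-timestep feature axis were itself a time axis.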

ajsyp commented 7 years ago

Can you confirm?

ajsyp commented 7 years ago

Just to double-check: the model itself is valid. If you use it without parallelism, it works fine:

# ... create the model, as in the gist

import numpy

# 100 random samples shaped (32, 32), matching the model's input.
x = numpy.random.uniform(low=-1, high=1, size=(100, 32, 32))
y = model.predict_on_batch(x)
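
For contrast, wrapping that same model with Kur's make_parallel (from kur/utils/parallelism.py) is what triggers the shape error, and it happens during graph construction, before any data is fed; a sketch:

from kur.utils.parallelism import make_parallel

# Raises the shape error before the fix below; works after it.
parallel_model = make_parallel(model, 2)
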
ajsyp commented 7 years ago

The bug is caused by the wrong output shape being declared for the Lambda function:

$ git diff kur/utils/parallelism.py
diff --git a/kur/utils/parallelism.py b/kur/utils/parallelism.py
index 6dca8f6..4176fc0 100644
--- a/kur/utils/parallelism.py
+++ b/kur/utils/parallelism.py
@@ -57,7 +57,7 @@ def make_parallel(model, gpu_count):
                                # Slice each input into a piece 
                                # for processing on this GPU
                                for x in model.inputs:
-                                       input_shape = tuple(x.get_shape().as_list())[1:]
+                                       input_shape = (None, ) + tuple(x.get_shape().as_list())[1:]
                                        slice_n = Lambda(slice_batch, 
                                                lambda shape: input_shape, 
                                                arguments={'n_gpus':gpu_count, 'part':i})(x)
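
The contract this fixes (a toy sketch, independent of Kur): when Lambda's output_shape argument is a function, Keras expects it to return the full shape including the batch axis, whereas a plain tuple excludes it. Returning only the per-sample shape from the function therefore drops a rank, which is exactly what pushed TimeDistributed's Dense onto the wrong dimension:

from keras.layers import Input, Lambda

x = Input(shape=(32, 32))                        # actual shape: (None, 32, 32)
per_sample = tuple(x.get_shape().as_list())[1:]  # (32, 32) -- batch axis gone
full_shape = (None,) + per_sample                # (None, 32, 32)

# Before the fix: Keras reads (32, 32) as (batch=32, features=32),
# so downstream layers see a tensor one rank too shallow.
y_bad = Lambda(lambda t: t, output_shape=lambda s: per_sample)(x)
# After the fix: the declared shape matches the tensor's real rank.
y_ok = Lambda(lambda t: t, output_shape=lambda s: full_shape)(x)
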
antho-rousseau commented 7 years ago

@ajsyp Thanks for your help on this, Adam! I've just committed the fix; I saw no change in GPU speedup, so it seems OK to me now.

ajsyp commented 7 years ago

Here is the fix for the wait_for_compile issue that causes a SIGFPE with > 2 GPUs:

diff --git a/kur/backend/keras_backend.py b/kur/backend/keras_backend.py
index 7aa3271..a3279e4 100644
--- a/kur/backend/keras_backend.py
+++ b/kur/backend/keras_backend.py
@@ -651,7 +651,7 @@ class KerasBackend(Backend):

                provider = BatchProvider(
                        sources=dict(zip(model.provider.keys, model.provider.sources)),
-                       batch_size=2,
+                       batch_size=2*self.parallel,
                        num_batches=1,
                        randomize=False
                )
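
The arithmetic behind the crash (a sketch; that the failing division comes from the batch slicing above is my reading, not confirmed in this thread): the dummy batch used to wait for compilation gets split evenly across replicas, so a fixed batch_size of 2 leaves some replicas with zero samples once there are more than two GPUs:

# Samples per replica when the compile batch is sliced across GPUs.
for gpus in (1, 2, 4):
    old = 2 // gpus              # 2, 1, 0 -- a zero-sample slice crashes
    new = (2 * gpus) // gpus     # always 2 samples per replica
    print(gpus, old, new)
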
antho-rousseau commented 7 years ago

Thanks! Glad you had boxes with > 2 GPUs to check this out! :)