Closed antho-rousseau closed 7 years ago
I am so excited to try this out. Let me test a few things on our cluster. This is an awesome inclusion, @antho-rousseau!
I can confirm that this works for some simple models, but it outright breaks for others. It's odd, frankly: it's as if Keras does not handle TimeDistributed wrappers correctly when the model is being parallelized, mapping them "too deep" when outputs = model(inputs) is called in make_parallel().
Here is code to recreate the problem.
I modified Keras Dense.get_output_shape_for() in keras/layers/core.py to print out extra debugging information, and this is what I see.
Interestingly, it looks like the Dense layer in the TimeDistributed wrapper is being mapped onto the last dimension, rather than a single dimension forward. This may be a bug in Keras, and we may need to file an issue over there.
Can you confirm?
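To make the "mapped too deep" symptom concrete, here is a minimal sketch of the shape inference that TimeDistributed(Dense) is supposed to perform (the helper function is hypothetical, purely for illustration; it is not part of Keras or Kur):

```python
def timedistributed_dense_shape(input_shape, units):
    # Expected behavior: TimeDistributed(Dense(units)) keeps every
    # dimension except the last, which becomes `units`:
    # (batch, time, features) -> (batch, time, units)
    return input_shape[:-1] + (units,)

print(timedistributed_dense_shape((None, 32, 32), 64))
# (None, 32, 64)
```

The bug described above looks like the Dense layer is instead being applied one axis deeper than this, which is why the printed shapes from get_output_shape_for() come out wrong.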
Just to double-check, the model is valid. If you try to use it without parallelism, it will work fine:
# ... create the model, as in the gist
import numpy
x = numpy.random.uniform(low=-1, high=1, size=(100, 32, 32))
y = model.predict_on_batch(x)
The bug is caused by the wrong output shape being returned for the Lambda layer:
$ git diff kur/utils/parallelism.py
diff --git a/kur/utils/parallelism.py b/kur/utils/parallelism.py
index 6dca8f6..4176fc0 100644
--- a/kur/utils/parallelism.py
+++ b/kur/utils/parallelism.py
@@ -57,7 +57,7 @@ def make_parallel(model, gpu_count):
# Slice each input into a piece
# for processing on this GPU
for x in model.inputs:
- input_shape = tuple(x.get_shape().as_list())[1:]
+ input_shape = (None, ) + tuple(x.get_shape().as_list())[1:]
slice_n = Lambda(slice_batch,
lambda shape: input_shape,
arguments={'n_gpus':gpu_count, 'part':i})(x)
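The slice_batch function that this Lambda wraps is not shown in the diff; its batch-splitting arithmetic can be sketched with plain Python lists (a simplification — the real function operates on TensorFlow tensors inside the Lambda, so treat this only as an illustration of the splitting logic):

```python
def slice_batch(samples, n_gpus, part):
    # Split the batch (first dimension) into n_gpus contiguous pieces
    # and return piece number `part`; the last piece absorbs any
    # remainder when the batch size is not evenly divisible.
    size = len(samples) // n_gpus
    if part == n_gpus - 1:
        return samples[part * size:]
    return samples[part * size:(part + 1) * size]

batch = list(range(10))
print([slice_batch(batch, 4, i) for i in range(4)])
# [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
```

The fix above matters because the Lambda's declared output shape must keep a leading None for the (now smaller) batch dimension; dropping it shifted every downstream shape by one axis.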
@ajsyp Thanks for your help on this, Adam! I've just committed the fix; I saw no difference in GPU speedup, so it seems OK to me now.
Here is the fix for the wait_for_compile issue that causes a SIGFPE with > 2 GPUs:
diff --git a/kur/backend/keras_backend.py b/kur/backend/keras_backend.py
index 7aa3271..a3279e4 100644
--- a/kur/backend/keras_backend.py
+++ b/kur/backend/keras_backend.py
@@ -651,7 +651,7 @@ class KerasBackend(Backend):
provider = BatchProvider(
sources=dict(zip(model.provider.keys, model.provider.sources)),
- batch_size=2,
+ batch_size=2*self.parallel,
num_batches=1,
randomize=False
)
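The arithmetic behind that crash can be sketched as follows (the helper name is hypothetical, and the exact failure site inside Keras is an assumption — presumably an integer division by a zero-sample slice):

```python
def per_gpu_samples(batch_size, n_gpus):
    # Each GPU receives batch_size // n_gpus samples
    # from the warm-up batch used during compilation.
    return batch_size // n_gpus

# Old hard-coded warm-up batch_size of 2: with more than 2 GPUs,
# at least one GPU ends up with a zero-sample slice.
print([per_gpu_samples(2, n) for n in (1, 2, 4)])       # [2, 1, 0]

# The fix scales the warm-up batch with the GPU count,
# so every GPU always gets at least one sample.
print([per_gpu_samples(2 * n, n) for n in (1, 2, 4)])   # [2, 2, 2]
```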
Thanks! Glad you had boxes with more than 2 GPUs to check this out! :)
The "parallel" option in the backend section of the kurfile is either 1 (single GPU) or > 1 (multi-GPU).
Like:
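Something along these lines (a sketch only — the key names are inferred from the surrounding discussion, so double-check them against the Kur documentation):

```yaml
backend:
  name: keras
  backend: tensorflow
  parallel: 4    # use 4 GPUs; 1 means single-GPU
```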
It will only work with the TensorFlow Keras backend.