astooke / Synkhronos

Extension to Theano for multi-GPU data parallelism
MIT License

Single batch inputs? #3

Closed mharradon closed 7 years ago

mharradon commented 7 years ago

My dataset is too large to fit into memory at once, and right now my function is written so that I pass my training examples in as inputs rather than passing just indices and using shared variables. So to get started I'd like to call the function on a single minibatch rather than use the in-memory batch functionality.

My code is basically this:

  train_func = synk.function(inputs=[self.input_var, self.batch_weights],
                             outputs=[loss,
                                      T.stack([l for l in losses.values()]),
                                      self.decoder_out_mean,
                                      self.decoder_out_logstd,
                                      self.decoder_out_a_s] + grad_response,
                             on_unused_input='warn',
                             allow_input_downcast=True,
                             updates=updates)

def wrapped_func(input_data, batch_weights):
  inputs = (input_data, batch_weights)
  inputs = train_func.build_inputs(*inputs)
  return train_func(*inputs)

Here batch_weights is (batch_size, 1) and input_data is (batch_size, 1, x, y, z).

Calling this function results in the following:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/worker.py", line 68, in worker_main
    synk_fs[sub_ID]()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/function_module.py", line 285, in __call__
    my_results = self._run_function(num_slices, output_subset)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/function_module.py", line 135, in _run_function
    self._functions.f(*my_inputs, output_subset=output_subset)
  File "/home/ubuntu/MyCode/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/ubuntu/MyCode/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/ubuntu/MyCode/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices start at 0) has shape[0] == 256, but the output's size on that axis is 16.
Apply node that caused the error: GpuElemwise{Composite{(i0 + (i1 * i2))}}[]<gpuarray>(GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{exp((i0 * i1 * i2))}}[]<gpuarray>.0, GpuReshape{2}.0)
Toposort index: 2202
Inputs types: [GpuArrayType<None>(float32, matrix), GpuArrayType<None>(float32, matrix), GpuArrayType<None>(float32, matrix)]
Inputs shapes: [(16, 4096), (16, 4096), (256, 4096)]
Inputs strides: [(16384, 4), (16384, 4), (16384, 4)]
Inputs values: ['not shown', 'not shown', 'not shown']
Outputs clients: [[InplaceGpuDimShuffle{1,0}(GpuElemwise{Composite{(i0 + (i1 * i2))}}[]<gpuarray>.0), GpuDot22(GpuElemwise{Composite{(i0 + (i1 * i2))}}[]<gpuarray>.0, dec_as_0.W), GpuDot22(GpuElemwise{Composite{(i0 + (i1 * i2))}}[]<gpuarray>.0, dec_dense_0_0.W)]]

Here I'm trying to run a batch of 16 on each of 16 GPUs (256 examples total). So my guess is that the second batch_weights argument isn't being sliced properly. Should I be calling the function with the 'num_slices' or 'batch' keyword arguments?

Thanks!

astooke commented 7 years ago

Hmm in the line where it says: Inputs shapes: [(16, 4096), (16, 4096), (256, 4096)], where is the 256 coming from? Do you have an implicit input to the function that holds some input data which should be scattered?

mharradon commented 7 years ago

256 comes from the two input variables: the batch_size is 256, which I'd like to be split up as 16 on each of 16 GPUs. I believe everything specific to the batch size is in those input variables, but I'll double check.

mharradon commented 7 years ago

Also some output sizes are set by the input batch size.

astooke commented 7 years ago

If you don't provide batch or batch_s inputs, it will compute on the full SynkData inputs, which will be divided evenly among workers. num_slices applies within each worker, and will again divide the amount of data used in each individual function call.
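For concreteness, here is a minimal sketch of that calling pattern, reusing the build_inputs/call style from the snippet at the top of the thread. The array shapes and the num_slices value are placeholders, not anything prescribed by the library:

import numpy as np

batch_size = 256  # total examples across all GPUs
input_data = np.zeros((batch_size, 1, 32, 32, 32), dtype='float32')  # placeholder x, y, z dims
batch_weights = np.ones((batch_size, 1), dtype='float32')

synk_inputs = train_func.build_inputs(input_data, batch_weights)

# No batch / batch_s argument: the full 256 examples are scattered evenly,
# so each of the 16 workers computes on 256 / 16 = 16 examples.
results = train_func(*synk_inputs)

# num_slices further divides each worker's share, so here each worker would
# run its underlying Theano function on 8 examples at a time instead of 16.
results = train_func(*synk_inputs, num_slices=2)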

mharradon commented 7 years ago

Oooh, nice! I think I've found someplace where I used the wrong batch_size in the network - I'm rerunning now.

mharradon commented 7 years ago

And if I'm understanding correctly, in the Lasagne MNIST example 'synk.all_reduce(params)' means that the results of the update functions are averaged, as opposed to averaging the gradients first and then doing RMSProp, etc.?

astooke commented 7 years ago

Thanks for your patience in trying this, I'm really hoping to get the docs up to date by late next week.

Yes, when you make a synk function, all updates apply locally only. Then calling all_reduce(params) averages the resulting parameters.

I haven't tried coding RMSProp yet, where you would want the formulas to account for values across all workers (the whole idea is to NOT change the algorithm). If you try to implement this and it turns out to be difficult or even just annoyingly different from how you do it in the serial case, please let me know. The algorithms I've used this for so far work like this: compute a raw gradient, do some other computations on it in different functions, and then manually update the parameters by setting the value using the final result.
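As a rough sketch of that local-update-then-all_reduce pattern (roughly what the Lasagne MNIST example does; the network construction, the minibatch iterator, and how often to call all_reduce are placeholders here, not something the thread pins down):

import synkhronos as synk
synk.fork()  # fork one worker per GPU (the package's examples do this before Theano touches the GPU)

import lasagne
# ... build `network`, `input_var`, `target_var`, and `loss` with Lasagne as usual ...

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.rmsprop(loss, params, learning_rate=1e-3)

train_func = synk.function(inputs=[input_var, target_var],
                           outputs=loss,
                           updates=updates)  # updates apply locally on each GPU
synk.distribute()  # ship the compiled function to the worker processes

for epoch in range(num_epochs):
    for x_batch, y_batch in iterate_minibatches(X_train, y_train):  # placeholder iterator
        synk_in = train_func.build_inputs(x_batch, y_batch)
        train_func(*synk_in)  # each worker runs its own RMSProp step on its shard
    synk.all_reduce(params)   # average the resulting per-worker parameter values

How often to synchronize (every call, every few calls, or once per epoch as above) is a separate design choice; the thread only establishes that all_reduce(params) averages the parameters after the local updates.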

mharradon commented 7 years ago

No problem, I'm just excited to have easy multi-GPU functions in Theano!

Regarding updates, that's definitely acceptable. I think there were a few papers suggesting that sort of approach is suboptimal, but I'm not really sure by how much. And if I really wanted to, I could just return the gradients and do the update myself.
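For that return-the-gradients route, a rough sketch might look like the following. It assumes the gradient outputs come back already combined across workers, and that the package offers a broadcast collective for pushing new parameter values back out; both are assumptions to check against the docs rather than anything confirmed in this thread:

import numpy as np
import theano.tensor as T

grads = T.grad(loss, params)  # raw gradients; no updates inside the synk function
grad_func = synk.function(inputs=[input_var, target_var], outputs=grads)
synk.distribute()

lr = 1e-3
synk_in = grad_func.build_inputs(x_batch, y_batch)
grad_vals = grad_func(*synk_in)  # assumed: returned gradients are reduced across workers

for p, g in zip(params, grad_vals):
    # plain SGD step applied centrally (RMSProp bookkeeping would slot in here)
    p.set_value(p.get_value() - lr * np.asarray(g))

synk.broadcast(params)  # assumed collective: push the new values back to the workers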

I think the issues I'm having are with my code - I'll report back if I find anything new.

mharradon commented 7 years ago

I've successfully executed my function! I have just a bit more bug-squashing in my own code and hopefully I'll be able to report on the performance I'm seeing.