Closed mharradon closed 7 years ago
Hmm, in the line where it says Inputs shapes: [(16, 4096), (16, 4096), (256, 4096)], where is the 256 coming from? Do you have an implicit input to the function that holds some input data which should be scattered?
256 is from the two input variables - the batch_size is 256, which I'd like to be split up as 16 on each of 16 GPUs. I believe everything specific to the batch size is in those input variables, but I'll double check.
Also some output sizes are set by the input batch size.
If you don't provide batch or batch_s inputs, it will compute on the full SynkData inputs, which will be divided evenly among workers. num_slices applies within each worker, and will again divide the amount of data used in each individual function call.
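To make the division arithmetic concrete, here is a plain-NumPy sketch (not the Synkhronos API itself, just an illustration of the splitting described above): the full input divides evenly among workers, and num_slices further subdivides each worker's share.

```python
import numpy as np

def split_among_workers(data, n_workers):
    """Divide the leading (batch) axis evenly among workers."""
    return np.array_split(data, n_workers)

def slice_within_worker(worker_data, num_slices):
    """Further subdivide one worker's share for each function call."""
    return np.array_split(worker_data, num_slices)

# e.g. batch_size 256, as in the shapes above (feature dim shrunk for clarity)
full_input = np.zeros((256, 4))

per_worker = split_among_workers(full_input, 16)   # 16 GPUs -> 16 rows each
slices = slice_within_worker(per_worker[0], 4)     # num_slices=4 -> 4 rows per call

print(per_worker[0].shape, slices[0].shape)  # -> (16, 4) (4, 4)
```

So with 16 workers the (256, ...) inputs become (16, ...) on each GPU, matching the shapes in the error message above.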
Oooh, nice! I think I've found someplace where I used the wrong batch_size in the network - I'm rerunning now.
And if I'm understanding correctly, in the Lasagne MNIST example 'synk.all_reduce(params)' means that the results of the update functions are averaged, as opposed to averaging the gradients first and then doing RMSProp etc.?
Thanks for your patience in trying this, I'm really hoping to get the docs up to date by late next week.
Yes, when you make a synk function, all updates apply locally only. Then calling all_reduce(params) averages the resulting parameters.
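A plain-NumPy sketch (not Synkhronos code) of why the order matters: for plain SGD, averaging the locally updated parameters is identical to averaging the gradients first, but for an adaptive rule it generally is not.

```python
import numpy as np

w = np.array([1.0])                          # same starting parameter on both workers
grads = [np.array([0.2]), np.array([0.6])]   # per-worker gradients
lr = 0.1

# Scheme A: local SGD step on each worker, then average the parameters
# (what all_reduce(params) does after the local updates).
avg_params = sum(w - lr * g for g in grads) / 2

# Scheme B: average the gradients first, then take one SGD step.
param_from_avg_grad = w - lr * sum(grads) / 2

# For plain SGD the two schemes coincide...
assert np.allclose(avg_params, param_from_avg_grad)

# ...but for an RMSProp-style scaled step they differ, because each
# worker's scaling term sees only its own gradients.
eps = 0.01
avg_of_scaled = sum(g / np.sqrt(g ** 2 + eps) for g in grads) / 2
g_avg = sum(grads) / 2
scaled_of_avg = g_avg / np.sqrt(g_avg ** 2 + eps)
assert not np.allclose(avg_of_scaled, scaled_of_avg)
```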
I haven't tried coding RMSProp yet, where you would want the formulas to account for values across all workers (the whole idea is to NOT change the algorithm). If you try to implement this and it turns out to be difficult or even just annoyingly different from how you do it in the serial case, please let me know. The algorithms I've used this for so far work like this: compute a raw gradient, do some other computations on it in different functions, and then manually update the parameters by setting the value using the final result.
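A minimal NumPy sketch of that workflow, with hypothetical stand-in functions (the actual gradient and post-processing steps would be Theano/synk functions): compute a raw gradient, post-process it in a separate step, then manually update the parameter by setting its value.

```python
import numpy as np

w = np.array([1.0, 2.0])

def compute_grad(w):
    # stand-in for a (synk) function that returns the raw gradient
    return 2 * w

def postprocess(g):
    # stand-in for further computations on the gradient, e.g. clipping
    return np.clip(g, -1.0, 1.0)

g = compute_grad(w)      # raw gradient: [2. 4.]
g = postprocess(g)       # clipped:      [1. 1.]
w = w - 0.1 * g          # manually set the final parameter value
print(w)  # -> [0.9 1.9]
```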
No problem, I'm just excited to have easy multi-gpu functions in Theano!
Regarding updates, that's definitely acceptable. I think there were a few papers suggesting that sort of thing is suboptimal, but I'm not sure by how much. And if I really wanted to, I could just return the gradients and do the update myself.
I think the issues I'm having are with my code - I'll report back if I find anything new.
I've successfully executed my function! I have just a bit more bug-squashing in my own code and hopefully I'll be able to report on the performance I'm seeing.
My dataset is too large to fit into memory at once, and right now my function is written so that I pass in my training examples as inputs, rather than passing indices and using shared variables. So to get started I'd like to just call the function on a single minibatch rather than use the in-memory batch functionality.
My code is basically this:
Here batch_weights is (batch_size, 1) and input_data is (batch_size, 1, x, y, z).
Calling this function results in the following:
Here I'm trying to run a batch of 16 on each of 16 GPUs. So my guess is that the second batch_weights argument isn't being sliced properly. Should I be calling the function with the 'num_slices' or 'batch' keyword arguments?
Thanks!
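One quick way to spot this kind of mismatch is to check that every data input shares the same leading (batch) dimension before calling the function. A plain-NumPy sketch with hypothetical names, mirroring the shapes from the error message:

```python
import numpy as np

# Hypothetical per-worker inputs; the third one was not sliced down
# from the full batch of 256, mirroring the error's reported shapes.
inputs = {
    "input_data": np.zeros((16, 4096)),
    "targets": np.zeros((16, 4096)),
    "batch_weights": np.zeros((256, 4096)),   # mismatched leading dim
}

expected = 16  # per-worker batch size
mismatched = [name for name, arr in inputs.items()
              if arr.shape[0] != expected]
print(mismatched)  # -> ['batch_weights']
```

Any input listed in mismatched would be the one not being sliced along with the others.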