I have skimmed through the papers but didn't find a detailed explanation of gradient accumulation. Please help me understand. The general, simplified training flow is:
predicted_output = model(input)                       # forward pass
loss = loss_function(predicted_output, ground_truth)  # compute loss
optimizer.zero_grad()                                 # clear old gradients
loss.backward()                                       # backpropagate
optimizer.step()                                      # update weights
However, in the code, gradients are accumulated for 10 iterations and then reset (sketched after the list below). I am wondering what positive or negative impacts it would have if I were to:
1: reset on each iteration, i.e., follow the general algorithm flow above
2: increase or decrease self.iter_size
3: add support for multi-batching and multi-GPU training
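
For context, here is a minimal sketch of the accumulate-then-step pattern in PyTorch, as I understand it. The model, data, and iter_size value are illustrative assumptions (iter_size stands in for self.iter_size); only the accumulation pattern itself is the point:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                 # toy model (assumption)
loss_function = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
iter_size = 10                           # accumulate over this many iterations

inputs = torch.randn(100, 10)            # synthetic data (assumption)
ground_truths = torch.randn(100, 1)

optimizer.zero_grad()
for i in range(100):
    predicted_output = model(inputs[i:i + 1])
    loss = loss_function(predicted_output, ground_truths[i:i + 1])
    # Scale by iter_size so the accumulated gradient approximates the
    # average gradient over one effective batch of iter_size samples.
    (loss / iter_size).backward()        # backward() adds into .grad buffers
    if (i + 1) % iter_size == 0:
        optimizer.step()                 # apply the accumulated gradients
        optimizer.zero_grad()            # reset for the next window

If I read this right, setting iter_size = 1 recovers the general flow above, and raising it trades more iterations per update for a larger effective batch size without extra memory.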
Many thanks.