zeeshansayyed opened this issue 5 years ago
When you have a really fast GPU, a single Python thread can be relatively slow at pushing computation to all GPUs when the batch size is small (GPU computation time is short). The parallel API is an attempt to improve that. If training time is not your bottleneck, you don't need to use it.
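For reference, here is a minimal sketch of how that interface is typically wired up for training. It assumes the Parallelizable/Parallel classes from gluonnlp.utils.parallel in v0.x; ParallelNet is an illustrative name, and model, loss_fn, dataloader and trainer are assumed to be defined elsewhere.

import mxnet as mx
from mxnet import gluon, autograd
from gluonnlp.utils.parallel import Parallelizable, Parallel

class ParallelNet(Parallelizable):
    def __init__(self, model, loss_fn):
        self._model = model
        self._loss = loss_fn

    def forward_backward(self, x):
        # one data shard per call; runs in a worker thread
        data, label = x
        with autograd.record():
            out = self._model(data)
            loss = self._loss(out, label)
        loss.backward()
        return loss

ctx = [mx.gpu(0), mx.gpu(1)]          # illustrative two-GPU setup
parallel_net = ParallelNet(model, loss_fn)
parallel = Parallel(len(ctx), parallel_net)

for batch in dataloader:
    data_list = gluon.utils.split_and_load(batch[0], ctx)
    label_list = gluon.utils.split_and_load(batch[1], ctx)
    # push one shard per GPU; worker threads run forward_backward concurrently
    for shard in zip(data_list, label_list):
        parallel.put(shard)
    # collect one result per shard before the optimizer step
    losses = [parallel.get() for _ in range(len(ctx))]
    trainer.step(batch[0].shape[0])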
Thanks, this makes sense. I guess the best way to know whether you need it is to implement both and see whether you gain any performance. Other than that, are there any rules of thumb on GPU type and batch size that could help me decide without actually implementing both? I am using V100s available on EC2.
You're right. We might work on a simpler API that doesn't require users to inherit from the interface, so that only one or two lines of code need to change, but we haven't started on that yet. What is the network architecture and model parameter size? Is your model a hybrid block?
My network is on the bigger side. I have BERT as the encoder (which is hybrid), but the overall network is not hybrid.
Hi @eric-haibin-lin, is it possible to use the parallel module (modified to include only a forward pass) to do a parallel forward pass? What I need to do is something like this (demo pseudo-code):
from mxnet import gluon
from mxnet.gluon import HybridBlock

class SomeNet(HybridBlock):
    def __init__(self, SomeArguments):
        super().__init__()
        net = gluon.nn.HybridSequential()
        for _ in range(5):
            net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, padding=1))
        self.convs = net

    def forward(self, *inputs):
        # Here inputs is a list of mx.np arrays of different spatial sizes,
        # all on the SAME context (gpu)
        outs = []
        for x, conv in zip(inputs, self.convs):
            outs.append(conv(x))
        return outs
I just found your answer and I will test customizing the parallel function you guys provide (thank you!!), but I would appreciate your expert advice.
Edit: my main question is whether I need to use with autograd.record() inside the forward loop if I don't calculate a loss function. I am planning to replace the forward_backward of Parallel with a simple forward (and the corresponding definitions).
Regards, Foivos
@feevos you can adapt the parallel function for the purpose of pushing work to the GPUs more efficiently. The main purpose of that interface is to push work with multiple threads using multithreaded queues; see https://github.com/dmlc/gluon-nlp/blob/v0.x/src/gluonnlp/utils/parallel.py#L125-L139. For forward-only use, you only need to implement the forward part, in which case autograd.record isn't needed.
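For concreteness, a forward-only variant could look like the sketch below. It keeps the same Parallelizable/Parallel interface but implements forward_backward to run only the forward pass, so there is no autograd.record and no backward; ForwardOnlyNet is an illustrative name, and model and shards (inputs already placed on their target contexts) are assumed to exist.

import mxnet as mx
from gluonnlp.utils.parallel import Parallelizable, Parallel

class ForwardOnlyNet(Parallelizable):
    def __init__(self, model):
        self._model = model

    def forward_backward(self, x):
        # the interface still calls forward_backward, but only forward runs here
        return self._model(x)

parallel = Parallel(len(shards), ForwardOnlyNet(model))
for shard in shards:
    parallel.put(shard)
outputs = [parallel.get() for _ in range(len(shards))]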
Thank you very much @szha !!!
Question about the library
The machine translation example with transformers uses parallel. How is this different from the automatic parallelism that is built into MXNet? When should one use it?
Thanks