zeeshansayyed opened this issue 5 years ago
When you have a really fast GPU, a single Python thread can be relatively slow at pushing computation to all GPUs when the batch size is small (GPU computation time is short). The parallel API is an attempt to improve that. If training time is not your bottleneck, you don't need to use it.
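For reference, here is a minimal sketch of how that interface is typically wired up for training. It assumes the Parallelizable/Parallel classes from gluonnlp.utils.parallel in v0.x; ParallelNet is an illustrative name, and model, loss_fn, dataloader and trainer are assumed to be defined elsewhere.

import mxnet as mx
from mxnet import gluon, autograd
from gluonnlp.utils.parallel import Parallelizable, Parallel

class ParallelNet(Parallelizable):
    def __init__(self, model, loss_fn):
        self._model = model
        self._loss = loss_fn

    def forward_backward(self, x):
        # one data shard per call; runs in a worker thread
        data, label = x
        with autograd.record():
            out = self._model(data)
            loss = self._loss(out, label)
        loss.backward()
        return loss

ctx = [mx.gpu(0), mx.gpu(1)]          # illustrative two-GPU setup
parallel_net = ParallelNet(model, loss_fn)
parallel = Parallel(len(ctx), parallel_net)

for batch in dataloader:
    data_list = gluon.utils.split_and_load(batch[0], ctx)
    label_list = gluon.utils.split_and_load(batch[1], ctx)
    # push one shard per GPU; worker threads run forward_backward concurrently
    for shard in zip(data_list, label_list):
        parallel.put(shard)
    # collect one result per shard before the optimizer step
    losses = [parallel.get() for _ in range(len(ctx))]
    trainer.step(batch[0].shape[0])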
Thanks, this makes sense. I guess the best way to know whether you need it is to implement both and see whether you gain any performance. Other than that, are there any rules of thumb on GPU type and batch size that could help me decide without actually implementing both? I am using V100s available on EC2.
You're right. We might work on a simpler API that doesn't require users to inherit from the interface, so that only one or two lines of code need to change, but we haven't started on that yet. What is the network architecture and model parameter size? Is your model a hybrid block?
My network is on the bigger side. I have BERT as the encoder (which is hybrid), but the overall network is not hybrid.
Hi @eric-haibin-lin, is it possible to use the parallel module (modified to include only a forward pass) to do a parallel forward pass? What I need to do is something like this (demo pseudo-code):
from mxnet import gluon
from mxnet.gluon import HybridBlock

class SomeNet(HybridBlock):
    def __init__(self, SomeArguments):
        super().__init__()
        net = gluon.nn.HybridSequential()
        for _ in range(5):
            net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, padding=1))
        self.convs = net

    def forward(self, *inputs):
        # Here inputs is a list of mx.np arrays of different spatial sizes,
        # all on the SAME context (gpu)
        outs = []
        for x, conv in zip(inputs, self.convs):
            outs.append(conv(x))
        return outs
I just found your answer and I will test customizing the parallel function you guys provide (thank you!!), but I would appreciate your expert advice.
Edit: my main question is whether I need to use with autograd.record() inside the forward loop if I don't calculate a loss function. I am planning to replace the forward_backward of Parallel with a simple forward (and the corresponding definitions).
Regards, Foivos
@feevos you can adapt the parallel function for the purpose of pushing work to the GPUs more efficiently. The main purpose of that interface is to push work with multiple threads using multithreaded queues; see https://github.com/dmlc/gluon-nlp/blob/v0.x/src/gluonnlp/utils/parallel.py#L125-L139. For forward-only use, you only need to implement the forward part, in which case autograd.record isn't needed.
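For concreteness, a forward-only variant could look like the sketch below. It keeps the same Parallelizable/Parallel interface but implements forward_backward to run only the forward pass, so there is no autograd.record and no backward; ForwardOnlyNet is an illustrative name, and model and shards (inputs already placed on their target contexts) are assumed to exist.

import mxnet as mx
from gluonnlp.utils.parallel import Parallelizable, Parallel

class ForwardOnlyNet(Parallelizable):
    def __init__(self, model):
        self._model = model

    def forward_backward(self, x):
        # the interface still calls forward_backward, but only forward runs here
        return self._model(x)

parallel = Parallel(len(shards), ForwardOnlyNet(model))
for shard in shards:
    parallel.put(shard)
outputs = [parallel.get() for _ in range(len(shards))]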
Thank you very much @szha !!!
Question about the library
The machine translation example with transformers uses parallel. How is this different from the automatic parallelism that is built into MXNet? When should one use it?
Thanks