danielhers / tupa

Transition-based UCCA Parser
https://danielhers.github.io/tupa
GNU General Public License v3.0

How to improve GPU utilization #104

Closed · zhao1402072392 closed this issue 4 years ago

zhao1402072392 commented 4 years ago

First of all, thanks for sharing this, sir, it's really helpful to me. But it's too slow on my device (GeForce RTX 2080 Ti). With the default parameters, GPU usage is around 30%. I then revised some parameters, adding --dynet-gpu --minibatch-size 2048 --no-dynet-autobatch, but there is no change in GPU usage. What should I do to increase GPU utilization? Thanks again ^_^

danielhers commented 4 years ago

Hi @zhao1402072392, very good question! Training and running TUPA have always been very slow, and are not much faster on GPU, even when using DyNet autobatch. I think the best way to make it faster is to batch instances (sentences/passages) together, as done in many other parsers, such as https://github.com/DreamerDeo/HIT-SCIR-CoNLL2019. Unfortunately, I never got around to it. I would be very grateful if you had time to give it a shot and make a pull request, though!

zhao1402072392 commented 4 years ago

Hi sir, I don't understand what "batch instances (sentences/passages) together" means. Could you point me to the specific file in https://github.com/DreamerDeo/HIT-SCIR-CoNLL2019 that implements the mechanism you mentioned? Thanks!

danielhers commented 4 years ago

I mean that the inputs (embeddings, features) for multiple sentences/passages (a batch of, say, 8) are joined together into one tensor, and the transitions are run in parallel for all of them. This is relatively easy in PyTorch, which the HIT-SCIR parser uses (https://pytorch.org/docs/stable/data.html). DyNet also supports batching, of course (https://dynet.readthedocs.io/en/latest/minibatch.html), but the way TUPA works right now is sequential: at every point it only sends DyNet the data for a single transition in a single sentence.

Parallelizing across transitions is not possible due to the sequential nature of the transition system (the decision about which transition to take depends on the state, which is affected by the transitions taken before), but separate sentences are independent of each other and can in principle be handled in parallel. The HIT-SCIR parser uses a Stack LSTM, which makes this efficient. TUPA currently doesn't use a Stack LSTM, only a window-based feature template, but it should still be possible to batch, and it would make TUPA faster.
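For illustration, here is a minimal sketch of DyNet's minibatch API (this is not TUPA code; the vocabulary size, dimensions, and feature indices are made up):

```python
# Minimal DyNet minibatching sketch -- illustrative sizes, not TUPA code.
import dynet as dy

VOCAB, DIM, NUM_TRANSITIONS = 1000, 64, 40  # hypothetical sizes
pc = dy.ParameterCollection()
emb = pc.add_lookup_parameters((VOCAB, DIM))
W = pc.add_parameters((NUM_TRANSITIONS, DIM))

dy.renew_cg()
# One feature index per sentence: lookup_batch returns a single expression
# with batch dimension 4, instead of four separate lookup calls.
x = dy.lookup_batch(emb, [3, 17, 42, 7])
scores = W * x                 # transition scores for all four sentences at once
print(scores.npvalue().shape)  # (NUM_TRANSITIONS, 4) -- batch dim comes last
```

With this, one forward pass costs a single matrix multiplication for the whole batch instead of one per sentence.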

Specifically, the code in https://github.com/danielhers/tupa/blob/master/tupa/parse.py would need to be updated to handle a batch of passages at a time instead of just one. The BatchParser object is a step in that direction, but it still just loops over individual members of the batch and calls predict() for each one separately.
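To sketch what that restructuring might look like (hypothetical code, not TUPA's actual API; ParserState, extract_features() and score_batch() are made-up stand-ins):

```python
# Hypothetical lockstep-batching sketch, not TUPA's actual classes.
import numpy as np

FEAT_DIM, NUM_TRANSITIONS = 128, 40  # illustrative sizes

class ParserState:
    """Toy stand-in for TUPA's per-passage parser state."""
    def __init__(self, num_steps):
        self.steps_left = num_steps

    @property
    def finished(self):
        return self.steps_left == 0

    def extract_features(self):
        return np.random.rand(FEAT_DIM)  # placeholder for window-based features

    def apply(self, transition):
        self.steps_left -= 1  # real code would update stack/buffer/graph

def score_batch(features):
    """Placeholder for a single batched classifier call (DyNet or PyTorch)."""
    return np.random.rand(features.shape[0], NUM_TRANSITIONS)

def parse_batch(states):
    # Instead of looping over passages and calling predict() on each one,
    # advance every unfinished passage by one transition per iteration,
    # so the classifier is called once per step for the whole batch.
    while True:
        active = [s for s in states if not s.finished]
        if not active:
            return states
        features = np.stack([s.extract_features() for s in active])
        scores = score_batch(features)       # one call for the whole batch
        for state, row in zip(active, scores):
            state.apply(int(row.argmax()))   # per-passage greedy state update

parse_batch([ParserState(5) for _ in range(8)])  # a batch of 8 passages
```

The state updates still happen per passage, but the expensive neural scoring is batched, which is where the GPU time goes.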