Open captainvera opened 5 years ago
Is multi-GPU training with DistributedDataParallel planned as future work?
I'm sorry, I'm not sure I understood your comment @BUAAers. Are you asking us to implement multi-GPU training with DistributedDataParallel instead of, or in addition to, implementing it with the local DataParallel class?
Hi guys, Just starting to get hands-on with your toolkit. Multi GPU training would surely be useful to speed up experiments. Is this under active development on your side? Thanks!
Hi @francoishernandez,
Yes, we paused development of this feature for a while, but as of right now it is under active development on our side. We are making some structural changes to the Kiwi framework (mostly data-loading and how we handle that, see additional context of original issue [spoiler: it was a huge bottleneck]) and intend to bump the version shortly to include both these changes and add Multi-gpu training. Stay tuned!
Meanwhile, PRs are very welcome, thanks for your fix!
Update on this.
Our initial impression was that we'd get multi-GPU for "free" by adopting PyTorch-Lightning. Unfortunately, while it does make multi-GPU much easier to support, there are issues with metrics calculation (computing corpus-level metrics when the data is sharded across GPUs) that made it harder to implement.
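To make the metrics problem concrete, here is a minimal illustration in plain Python with invented numbers (not OpenKiwi code): a corpus-level metric such as Pearson r cannot be recovered by averaging the per-GPU values, because the mean of per-shard correlations generally differs from the correlation over the full corpus. Predictions therefore need to be gathered onto one process (e.g. via `torch.distributed.all_gather`) before the metric is computed.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two shards of (predictions, gold scores), as two GPUs would see them.
# The numbers are made up purely for illustration.
shard_a = ([0.1, 0.9, 0.4], [0.2, 0.8, 0.5])
shard_b = ([0.7, 0.3, 0.6], [0.1, 0.4, 0.9])

# Naive approach: compute the metric per GPU and average the results.
avg_of_shards = (pearson(*shard_a) + pearson(*shard_b)) / 2

# Correct approach: gather all predictions first, then compute once.
corpus_r = pearson(shard_a[0] + shard_b[0], shard_a[1] + shard_b[1])

# avg_of_shards and corpus_r disagree, so per-GPU averaging is not enough.
print(avg_of_shards, corpus_r)
```

The same argument applies to any non-decomposable corpus-level metric; only metrics that are plain means over examples can be safely averaged across devices (and even then they must be weighted by shard size).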
This is not a priority for us and has been paused from our side.
In case anyone would like to take a stab at it, feel free to comment on this issue, as I have made some (non-public) progress on making it work with kiwi >=2.0.0.
Is your feature request related to a problem? Please describe. Currently, there is no way to train QE models on multiple GPUs, which could significantly speed up the training of large models. There have been several issues requesting this feature (#31, #29).
Describe the solution you'd like Ideally, it should be possible to pass several GPU IDs to the `gpu-id` YAML flag, and OpenKiwi should use all of them in parallel to train the model.

Additional context An important thing to take into account is that other parts of the pipeline might become a bottleneck when using multiple GPUs, e.g. data ingestion, tokenisation, etc.
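A hypothetical sketch of what the requested configuration could look like; the list form of `gpu-id` is an assumption of this feature request, not supported OpenKiwi syntax:

```yaml
# Current behaviour: a single device
gpu-id: 0

# Requested: a list of devices to train on in parallel (hypothetical)
# gpu-id: [0, 1, 2, 3]
```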