Futurewei-io / blue-marlin

Blue Marlin is a critical web infrastructure for advertising based monetization. It is a cloud platform that adds intelligence to a plain Ad System.
Apache License 2.0

[BLUEMARLIN-25] : Multiple GPU support for DIN lookalike model training #46 #263

Open srinaginikonduru opened 2 years ago

srinaginikonduru commented 2 years ago

Bimlesh759-AI

  1. The current DIN lookalike model training does not support multiple GPUs. We have two GPUs available, but training always uses only one. Training should use all available GPUs.
  2. Alternatively, can the script be migrated to TensorFlow 2.0? That version has APIs for using all available GPUs.
srinaginikonduru commented 2 years ago

jimmylao

@Bimlesh759-AI In the long run, for parallel training of deep learning models, you may want to make use of all available resources, covering both parallel (multiple GPUs) and distributed (multiple servers with GPUs) computing. So far, there are two options that support both parallel and distributed training:

  1. TensorFlow 2
  2. Uber's open source project - Horovod
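For option 1, a minimal sketch of what single-server multi-GPU training could look like with `tf.distribute.MirroredStrategy` in TF 2. This is not the DIN lookalike code: the tiny Keras model, the random data, and the names `global_batch_size` and `train_on_all_gpus` are placeholders made up for illustration.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Batch size per training step when every replica (GPU) gets per_replica_batch examples."""
    if per_replica_batch <= 0 or num_replicas <= 0:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch * num_replicas


def train_on_all_gpus(per_replica_batch=64, epochs=1):
    """Train a placeholder model on every GPU visible to TensorFlow 2."""
    # Imported here so the helper above can be used without TensorFlow installed.
    import numpy as np
    import tensorflow as tf  # assumes TF 2.x

    # MirroredStrategy replicates the model on each visible GPU
    # (it falls back to a single device when no GPU is found).
    strategy = tf.distribute.MirroredStrategy()
    batch = global_batch_size(per_replica_batch, strategy.num_replicas_in_sync)

    # Random stand-in data; the real job would read the lookalike training set.
    features = np.random.rand(1024, 16).astype("float32")
    labels = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch)

    # Variables must be created inside the strategy scope to be mirrored.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(16,)),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")

    return model.fit(dataset, epochs=epochs, verbose=0)
```

The batch size is scaled by the number of replicas because the strategy splits each global batch across the GPUs.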

Since the DIN model code uses TensorFlow 1.x, some effort is needed with either option. For TF 2, the current TF 1.x code needs to be upgraded to TF 2. Horovod supports TF 1.x, but you still need to put in some effort to make it work; it is the option for Amazon SageMaker.
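For the Horovod route, a rough sketch of the changes a TF 1.x training script typically needs, following the pattern in the Horovod documentation. The toy regression graph, the helper `scaled_learning_rate`, and the function name `train_with_horovod` are assumptions for illustration, not the DIN trainer.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Scale the learning rate by the worker count, as the Horovod docs suggest."""
    return base_lr * num_workers


def train_with_horovod(base_lr=0.01, steps=100):
    """Run a placeholder TF 1.x training loop with one Horovod process per GPU."""
    # Imported here so the helper above can be used without these packages.
    import tensorflow as tf           # assumes TF 1.x
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to a single GPU, chosen by its local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy graph standing in for the DIN model.
    x = tf.random_normal([64, 16])
    y = tf.random_normal([64, 1])
    w = tf.get_variable("w", shape=[16, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

    # Wrap the optimizer so gradients are averaged across all workers.
    opt = tf.train.AdamOptimizer(scaled_learning_rate(base_lr, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    hooks = [
        # Start every worker from rank 0's initial variable values.
        hvd.BroadcastGlobalVariablesHook(0),
        tf.train.StopAtStepHook(last_step=steps),
    ]
    with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

With two GPUs on one machine, a script containing this function (say a hypothetical `train.py` that calls `train_with_horovod()`) would be launched with something like `horovodrun -np 2 python train.py`.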

Most importantly, you should compare how much gain (speed-up) each approach achieves. Conceptually, both should work for parallel training; in practice, their speed-up needs to be evaluated quantitatively and compared.
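One simple way to make that comparison concrete is to time the same training job on one GPU and on N GPUs, then compute the speed-up and the scaling efficiency. The helper names below are hypothetical, and `train_fn` stands for whichever training entry point is being measured.

```python
import time


def time_run(train_fn, *args, **kwargs):
    """Return the wall-clock seconds taken by one call to train_fn."""
    start = time.perf_counter()
    train_fn(*args, **kwargs)
    return time.perf_counter() - start


def speedup(baseline_seconds, parallel_seconds):
    """Speed-up of the parallel run relative to the single-GPU baseline."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / parallel_seconds


def scaling_efficiency(baseline_seconds, parallel_seconds, num_devices):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup(baseline_seconds, parallel_seconds) / num_devices
```

For example, with illustrative (not measured) timings of 100 s on one GPU and 60 s on two, `speedup(100.0, 60.0)` is about 1.67 and `scaling_efficiency(100.0, 60.0, 2)` is about 0.83. Both runs should use the same data, number of epochs, and global batch size for the comparison to be fair.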