@Bimlesh759-AI In the long run, for parallel training of a deep learning model you will want to make use of every available resource, both parallel (multiple GPUs on one machine) and distributed (multiple servers with GPUs) computing. So far, there are two options that support both parallel and distributed training: TensorFlow 2's distribution strategies and Horovod.
Since the code for the DIN model uses TensorFlow 1.x, some effort is needed for either path. For TF 2, the current TF 1.x code has to be upgraded to TF 2. Horovod supports TF 1.x, but you still need to do some work to make it run; it is also the option used on Amazon SageMaker. Rough sketches of both paths are given below.
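A minimal sketch of the TF 2 path, assuming the DIN code has already been ported to Keras; the model and data here are placeholders, not the actual DIN network. `MirroredStrategy` handles multiple GPUs on one machine, and `MultiWorkerMirroredStrategy` is the analogous strategy for multiple servers:

```python
import tensorflow as tf

# One replica per visible GPU; use tf.distribute.MultiWorkerMirroredStrategy()
# instead when training across several servers.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model built inside the scope has its variables mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder data; replace with the real DIN input pipeline.
features = tf.random.normal([1024, 64])
labels = tf.cast(tf.random.uniform([1024, 1]) > 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(256)

model.fit(dataset, epochs=1)
```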
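And a minimal sketch of the Horovod path, which keeps TF 1.x-style graph code (again with a placeholder model, not the real DIN graph). Each process owns one GPU and is launched with `horovodrun -np <num_gpus> python train.py`:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()  # only needed if running this graph code under TF 2
hvd.init()

# Pin each Horovod process to a single GPU.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.compat.v1.placeholder(tf.float32, [None, 64])
y = tf.compat.v1.placeholder(tf.float32, [None, 1])
logits = tf.compat.v1.layers.dense(x, 1)
loss = tf.compat.v1.losses.sigmoid_cross_entropy(y, logits)

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = tf.compat.v1.train.AdamOptimizer(1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        batch_x = np.random.randn(256, 64).astype(np.float32)      # placeholder batch
        batch_y = np.random.randint(0, 2, (256, 1)).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```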
Most importantly, you should compare how much gain (speed-up) each approach actually delivers. Conceptually, both of them should work for parallel training; in practice, the speed-up needs to be evaluated quantitatively, e.g. by comparing throughput on 1 GPU versus N GPUs (see the sketch below).
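A rough sketch of how that comparison could be quantified, assuming a hypothetical `run_steps` closure that runs a fixed number of training steps over a known number of examples:

```python
import time

def measure_throughput(run_steps, num_examples):
    """Time a fixed training workload and return examples processed per second."""
    start = time.perf_counter()
    run_steps()                      # e.g. a closure running N training steps
    elapsed = time.perf_counter() - start
    return num_examples / elapsed

# Run the same workload with 1 GPU and with N GPUs, then compare:
#   speedup            = throughput_n_gpus / throughput_1_gpu
#   scaling_efficiency = speedup / n_gpus   (1.0 means perfectly linear scaling)
```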