@Bimlesh759-AI In the long run, for parallel training of a deep learning model you will want to make use of every available resource, both parallel (multiple GPUs on one machine) and distributed (multiple servers with GPUs) computing. So far, there are two options that support both parallel and distributed training: TensorFlow 2's distribution strategies and Horovod.
Since the code for the DIN model uses TensorFlow 1.x, some effort is needed for either path. For TF 2, the current TF 1.x code has to be upgraded to TF 2. Horovod supports TF 1.x, but you still need to do some work to make it run; it is also the option used on Amazon SageMaker. Rough sketches of both paths are given below.
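A minimal sketch of the TF 2 path, assuming the DIN code has already been ported to Keras; the model and data here are placeholders, not the actual DIN network. `MirroredStrategy` handles multiple GPUs on one machine, and `MultiWorkerMirroredStrategy` is the analogous strategy for multiple servers:

```python
import tensorflow as tf

# One replica per visible GPU; use tf.distribute.MultiWorkerMirroredStrategy()
# instead when training across several servers.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model built inside the scope has its variables mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder data; replace with the real DIN input pipeline.
features = tf.random.normal([1024, 64])
labels = tf.cast(tf.random.uniform([1024, 1]) > 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(256)

model.fit(dataset, epochs=1)
```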
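And a minimal sketch of the Horovod path, which keeps TF 1.x-style graph code (again with a placeholder model, not the real DIN graph). Each process owns one GPU and is launched with `horovodrun -np <num_gpus> python train.py`:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()  # only needed if running this graph code under TF 2
hvd.init()

# Pin each Horovod process to a single GPU.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.compat.v1.placeholder(tf.float32, [None, 64])
y = tf.compat.v1.placeholder(tf.float32, [None, 1])
logits = tf.compat.v1.layers.dense(x, 1)
loss = tf.compat.v1.losses.sigmoid_cross_entropy(y, logits)

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = tf.compat.v1.train.AdamOptimizer(1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        batch_x = np.random.randn(256, 64).astype(np.float32)      # placeholder batch
        batch_y = np.random.randint(0, 2, (256, 1)).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```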
Most importantly, you should compare how much gain (speed-up) each approach actually delivers. Conceptually, both of them should work for parallel training; in practice, the speed-up needs to be evaluated quantitatively, e.g. by comparing throughput on 1 GPU versus N GPUs (see the sketch below).
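A rough sketch of how that comparison could be quantified, assuming a hypothetical `run_steps` closure that runs a fixed number of training steps over a known number of examples:

```python
import time

def measure_throughput(run_steps, num_examples):
    """Time a fixed training workload and return examples processed per second."""
    start = time.perf_counter()
    run_steps()                      # e.g. a closure running N training steps
    elapsed = time.perf_counter() - start
    return num_examples / elapsed

# Run the same workload with 1 GPU and with N GPUs, then compare:
#   speedup            = throughput_n_gpus / throughput_1_gpu
#   scaling_efficiency = speedup / n_gpus   (1.0 means perfectly linear scaling)
```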