shizhediao opened this issue 4 years ago
did you have some issues in mind?
Actually, no. I have fixed a lot of issues and I think the training process of the current repo is going well. Due to financial constraints, I could not finish training to reproduce the exact results, so I'm not entirely sure; discussion is welcome.
Hi @shizhediao,
Have you run any of the code on TPUs so far? If so, could you please share some logs from your experiments?
It's worth noting there's another TPU implementation where they claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for way more VMs than a Tensorflow implementation would, to establish as baseline given our difficulties getting the compare_gan BigGAN to reach high quality.
@gwern Thanks for sharing. However, I found that simply wrapping the original PyTorch BigGAN into a TPU-enabled version seems to be very slow, since some ops require context switching between CPU and TPU (e.g. interpolate2d in torch-xla 1.6 and 1.7). The repo https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan trains CelebA with a small batch size (a global batch size of 64, whereas ImageNet needs a global batch size >= 512). Also, when computing FID/IS, something goes wrong and the TPU ends up sitting idle.
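For anyone trying to confirm these CPU fallbacks, here is a minimal sketch (not from the repos above) using torch-xla's debug metrics report, which lists ops that fall back to CPU as `aten::` counters; the tensor shapes and the use of `interpolate` are just illustrative:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(8, 3, 64, 64, device=device)

# interpolate is one of the ops reported above as triggering CPU<->TPU
# context switches in some torch-xla versions.
y = F.interpolate(x, scale_factor=2, mode="nearest")
xm.mark_step()  # force execution of the pending XLA graph

# Counters whose names start with "aten::" indicate ops executed on CPU
# instead of being lowered to the TPU.
print(met.metrics_report())
```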
BTW, what do you mean by 'Tensorfork'? If you know of other BigGAN implementations for TPUs, please keep me posted. Thanks!
Hi everyone, I implemented three TPU-enabled PyTorch training repos for BigGAN, all of which are based on this repo:
- BigGAN-PyTorch-TPU-Single: training BigGAN with a single TPU.
- BigGAN-PyTorch-TPU-Parallel: parallel (multi-thread) version for training BigGAN with TPUs.
- BigGAN-PyTorch-TPU-Distribute: distributed (multi-process) version for training BigGAN with TPUs.
I have checked the training process and it seems normal. There may still be some potential issues (sorry, I'm a novice at TPU training). Pull requests fixing any of them would be appreciated, and discussion is welcome.
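For comparison, the multi-process (distributed) variant roughly follows the standard torch-xla `xmp.spawn` pattern. Below is a hypothetical minimal sketch, with a dummy model and dataset standing in for BigGAN and the real data pipeline; it is not the actual code of those repos:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()

    # Stand-in for the BigGAN generator/discriminator; replace with the real models.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

    # Dummy dataset standing in for the ImageNet/CelebA loader.
    data = torch.utils.data.TensorDataset(torch.randn(512, 128), torch.randn(512, 1))
    loader = torch.utils.data.DataLoader(data, batch_size=64)

    # ParallelLoader moves batches to the TPU device asynchronously.
    para_loader = pl.ParallelLoader(loader, [device])
    for x, y in para_loader.per_device_loader(device):
        optimizer.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        # optimizer_step all-reduces gradients across the TPU cores.
        xm.optimizer_step(optimizer)

if __name__ == "__main__":
    # One process per TPU core on a v2-8/v3-8.
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```

The multi-thread (parallel) variant differs mainly in how the replicas are launched, while the single-TPU variant drops the spawning and gradient all-reduce entirely.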