shizhediao opened this issue 4 years ago
did you have some issues in mind?
Actually, no. I have fixed a lot of issues and I think the training process of the current repo is going well. Due to financial constraints, I could not finish training to reproduce the exact results, so I'm not entirely sure; discussion is welcome.
Hi @shizhediao,
Have you run any of the code on TPUs so far? If so, could you please share some logs from your experiments?
It's worth noting there's another TPU implementation where they claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for way more VMs than a Tensorflow implementation would, to establish as baseline given our difficulties getting the compare_gan BigGAN to reach high quality.
@gwern Thanks for sharing. However, I found that simply wrapping the original PyTorch BigGAN into a TPU-enabled version seems to be very slow, since some ops require context switching between CPU and TPU (e.g. interpolate2d in torch-xla 1.6 and 1.7). The repo https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan trains CelebA with a small batch size (a global batch size of 64, whereas ImageNet needs a global batch size >= 512). Also, when computing FID/IS, something goes wrong and the TPU ends up sitting idle.
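For anyone trying to confirm these CPU fallbacks, here is a minimal sketch (not from the repos above) using torch-xla's debug metrics report, which lists ops that fall back to CPU as `aten::` counters; the tensor shapes and the use of `interpolate` are just illustrative:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(8, 3, 64, 64, device=device)

# interpolate is one of the ops reported above as triggering CPU<->TPU
# context switches in some torch-xla versions.
y = F.interpolate(x, scale_factor=2, mode="nearest")
xm.mark_step()  # force execution of the pending XLA graph

# Counters whose names start with "aten::" indicate ops executed on CPU
# instead of being lowered to the TPU.
print(met.metrics_report())
```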
BTW, what do you mean by 'Tensorfork'? If you know of other BigGAN implementations for TPUs, please keep me posted. Thanks!
Hi everyone, I implemented three TPU-enabled PyTorch training repos for BigGAN, all of which are based on this repo:
- BigGAN-PyTorch-TPU-Single: training BigGAN with a single TPU.
- BigGAN-PyTorch-TPU-Parallel: parallel (multi-thread) version for training BigGAN with TPUs.
- BigGAN-PyTorch-TPU-Distribute: distributed (multi-process) version for training BigGAN with TPUs.
I have checked the training process and it seems normal. There may still be some potential issues (sorry, I'm a novice at TPU training). Pull requests fixing any of them would be appreciated, and discussion is welcome.
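For comparison, the multi-process (distributed) variant roughly follows the standard torch-xla `xmp.spawn` pattern. Below is a hypothetical minimal sketch, with a dummy model and dataset standing in for BigGAN and the real data pipeline; it is not the actual code of those repos:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()

    # Stand-in for the BigGAN generator/discriminator; replace with the real models.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

    # Dummy dataset standing in for the ImageNet/CelebA loader.
    data = torch.utils.data.TensorDataset(torch.randn(512, 128), torch.randn(512, 1))
    loader = torch.utils.data.DataLoader(data, batch_size=64)

    # ParallelLoader moves batches to the TPU device asynchronously.
    para_loader = pl.ParallelLoader(loader, [device])
    for x, y in para_loader.per_device_loader(device):
        optimizer.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        # optimizer_step all-reduces gradients across the TPU cores.
        xm.optimizer_step(optimizer)

if __name__ == "__main__":
    # One process per TPU core on a v2-8/v3-8.
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```

The multi-thread (parallel) variant differs mainly in how the replicas are launched, while the single-TPU variant drops the spawning and gradient all-reduce entirely.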