agemagician closed this issue 3 years ago
I am working on model parallel TPU support, it's almost ready :)
Thanks @myleott for your quick reply. We are really looking forward to it.
Do you have any timeline? We need to schedule the models we will train for our project.
There's a branch which "runs," but there's something wrong with the way we init/modify RNG state, since it seems to converge poorly compared to similar runs on GPU.
I'm hoping to dig into the discrepancy in the next couple weeks.
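Roughly, the per-replica seeding involved looks like the sketch below (illustrative only, not the actual branch code; `base_seed` and the rank-offset scheme are assumptions):

```python
# Illustrative sketch only (not the actual fairseq branch code): the kind of
# per-replica RNG initialization that has to match the GPU code path on TPU.
import random

import numpy as np
import torch
import torch_xla.core.xla_model as xm


def init_rng(base_seed: int) -> None:
    """Seed Python, NumPy, and PyTorch so each XLA replica gets a distinct,
    reproducible stream; base_seed is an assumed training argument."""
    seed = base_seed + xm.get_ordinal()   # offset by the replica's rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    xm.set_rng_state(seed, device=xm.xla_device())  # seed the RNG on the XLA device too
```

If these offsets, or a later re-seed inside the training loop, differ from what the GPU path does, data ordering and dropout masks diverge, which could show up as exactly the kind of convergence gap described here.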
Perfect, we hope it will be finished soon. We will put it into our schedule, since it should be ready on your side soon.
Any update about the TPU compatibility?
Hey, unfortunately this will be a bit delayed. We’re migrating to use fairscale as the backend for this particular code, so we’ll need to update the code there for TPU support.
Because you did not set the "model-parallel-size" argument when using RoBERTa, you were not using intra-layer model parallelism.
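For example, a Megatron-style run has to set it explicitly, roughly like the sketch below (the data path and sizes are placeholders, and the arch/criterion names are from fairseq's model-parallel examples as I recall them, so check them against `fairseq-train --help` on your version):

```python
# Rough sketch of a Megatron-style fairseq launch; the data path and sizes are
# placeholders, and the arch/criterion names should be verified for your
# fairseq version.
import subprocess

cmd = [
    "fairseq-train", "/path/to/data-bin",            # placeholder: binarized dataset
    "--task", "language_modeling",
    "--arch", "transformer_lm_megatron",             # model-parallel Megatron LM arch
    "--model-parallel-size", "8",                    # degree of intra-layer model parallelism
    "--criterion", "vocab_parallel_cross_entropy",   # criterion paired with model parallelism
    "--tokens-per-sample", "512",
    "--max-tokens", "2048",
]
subprocess.run(cmd, check=True)
```

A RoBERTa run, by contrast, typically uses `--task masked_lm` with an arch like `roberta_base` and leaves `--model-parallel-size` at its default of 1, so it never exercises the intra-layer model-parallel code path.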
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Looking forward to support for TPU
You can use the dev_tpu_mp branch:
I haven't tested it recently, and it may no longer work with the latest XLA code. Unfortunately this is not a direction we are prioritizing at this time, so we can't provide much support.
🐛 Bug
We are training different language models for protein sequences as part of the effort to fight Covid-19. We have already published several pretrained models trained on Summit (6k GPUs) and TPU Pods (V3-1024 and V3-512), and we are interested in training Megatron: https://github.com/agemagician/ProtTrans
We are testing Megatron training on Colab TPUs, but it fails. However, RoBERTa works fine.
To Reproduce
RoBERTa works fine:
Megatron fails:
RoBERTa working results:
Megatron errors:
Any idea how we could fix this issue?
Your reply is highly appreciated.