Closed: bilal2vec closed this issue 4 years ago
Hi! Thanks for your contribution, great first issue!
@bkkaggle try again using the latest version.
I updated the Colab notebook. The error remains, but it looks like it's because pytorch/xla is loading the data into all of the processes, causing an OOM. (https://github.com/pytorch/xla/issues/1280#issuecomment-548607522)
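For reference, this is roughly the pattern the linked issue describes: `xmp.spawn` forks one process per TPU core, and anything constructed inside the per-process function (model, dataset) gets its own full copy in each process's host RAM. The snippet below is a minimal sketch of that pattern, not the notebook's exact code; the model and the training-loop details are assumptions.

```python
# Minimal sketch of the per-process pattern that multiplies host-RAM usage.
# Assumes transformers and torch_xla are installed; not the notebook's exact code.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import GPT2LMHeadModel

def _mp_fn(index):
    device = xm.xla_device()
    # Each of the 8 spawned processes executes this line independently,
    # so ~8 full copies of gpt2-large sit in host RAM at once before the
    # weights are moved to the TPU cores, which can OOM a Colab VM.
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)
    # ... build the dataloader, optimizer, and training loop here ...

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")
```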
Closing
@dlibenzi fyi.
@bkkaggle maybe file a bug in the xla repo?
It's likely the kernel OOM killer triggering this. Colab VMs have limited memory and cores, so they cannot run very large workloads. We will be changing the Cloud TPU architecture in the coming months, and after that the Colab VM should have much more memory and more cores.
@srush fyi
Yup, this is what I saw as well. You need enough RAM to have the model loaded 8 times.
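A rough back-of-the-envelope check (the numbers below are approximations, not measurements from the notebook): gpt2-large has roughly 774M parameters, so one fp32 copy of the weights is around 3 GB of host RAM, and 8 copies are around 23 GB, well above what a standard Colab VM provides.

```python
# Rough host-RAM estimate for loading gpt2-large once per TPU process.
# All numbers are approximations.
params = 774_000_000      # approximate gpt2-large parameter count
bytes_per_param = 4       # fp32 weights
n_processes = 8           # one process per TPU core
total_gib = params * bytes_per_param * n_processes / 1024**3
print(f"~{total_gib:.0f} GiB of host RAM just for the weights")  # ~23 GiB
```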
🐛 Bug
Training gpt2-large on a Colab TPU doesn't work.
To Reproduce
See the colab notebook: https://colab.research.google.com/drive/1An6D3wh_H4dbmlEUHYOXZYxkH6S7VKu9
This is the relevant part of the stack trace:
Expected behavior
The code works when training gpt2 (124M) but fails when training gpt2-large (774M).
Environment
Additional context