Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

GPT2-large on Colab TPU seems to time out #996

Closed bilal2vec closed 4 years ago

bilal2vec commented 4 years ago

🐛 Bug

Training gpt2-large on a Colab TPU doesn't work: the run appears to time out while the TPU processes are connecting, and the spawned processes are then killed during startup.

To Reproduce

See the colab notebook: https://colab.research.google.com/drive/1An6D3wh_H4dbmlEUHYOXZYxkH6S7VKu9
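For context, the training setup in the notebook is roughly the following (a minimal sketch assuming the 0.7.x-era num_tpu_cores Trainer argument that appears in the traceback; the class name, hyperparameters, and dummy data here are placeholders, the real code is in finetune.py in the notebook):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel


class GPT2FineTuner(pl.LightningModule):
    def __init__(self, model_name="gpt2-large"):
        super().__init__()
        # Every spawned TPU process builds its own copy of the ~774M-parameter model.
        self.model = GPT2LMHeadModel.from_pretrained(model_name)

    def forward(self, input_ids):
        # With labels passed, the LM head returns the loss as the first output.
        return self.model(input_ids, labels=input_ids)

    def training_step(self, batch, batch_idx):
        loss = self(batch[0])[0]
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=5e-5)

    def train_dataloader(self):
        # Dummy token ids just to keep the sketch self-contained.
        data = torch.randint(0, 50257, (64, 128))
        return DataLoader(TensorDataset(data), batch_size=1)


model = GPT2FineTuner("gpt2-large")
trainer = pl.Trainer(num_tpu_cores=8, max_epochs=1)  # spawns 8 processes via xmp.spawn
trainer.fit(model)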

This is the relevant part of the stack trace:

INFO:root:training on 8 TPU cores
2020-03-02 00:43:14.794597: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.857680: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.918609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.974498: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.031540: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.087601: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.142553: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
E0302 00:43:22.445484458    1536 server_chttp2.cc:40]        {"created":"@1583109802.445465277","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1583109802.445463004","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:57271"}]}
2020-03-02 00:43:24.109498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.429623: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.712988: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.731491: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.867584: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.883436: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:25.112841: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
INFO:root:INIT TPU local core: 0, global rank: 0
2020-03-02 00:44:11.382078: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
INFO:root:INIT TPU local core: 2, global rank: 2
INFO:root:INIT TPU local core: 5, global rank: 5
2020-03-02 00:44:15.925331: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'pl.Trainer.run_pretrain_routine': Socket closed (14)
Traceback (most recent call last):
  File "finetune.py", line 129, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 976, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGKILL

Expected behavior

The code works when training gpt2 (124M) but fails when training gpt2-large (774M).

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.17.5
[pip3] torch==1.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect

Additional context

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution, great first issue!

williamFalcon commented 4 years ago

@bkkaggle try again using the latest version.

bilal2vec commented 4 years ago

I updated the colab notebook and the error remains, but it looks like it's because pytorch/xla loads the data into every spawned process, causing an OOM (https://github.com/pytorch/xla/issues/1280#issuecomment-548607522).
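For anyone who hits the same thing: xmp.spawn runs the setup once per TPU core, so anything built inside the spawned function (or the LightningModule constructor) ends up in host RAM eight times. A rough sketch of one mitigation, building the heavy model once in the parent and spawning with the fork start method so the children initially share the weights copy-on-write (illustrative only, not the notebook's actual code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import GPT2LMHeadModel

# Built ONCE in the parent process. With start_method="fork" the children share
# these pages copy-on-write instead of re-loading the checkpoint eight times
# (pages still get duplicated once a child writes to them, e.g. when the
# weights are moved to the device).
MODEL = GPT2LMHeadModel.from_pretrained("gpt2-large")


def _mp_fn(index):
    device = xm.xla_device()
    model = MODEL.to(device)
    # Build dataloaders lazily here, per process, rather than up front.
    ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")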

Closing

williamFalcon commented 4 years ago

@dlibenzi fyi.

@bkkaggle maybe file a bug in xla repo?

dlibenzi commented 4 years ago

It's likely the kernel OOM killer triggering this. Colab VMs have limited memory and cores, so they cannot run very large workloads. We will be changing the Cloud TPU architecture in the coming months, and after that the Colab VM should have much more memory and more cores.
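A quick way to confirm it is host RAM (rather than a real connection timeout) is to watch memory while the processes spawn. A minimal sketch with psutil, which should already be available in the Colab VM:

import threading
import time

import psutil


def log_host_memory(interval_s=5):
    # If usage climbs toward 100% right before the SIGKILL, the kernel
    # OOM killer is the likely culprit.
    while True:
        mem = psutil.virtual_memory()
        print(f"host RAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB ({mem.percent}%)")
        time.sleep(interval_s)


threading.Thread(target=log_host_memory, daemon=True).start()
# ...then call trainer.fit(model) as usual.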

williamFalcon commented 4 years ago

@srush fyi

srush commented 4 years ago

Yup, this is what I saw as well. You need enough RAM to have the model loaded 8 times.
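Back of the envelope, counting fp32 weights only (no optimizer state, gradients, or activations):

params = 774e6        # gpt2-large parameter count
bytes_per_param = 4   # fp32
copies = 8            # one model per TPU-core process
print(params * bytes_per_param * copies / 1e9)  # ~24.8 GB

That alone is roughly twice the ~12 GB of host RAM a standard Colab instance has, so the OOM kill is expected for gpt2-large, while gpt2 (124M, about 4 GB for 8 copies) squeaks by.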