Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

xla_spawn: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method #13677

Closed · adamcatto closed this issue 2 years ago

adamcatto commented 2 years ago

🐛 Bug

I am attempting to train a model on a Colab TPU and hit this error:

WARNING:root:TPU has started up successfully with version pytorch-1.11
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/utilities.py:95: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
  category=PossibleUserWarning,
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name             | Type       | Params
------------------------------------------------
0 | backbone         | ResNet     | 23.5 M
1 | online_encoder   | NetWrapper | 33.0 M
2 | online_predictor | MLP        | 2.1 M 
3 | target_encoder   | NetWrapper | 33.0 M
------------------------------------------------
21.0 M    Trainable params
23.5 M    Non-trainable params
44.5 M    Total params
178.069   Total estimated model params size (MB)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py:1937: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  category=PossibleUserWarning,
Epoch 0:   0% 0/2 [00:00<?, ?it/s] Exception in device=TPU:3: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 99, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
    self.optimizer_idx,
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
  File "/content/scratch_refactor/src/run.py", line 42, in <module>
    pretrainer.fit(model=ssl_model, datamodule=pretraining_data)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 80, in launch
    start_method=self._start_method,
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 154, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with exit code 17

I am not sure what's going on here, as no CUDA devices are present!

From the error message it appears there is an issue with the optimizer step; I am using a custom optimizer, so I will paste the optimizer code here:

import re

import torch
from torch.optim.optimizer import Optimizer, required

# Default trust coefficient; assumed to be 0.001 as in common LARS implementations.
EETA_DEFAULT = 0.001


class LARS(Optimizer):
    """
    Layer-wise Adaptive Rate Scaling for large batch training.
    Introduced in "Large Batch Training of Convolutional Networks" by Y. You,
    I. Gitman, and B. Ginsburg. (https://arxiv.org/abs/1708.03888)
    """

    def __init__(
        self,
        params,
        lr=required,
        momentum=0.9,
        use_nesterov=False,
        weight_decay=0.0,
        exclude_from_weight_decay=None,
        exclude_from_layer_adaptation=None,
        classic_momentum=True,
        eeta=EETA_DEFAULT,
    ):
        """Constructs a LARSOptimizer.
        Args:
        lr: A `float` for learning rate.
        momentum: A `float` for momentum.
        use_nesterov: A 'Boolean' for whether to use nesterov momentum.
        weight_decay: A `float` for weight decay.
        exclude_from_weight_decay: A list of `string` for variable screening, if
            any of the string appears in a variable's name, the variable will be
            excluded for computing weight decay. For example, one could specify
            the list like ['batch_normalization', 'bias'] to exclude BN and bias
            from weight decay.
        exclude_from_layer_adaptation: Similar to exclude_from_weight_decay, but
            for layer adaptation. If it is None, it will be defaulted the same as
            exclude_from_weight_decay.
        classic_momentum: A `boolean` for whether to use classic (or popular)
            momentum. The learning rate is applied during momeuntum update in
            classic momentum, but after momentum for popular momentum.
        eeta: A `float` for scaling of learning rate when computing trust ratio.
        name: The name for the scope.
        """

        self.epoch = 0
        defaults = dict(
            lr=lr,
            momentum=momentum,
            use_nesterov=use_nesterov,
            weight_decay=weight_decay,
            exclude_from_weight_decay=exclude_from_weight_decay,
            exclude_from_layer_adaptation=exclude_from_layer_adaptation,
            classic_momentum=classic_momentum,
            eeta=eeta,
        )

        super(LARS, self).__init__(params, defaults)
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.use_nesterov = use_nesterov
        self.classic_momentum = classic_momentum
        self.eeta = eeta
        self.exclude_from_weight_decay = exclude_from_weight_decay
        # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the
        # arg is None.
        if exclude_from_layer_adaptation:
            self.exclude_from_layer_adaptation = exclude_from_layer_adaptation
        else:
            self.exclude_from_layer_adaptation = exclude_from_weight_decay

    def step(self, epoch=None, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        if epoch is None:
            epoch = self.epoch
            self.epoch += 1

        for group in self.param_groups:
            weight_decay = group["weight_decay"]
            momentum = group["momentum"]
            eeta = group["eeta"]
            lr = group["lr"]

            for p in group["params"]:
                if p.grad is None:
                    continue

                param = p.data
                grad = p.grad.data

                param_state = self.state[p]

                # TODO: get param names
                # if self._use_weight_decay(param_name):
                grad += self.weight_decay * param

                if self.classic_momentum:
                    trust_ratio = 1.0

                    # TODO: get param names
                    # if self._do_layer_adaptation(param_name):
                    w_norm = torch.norm(param)
                    g_norm = torch.norm(grad)

                    device = g_norm.get_device()

                    trust_ratio = torch.where(
                        w_norm.ge(0),
                        torch.where(
                            g_norm.ge(0),
                            self.eeta * w_norm / g_norm,
                            torch.tensor([1.0], device=device),
                        ),
                        torch.tensor([1.0], device=device),
                    ).detach()

                    scaled_lr = lr * trust_ratio
                    if "momentum_buffer" not in param_state:
                        next_v = param_state["momentum_buffer"] = torch.zeros_like(
                            p.data,
                            device=device
                        )
                    else:
                        next_v = param_state["momentum_buffer"]

                    # next_v.mul_(momentum).add_(grad, alpha=scaled_lr)
                    # Instead of the add_() with alpha above, use the form below to work
                    # around the .item() call on trust_ratio and avoid
                    # accelerator <-> CPU communication.
                    next_v.mul_(momentum).add_(torch.mul(grad, scaled_lr))  # TODO: make sure this works as expected
                    if self.use_nesterov:
                        update = (self.momentum * next_v) + (scaled_lr * grad)
                    else:
                        update = next_v

                    p.data.add_(-update)
                else:
                    raise NotImplementedError

        return loss

    def _use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True

    def _do_layer_adaptation(self, param_name):
        """Whether to do layer-wise learning rate adaptation for `param_name`."""
        if self.exclude_from_layer_adaptation:
            for r in self.exclude_from_layer_adaptation:
                if re.search(r, param_name) is not None:
                    return False
        return True
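
For reference, a minimal sketch of how a custom optimizer like this gets handed to Lightning via configure_optimizers (the module, hyperparameters, and exclusion lists below are illustrative placeholders, not the exact ones from my run):

import pytorch_lightning as pl
import torch


class SSLModule(pl.LightningModule):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def training_step(self, batch, batch_idx):
        # Toy loss so the sketch is self-contained.
        return self.layer(batch).pow(2).mean()

    def configure_optimizers(self):
        # Return the custom LARS optimizer; hyperparameters are placeholders.
        return LARS(
            self.parameters(),
            lr=0.1,
            momentum=0.9,
            weight_decay=1e-6,
            exclude_from_weight_decay=["batch_normalization", "bias"],
        )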

I am wondering whether the CUDA/Nvidia environment variables have something to do with this (see the output of os.environ in the Environment section below). Do I somehow need to remove the CUDA-related information from the environment?

Environment

Output of collect_env_details.py script:

WARNING:root:TPU has started up successfully with version pytorch-1.11
* CUDA:
    - GPU:
    - available:         False
    - version:           10.2
* Packages:
    - numpy:             1.21.6
    - pyTorch_debug:     False
    - pyTorch_version:   1.11.0+cu102
    - pytorch-lightning: 1.6.5
    - tqdm:              4.64.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - 
    - processor:         x86_64
    - python:            3.7.13
    - version:           #1 SMP Sun Apr 24 10:03:06 PDT 2022

Additional details:

environ({'CUDNN_VERSION': '8.0.5.39', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'CLOUDSDK_PYTHON': 'python3', 'LANG': 'en_US.UTF-8', 'HOSTNAME': '58bea502a81e', 'XRT_TPU_CONFIG': 'tpu_worker;0;10.39.221.146:8470', 'OLDPWD': '/', 'CLOUDSDK_CONFIG': '/content/.config', 'DATALAB_SETTINGS_OVERRIDES': '{"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip=172.28.0.2"],"debugAdapterMultiplexerPath":"/usr/local/bin/dap_multiplexer","enableLsp":true}', 'ENV': '/root/.bashrc', 'NCCL_VERSION': '2.7.8', 'TF_FORCE_GPU_ALLOW_GROWTH': 'true', 'NO_GCE_CHECK': 'False', 'PWD': '/', 'HOME': '/root', 'LAST_FORCED_REBUILD': '20220712', 'DEBIAN_FRONTEND': 'noninteractive', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'GCE_METADATA_TIMEOUT': '3', 'GLIBCPP_FORCE_NEW': '1', 'TBE_CREDS_ADDR': '172.28.0.1:8008', 'SHELL': '/bin/bash', 'GCS_READ_CACHE_BLOCK_SIZE_MB': '16', 'PYTHONWARNINGS': 'ignore:::pip._internal.cli.base_command', 'CUDA_VERSION': '11.1.1', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'COLAB_TPU_ADDR': '10.39.221.146:8470', 'SHLVL': '0', 'PYTHONPATH': '/env/python', 'TBE_EPHEM_CREDS_ADDR': '172.28.0.1:8009', 'GLIBCXX_FORCE_NEW': '1', 'PATH': '/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin', 'LD_PRELOAD': '/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4', 'TPU_NAME': 'grpc://10.39.221.146:8470', 'JPY_PARENT_PID': '42', 'TERM': 'xterm-color', 'CLICOLOR': '1', 'PAGER': 'cat', 'GIT_PAGER': 'cat', 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline', 'ENABLE_DIRECTORYPREFETCHER': '1', 'USE_AUTH_EPHEM': '1', 'PYDEVD_USE_FRAME_EVAL': 'NO'})

cc @kaushikb11 @rohitgr7 @akihironitta

awaelchli commented 2 years ago

Hey @adamcatto

This happens when CUDA functions get called in the main process before e.g. Trainer.fit(), and CUDA is then used again inside the forked worker processes during trainer.fit(). Please check that your code is not calling .to("cuda") or similar device operations when using the TPU accelerator. I see some device logic in that optimizer code.
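
For illustration, these are the kinds of device operations to look for (hypothetical lines, not taken from the code above); each one explicitly requests a CUDA device, so it forces CUDA initialization in whatever process runs it:

import torch

t = torch.zeros(4)

# Problematic with the TPU accelerator: each of these explicitly asks for a CUDA device.
# t.to("cuda")                      # explicit transfer to CUDA
# t.cuda()                          # same thing via the .cuda() shorthand
# torch.tensor([1.0], device=0)     # a bare integer device is treated as a CUDA index

# Safe alternative: derive the device from an existing tensor so new tensors follow its backend.
ok = torch.tensor([1.0], device=t.device)
print(ok.device)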

And if that doesn't help, would you mind sharing a reproducible snippet of code? You could also share a link to a copy of your Colab directly.

adamcatto commented 2 years ago

Thanks @awaelchli. It turned out that the issue was that assigning device = g_norm.get_device() did not retain the XLA device information, so initializing tensors with this device produced a CUDA tensor. I changed the assignment to device = g_norm.device and it worked. Strangely enough, removing the device arguments and initializing the tensors without a device resulted in a device mismatch, which I would have thought pytorch-lightning would handle. I wonder what the issue there is...
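
For anyone who hits the same thing, a minimal sketch of the difference (shown here on a CPU tensor; on a TPU core the tensor would live on an xla device such as xla:3):

import torch

w = torch.randn(8)
g_norm = torch.norm(w)

# Before: Tensor.get_device() returns a bare integer ordinal, and an integer passed
# as `device=` is always interpreted as a CUDA index (e.g. device=3 means cuda:3),
# so the XLA backend information is lost and CUDA gets initialized in the worker.
#   device = g_norm.get_device()
#   torch.tensor([1.0], device=device)

# After: Tensor.device is a full torch.device (cpu, cuda:N, or xla:N), so the new
# tensor stays on the same backend as g_norm.
ones = torch.tensor([1.0], device=g_norm.device)
print(ones.device)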

awaelchli commented 2 years ago

Lightning only handles transferring the data returned by your dataloaders to the device and placing the model weights on the device automatically. Any other tensors that you create, you need to move yourself. You can use the self.device attribute on the LightningModule if you need that device information.
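
For example, a minimal sketch of creating such a tensor inside a LightningModule (the module and loss here are illustrative):

import pytorch_lightning as pl
import torch


class ExampleModule(pl.LightningModule):  # illustrative module, not from this issue
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        # self.device reflects wherever Lightning placed the module
        # (cpu, cuda:N, or an xla device on TPU), so this tensor follows it.
        target = torch.zeros(batch.size(0), 1, device=self.device)
        return torch.nn.functional.mse_loss(self.layer(batch), target)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)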