Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Problem with elastic launch on slurm cluster #18466

Closed Uio96 closed 1 year ago

Uio96 commented 1 year ago

Bug description

Hi team,

I tried to use elastic launch with PyTorch Lightning on a SLURM cluster (one node with multiple GPUs). The script worked fine in interactive mode but did not work when submitted as a batch job. I followed the tutorial at https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html.

The error looks like this:

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:44927 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.

Changing the port number manually did not help. When I removed the elastic launch part, everything worked. The official PyTorch tutorial also ran without problems: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series

So my guess is that the elastic launcher and PyTorch Lightning's DDP setup both try to claim the rendezvous port, and the two conflict. I am not sure if I need to set up something else. Thank you so much.

What version are you seeing the problem on?

v1.9, v2.0

How to reproduce the bug

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import os
from datetime import timedelta
import torch.distributed.launcher as dist_launcher
from torch.utils.data import Dataset, DataLoader, RandomSampler
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
from pytorch_lightning.callbacks import Callback
from uuid import uuid4
import socket

# DEVICES = [0, 1, 2, 3, 4, 5, 6, 7]
DEVICES = [0, 1]

class SimpleModel(pl.LightningModule):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x = batch.float()
        print(x)

        y = self(x)
        loss = nn.MSELoss()(y, torch.zeros_like(y))

        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=0.01)

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.tensor(np.random.randint(0, 1000, 2))
        # return torch.tensor([index, index])

    def __len__(self):
        return 10

def train():

    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset), num_workers=0)

    # Initialize the Lightning Module (your model)
    model = SimpleModel()

    # Initialize a trainer with DDP
    trainer = pl.Trainer(
        devices = DEVICES,
        accelerator="gpu",
        strategy="ddp",
        max_epochs=2,
    )

    # Start training
    trainer.fit(model, dataloader)

def elastic_train():

    elastic_parameters = dist_launcher.LaunchConfig(
            min_nodes=1, 
            max_nodes=1,
            nproc_per_node=len(DEVICES),
            rdzv_backend="c10d", 
            rdzv_endpoint= "localhost:12345", # this is the port that the trainer will use to communicate with the launcher
            run_id=f"perfect_track_{uuid4()}", # run_id just has to be globally unique
            max_restarts=0, # for fault tolerance; for testing set it to 0 (no fault tolerance)
            start_method="spawn",
        )

    dist_launcher.elastic_launch(elastic_parameters, train)()

if __name__ == "__main__":

    # No issues
    # train()

    # Problematic on slurm cluster
    elastic_train()

Error messages and logs

GPU available: True (cuda), used: True
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:37025 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 3     
---------------------------------
3         Trainable params
0         Non-trainable params
3         Total params
0.000     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:442: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
`Trainer.fit` stopped: `max_epochs=2` reached.
failed (exitcode: 1) local_rank: 1 (pid: 4105770) of fn: train (start_method: spawn)
Traceback (most recent call last):
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 455, in _poll
    self._pc.join(-1)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 371, in _wrap
    ret = record(fn)(*args_)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/private/home/yunzhil/posetrack/Test_GenPoseTrack/test_randomness_pytorch/test_ddp.py", line 64, in train
    trainer.fit(model, dataloader)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 938, in _run
    self.strategy.setup_environment()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/lightning_fabric/utilities/distributed.py", line 258, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:37025 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

Traceback (most recent call last):
  File "test_ddp.py", line 87, in <module>
    elastic_train()
  File "test_ddp.py", line 79, in elastic_train
    dist_launcher.elastic_launch(elastic_parameters, train)()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train FAILED

Environment

Current environment

* CUDA:
  - GPU: Tesla V100-SXM2-32GB (x8)
  - available: True
  - version: 11.7
* Lightning:
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.8
  - torch: 1.13.1
  - torchaudio: 0.13.1
  - torchmetrics: 1.1.1
  - torchvision: 0.14.1
* Packages: aiohttp: 3.8.5, aiosignal: 1.3.1, async-timeout: 4.0.3, attrs: 23.1.0, brotlipy: 0.7.0, certifi: 2023.7.22, cffi: 1.15.1, charset-normalizer: 2.0.4, cryptography: 41.0.2, frozenlist: 1.4.0, fsspec: 2023.9.0, idna: 3.4, lightning-utilities: 0.9.0, mkl-fft: 1.3.6, mkl-random: 1.2.2, mkl-service: 2.4.0, multidict: 6.0.4, numpy: 1.24.3, packaging: 23.1, pillow: 9.4.0, pip: 23.2.1, pycparser: 2.21, pyopenssl: 23.2.0, pysocks: 1.7.1, pytorch-lightning: 2.0.8, pyyaml: 6.0.1, requests: 2.31.0, setuptools: 68.0.0, torch: 1.13.1, torchaudio: 0.13.1, torchmetrics: 1.1.1, torchvision: 0.14.1, tqdm: 4.66.1, typing-extensions: 4.7.1, urllib3: 1.26.16, wheel: 0.38.4, yarl: 1.9.2
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.8.17
  - release: 5.4.0-124-generic
  - version: #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022

More info

No response

cc @awaelchli

awaelchli commented 1 year ago

Hey @Uio96

Lightning is designed to automatically detect the SLURM environment. If you intend to launch processes yourself, you'll need to override this detection:

from lightning.pytorch.plugins import LightningEnvironment

trainer = Trainer(..., plugins=LightningEnvironment())

Please note that while what you are doing is possible, it is not our recommended default setup for most SLURM users. See how Lightning works with SLURM in the user guide: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html
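
For readers following along, here is a minimal sketch of where that plugin goes when the processes are created by elastic_launch. It reuses SimpleModel, RandomDataset and DEVICES from the reproduction script above (so it is not standalone), and uses the pytorch_lightning import path from that script rather than the unified lightning package:

# Minimal sketch: reuses SimpleModel, RandomDataset and DEVICES from the
# reproduction script above; only the Trainer call changes. Passing
# LightningEnvironment() stops Lightning from auto-detecting SLURM, so each
# worker spawned by elastic_launch reads its rank from the environment instead.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment
from torch.utils.data import DataLoader, RandomSampler

def train():
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset), num_workers=0)
    model = SimpleModel()

    trainer = pl.Trainer(
        devices=DEVICES,
        accelerator="gpu",
        strategy="ddp",
        max_epochs=2,
        plugins=LightningEnvironment(),  # override SLURM auto-detection
    )
    trainer.fit(model, dataloader)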

Uio96 commented 1 year ago

> Hey @Uio96
>
> Lightning is designed to automatically detect the SLURM environment. If you intend to launch processes yourself, you'll need to override this detection:
>
> from lightning.pytorch.plugins import LightningEnvironment
>
> trainer = Trainer(..., plugins=LightningEnvironment())
>
> Please note that while what you are doing is possible, it is not our recommended default setup for most SLURM users. See how Lightning works with SLURM in the user guide: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html

Thank you so much. It does solve my issue.

Uio96 commented 1 year ago

The previous trick worked on a single node, but I still have trouble with multiple nodes. I made minor edits to my Python script, changing num_nodes to 2 (2 GPUs each) and passing a master node IP and port. The process got stuck for a long time and eventually returned a timeout error. Maybe I need to set some other parameters?

On the other hand, I also tried removing the elastic launcher part and followed https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html to set up the parameters for SLURM. A new issue with the rank came up, similar to the one mentioned in https://github.com/Lightning-AI/lightning/discussions/7275#discussioncomment-703240, but since the version has moved on, the old tricks no longer work with the latest release:

Traceback (most recent call last):
  File "test_ddp_no_elastic.py", line 191, in <module>
    train(args)
  File "test_ddp_no_elastic.py", line 166, in train
    trainer.fit(model, dataloader)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 194, in run
    self.setup_data()
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 237, in setup_data
    dl = _process_dataloader(trainer, dl)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 499, in _process_dataloader
    dataloader = trainer._data_connector._prepare_dataloader(dataloader, shuffle=is_shuffled, mode=stage)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 200, in _prepare_dataloader
    sampler = self._resolve_sampler(dataloader, shuffle=shuffle, mode=mode)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 211, in _resolve_sampler
    sampler = _get_distributed_sampler(
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 252, in _get_distributed_sampler
    return DistributedSampler(dataloader.dataset, **kwargs)
  File "/private/home/yunzhil/.conda/envs/ddp_env/lib/python3.8/site-packages/torch/utils/data/distributed.py", line 74, in __init__
    raise ValueError(
ValueError: Invalid rank 2, rank should be in the interval [0, 1]

Here is my updated script:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import os
from datetime import timedelta
import torch.distributed.launcher as dist_launcher
from torch.utils.data import Dataset, DataLoader, RandomSampler
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
from pytorch_lightning.callbacks import Callback
from uuid import uuid4
import socket
import argparse
from pytorch_lightning.plugins.environments import LightningEnvironment

# DEVICES = [0, 1, 2, 3, 4, 5, 6, 7]
DEVICES = [0, 1]

class SimpleModel(pl.LightningModule):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer = nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x = batch.float()
        print(x)

        y = self(x)
        loss = nn.MSELoss()(y, torch.zeros_like(y))

        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=0.01)

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return torch.tensor(np.random.randint(0, 1000, 2))
        # return torch.tensor([index, index])

    def __len__(self):
        return 10

def train(args):

    dataset = RandomDataset()

    dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset), num_workers=0)

    # Initialize the Lightning Module (your model)
    model = SimpleModel()

    # Initialize a trainer with DDP
    trainer = pl.Trainer(
        num_nodes=args.num_node,
        devices = 2,
        accelerator="gpu",
        strategy="ddp",
        max_epochs=2,
        plugins=LightningEnvironment(),
    )

    # Start training
    trainer.fit(model, dataloader)

def elastic_train(args):

    elastic_parameters = dist_launcher.LaunchConfig(
            min_nodes=args.num_node, 
            max_nodes=args.num_node,
            nproc_per_node=len(DEVICES),
            rdzv_backend="c10d", 
            rdzv_endpoint= args.rdzv_endpoint, # this is the port that the trainer will use to communicate with the launcher
            run_id=f"perfect_track_{uuid4()}", # run_id just has to be globally unique
            max_restarts=0, # for fault tolerance; for testing set it to 0 (no fault tolerance)
            start_method="spawn",
        )

    dist_launcher.elastic_launch(elastic_parameters, train)(args)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_node", type=int, default=1)
    parser.add_argument("--rdzv_endpoint", type=str, default="localhost:12345")

    args = parser.parse_args()

    # train(args)

    # # Problematic on slurm cluster
    elastic_train(args)

awaelchli commented 1 year ago

@Uio96 What is your motivation for going against Lightning's SLURM integration and doing the launching yourself? Are there any features in the elastic launcher that you need?

I suggest that you first test your custom launch method on a regular PyTorch script on multi-node, and once that works, port it over to the Lightning script.

Uio96 commented 1 year ago

> @Uio96 What is your motivation for going against Lightning's SLURM integration and doing the launching yourself? Are there any features in the elastic launcher that you need?
>
> I suggest that you first test your custom launch method on a regular PyTorch script on multi-node, and once that works, port it over to the Lightning script.

Thanks for the suggestions. I plan to use elastic launch for better management of the nodes and the dataloader workers.

I did follow the PyTorch tutorial and it worked fine in a multi-node setting on SLURM, e.g., https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py. I am new to PyTorch Lightning, so I am not sure whether I need to do something different here.

I found that PyTorch Lightning provides some tutorials for SLURM and elastic launch: https://lightning.ai/docs/pytorch/latest/levels/intermediate_level_14.html.

I followed them, but it did not work. As I mentioned in my previous post, there was 1) an issue with the sampler (no elastic launch involved) when using multiple nodes (I did not do anything special with the sampler because some older tutorials said I did not have to: https://pytorch-lightning.readthedocs.io/en/0.9.0/multi_gpu.html):

ValueError: Invalid rank 2, rank should be in the interval [0, 1]

On the other hand, there was 2) an rdzv communication error (timeout) when using elastic launch.

PyTorch Lightning keeps evolving (e.g., its API), and there is not much up-to-date material online, so I am not sure how to deal with these two issues.

awaelchli commented 1 year ago

Okay, in this case may I ask why this needs to happen programmatically? Couldn't you just use the torchrun command externally? After all, you are parsing arguments yourself anyway. You could remove this boilerplate code entirely from the script and launch it externally like so:

torchrun --nproc-per-node 2 --nnodes 2 --node-rank ... --rdzv-backend c10d --rdzv-endpoint ... --max-restarts 0 --run-id ... train_script.py 

(Don't forget to set the node rank; you also missed that in your elastic launch configuration above!)

The only caveat is that if you do this on SLURM, you'll need to suppress the auto-detection of the SLURM environment and do:

from lightning.pytorch.plugins import TorchElasticEnvironment
trainer = Trainer(plugins=TorchElasticEnvironment())

But I think this would still be cleaner.
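
For reference, here is one way this could be wired up in a SLURM batch script; treat it as a hedged sketch rather than an official recipe. train_script.py and port 12345 are placeholders, SLURM_JOB_NUM_NODES, SLURM_NODEID, SLURM_JOB_ID and SLURM_JOB_NODELIST are standard SLURM variables, and the underscore-style torchrun flags are used because they exist in torch 1.13:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node; torchrun spawns the GPU workers
#SBATCH --gres=gpu:2

# Use the first node of the allocation as the c10d rendezvous host.
export MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Single quotes so the variables are expanded inside each srun task: with one
# task per node, $SLURM_NODEID gives each node its own --node_rank.
srun bash -c 'torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=2 \
    --node_rank=$SLURM_NODEID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_NODE:12345 \
    --rdzv_id=$SLURM_JOB_ID \
    --max_restarts=0 \
    train_script.py'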

Uio96 commented 1 year ago

> Okay, in this case may I ask why this needs to happen programmatically? Couldn't you just use the torchrun command externally? After all, you are parsing arguments yourself anyway. You could remove this boilerplate code entirely from the script and launch it externally like so:
>
> torchrun --nproc-per-node 2 --nnodes 2 --node-rank ... --rdzv-backend c10d --rdzv-endpoint ... --max-restarts 0 --run-id ... train_script.py
>
> (Don't forget to set the node rank; you also missed that in your elastic launch configuration above!)
>
> The only caveat is that if you do this on SLURM, you'll need to suppress the auto-detection of the SLURM environment and do:
>
> from lightning.pytorch.plugins import TorchElasticEnvironment
> trainer = Trainer(plugins=TorchElasticEnvironment())
>
> But I think this would still be cleaner.

Thank you so much. After switching to launching with torchrun externally and using the TorchElasticEnvironment plugin, I can run the script on multiple nodes. It did not work with my previous LaunchConfig (I had assumed the two were equivalent).
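
For anyone landing here later, this is roughly what the script side of that resolution looks like; treat it as a sketch rather than a verified snippet. train_script.py is the placeholder name from the torchrun command above, and SimpleModel / RandomDataset are the toy classes from the reproduction scripts, so they are assumed to be defined alongside this code:

# train_script.py -- launched externally, e.g.:
#   torchrun --nnodes 2 --nproc_per_node 2 --node_rank <rank> \
#       --rdzv_backend c10d --rdzv_endpoint <master-host>:12345 train_script.py
# There is no elastic_launch / LaunchConfig in the script itself; torchrun sets
# RANK / LOCAL_RANK / WORLD_SIZE and the TORCHELASTIC_* variables that the
# TorchElasticEnvironment plugin reads.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import TorchElasticEnvironment
from torch.utils.data import DataLoader, RandomSampler

def main():
    dataset = RandomDataset()      # toy dataset from the reproduction above
    dataloader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset), num_workers=0)
    model = SimpleModel()          # toy model from the reproduction above

    trainer = pl.Trainer(
        num_nodes=2,               # must match --nnodes passed to torchrun
        devices=2,                 # must match --nproc_per_node
        accelerator="gpu",
        strategy="ddp",
        max_epochs=2,
        plugins=TorchElasticEnvironment(),  # suppress SLURM auto-detection
    )
    trainer.fit(model, dataloader)

if __name__ == "__main__":
    main()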