Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

MPS training crash: subRange.start (2) is not less than length of dimension[0] (2) #17934

Closed lidj22 closed 1 year ago

lidj22 commented 1 year ago

Bug description

I'm migrating a linear regression example to PyTorch Lightning on my M1 Mac. When I switch the device from cpu to mps, the run fails with an error and crashes back to the terminal prompt, followed by a second message that hangs and says There appear to be X leaked semaphore objects...

Note that this issue occurs only when I set accelerator = mps, and when accelerator = cpu everything runs normally. So this seems to me like an issue having to do with mps.

Switching to pytorch nightly, or restarting the computer did not resolve the issue. Hopefully this is not a trivial problem like a wrongly formatted tensor...

To reproduce this example, install the requirements listed under Environment and run python main.py.

It seems that this error message has occurred in at least one other project:

What version are you seeing the problem on?

v2.0

How to reproduce the bug

# main.py

import lightning.pytorch as pl
from multiprocessing import cpu_count
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

class LinearRegressionDataModule(pl.LightningDataModule):
    def __init__(self, batch_size, a, b):
        super().__init__()
        self.batch_size = batch_size
        self.a = a
        self.b = b

    def setup(self, stage):
        xx = np.linspace(0, 1, 100)
        yy = self.a * xx + self.b

        # process data.
        xx = torch.tensor(xx, dtype=torch.float32)
        yy = torch.tensor(yy, dtype=torch.float32)

        # combine tensor
        xyxy = torch.stack((xx, yy)).T
        self.xyxy = xyxy

    def train_dataloader(self):
        loader = DataLoader(self.xyxy, batch_size=self.batch_size, num_workers=cpu_count())
        return loader

class LinearRegressionNet(pl.LightningModule):
    def __init__(self, learning_rate):
        super().__init__()
        self.l1 = nn.Linear(1, 1)
        self.criterion = nn.MSELoss()
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.l1(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.reshape(-1, 1)
        y = y.reshape(-1, 1)
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        return loss

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=self.learning_rate)

def main():

    accelerator = "mps"  # the crash occurs with "mps"; "cpu" runs normally
    learning_rate = 0.01
    num_epochs = 10
    batch_size = 2
    a, b = 2, -1

    model = LinearRegressionNet(learning_rate)
    trainer = pl.Trainer(
        accelerator=accelerator,
        devices=1,
        max_epochs=num_epochs,
    )
    dm = LinearRegressionDataModule(batch_size, a, b)
    trainer.fit(model, datamodule=dm)

if __name__ == "__main__":
    main()

Error messages and logs

(venv) user@Laptop lightning-linear-regression % python main.py
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/user/Documents/Github/hello-world/lightning-linear-regression/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(

  | Name      | Type    | Params
--------------------------------------
0 | l1        | Linear  | 2     
1 | criterion | MSELoss | 0     
--------------------------------------
2         Trainable params
0         Non-trainable params
2         Total params
0.000     Total estimated model params size (MB)
Epoch 0:   0%|  

/AppleInternal/Library/BuildRoots/2acced82-df86-11ed-9b95-428477786501/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:84: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: subRange.start (2) is not less than length of dimension[0] (2)'
zsh: abort      python main.py
(venv) user@Laptop lightning-linear-regression % /opt/homebrew/Cellar/python@3.10/3.10.12/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 40 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Environment

# requirements.txt
torch >= 2.0
numpy >= 1.0
lightning >= 2.0

More info

After switching to pytorch nightly it seems the followup error (semaphore leak) has disappeared.

cc @justusschock

awaelchli commented 1 year ago

@lidj22 Thanks for reporting. Have you tried to run your script with PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py ...? You mentioned switching your existing PyTorch code to Lightning, so I take it that you are reporting this because the previous code was working fine on MPS?

lidj22 commented 1 year ago

@awaelchli Hi, I ran PYTORCH_ENABLE_MPS_FALLBACK=1 but the error remains the same.

Yes, the code was previously written in only pytorch, and encountered no issues.

awaelchli commented 1 year ago

Can you share it here so I can see what the difference is between the raw pytorch and the Lightning converted code you posted above?

lidj22 commented 1 year ago

Sure; I didn't keep the previous torch code, so this is just rewritten from the code I provided earlier:

# main.py

from multiprocessing import cpu_count
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

class LinearRegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 1)

    def forward(self, x):
        return self.l1(x)

def load_data(weight, bias):
    xx = np.linspace(0, 1, 100)
    yy = weight * xx + bias
    xx = torch.tensor(xx, dtype=torch.float32)
    yy = torch.tensor(yy, dtype=torch.float32)
    xx = xx.reshape(-1, 1)
    yy = yy.reshape(-1, 1)
    return xx, yy

def main():

    # device
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS.")
    else:
        device = torch.device("cpu")
        print("Using CPU.")

    learning_rate = 0.01
    num_epochs = 10
    a, b = 2, -1

    # model criterion optimizer
    model = LinearRegressionNet()
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    # start training
    model.to(device)
    for epoch in range(num_epochs):
        xx, yy = load_data(a, b)
        xx, yy = xx.to(device), yy.to(device)

        # reset
        optimizer.zero_grad()

        # inference
        yy_hat = model(xx)
        loss = criterion(yy_hat, yy)

        # grad
        loss.backward()
        optimizer.step()

        # # for debug
        # with torch.no_grad():
        #     print("loss: ", loss.cpu())

if __name__ == "__main__":
    main()

awaelchli commented 1 year ago

@lidj22 Are you aware that in your code you're indexing incorrectly into your batch? Your code only works by coincidence because batch size is 2.

When you do

x, y = batch

you are splitting the tensor named "batch" of shape [2, 2] into two tensors x and y, but the splitting happens along dimension 0, which is the batch dimension and can vary in size. What you intend to do is to split along dimension 1, which has a fixed size of 2: one entry for x and one for y.
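As a rough illustration of the difference (a minimal sketch, not code from the thread; the batch of 4 samples and the use of `unbind` are assumptions chosen for the demo), unpacking a [4, 2] tensor fails, while splitting along dimension 1 works for any batch size:

```python
import torch

# Hypothetical batch: 4 samples, each row is (x, y), so the shape is [4, 2].
batch = torch.arange(8, dtype=torch.float32).reshape(4, 2)

# Plain tuple unpacking iterates over dimension 0 (the batch axis).
try:
    x, y = batch
    unpack_ok = True
except ValueError:
    # 4 rows cannot be unpacked into 2 names.
    unpack_ok = False

# Splitting along dimension 1 gives one tensor per column,
# independent of how many rows the batch has.
x, y = batch.unbind(dim=1)
print(unpack_ok, x.shape, y.shape)
```

With a batch size of exactly 2, the first unpacking succeeds but silently returns the two rows (two samples) instead of the x and y columns.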

You can easily see this mistake if you increase your batch size to a value other than 2. To fix this, you can for example insert this line in the setup method:

xyxy = torch.stack((xx, yy)).T
xyxy = [(data[0], data[1]) for data in xyxy]  # <-- add this
self.xyxy = xyxy

This way, the "batch" you receive in training_step will be a pair (x, y), where x and y each have shape [batch_size, ...]. Then when you do x, y = batch the unpacking will work. With this modification, I get the example to run normally.
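To check the resulting batch shapes outside of Lightning, something like the following sketch can be used. It rebuilds the data from the original post with a plain DataLoader; the batch size of 8 and the variable names are assumptions picked for illustration:

```python
import torch
from torch.utils.data import DataLoader

# Same data as in the original post: y = 2 * x - 1 on 100 points.
xx = torch.linspace(0, 1, 100)
yy = 2 * xx - 1
xyxy = torch.stack((xx, yy)).T  # shape [100, 2]

# Each dataset item is now an (x, y) pair instead of a length-2 tensor.
dataset = [(row[0], row[1]) for row in xyxy]

# Batch size 8 is an arbitrary value other than 2, chosen for the demo.
loader = DataLoader(dataset, batch_size=8)

# The default collate function batches each position of the pair separately,
# so the batch unpacks into x and y regardless of batch size.
x, y = next(iter(loader))
print(x.shape, y.shape)

# The last batch is smaller (100 is not divisible by 8) and still unpacks.
last_x, last_y = list(loader)[-1]
print(last_x.shape, last_y.shape)
```

An alternative with the same effect would be torch.utils.data.TensorDataset(xx, yy), which also returns one tuple per sample.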

lidj22 commented 1 year ago

Thanks for clearing this up! Batching and indices have always been a pain point for me. I guess the part I missed was that each item returned by __getitem__ must be a tuple, which I can see from inspecting the differences in the returned output.

>>> xx = np.array([1, 2])
>>> yy = -xx
>>> xx, yy = torch.tensor(xx, dtype=torch.float32), torch.tensor(yy, dtype=torch.float32)
>>> xyxy = torch.stack((xx, yy)).T
>>> xyxy
tensor([[ 1., -1.],
        [ 2., -2.]])
>>> [(data[0], data[1]) for data in xyxy]
[(tensor(1.), tensor(-1.)), (tensor(2.), tensor(-2.))]
>>> xyxy[0]
tensor([ 1., -1.])
>>> [(data[0], data[1]) for data in xyxy][0]
(tensor(1.), tensor(-1.))
>>> 

The code works now, thanks again.

awaelchli commented 1 year ago

I'm glad it was useful. Cheers!