@lidj22 Thanks for reporting.
Have you tried to run your script with PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py ...? You mentioned switching your existing PyTorch code to Lightning, so I take it that you are reporting this because the previous code was working fine on MPS?
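In case the shell variable isn't being picked up, here is a minimal sketch of setting it from inside the script instead; the assumption here is that the flag needs to be set before torch is first imported:

import os

# Assumption: PyTorch reads this variable when it initializes, so set it
# before the first "import torch" anywhere in the process.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported only after the flag is set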
@awaelchli Hi, I ran PYTORCH_ENABLE_MPS_FALLBACK=1, but the error remains the same.
Yes, the code was previously written in plain PyTorch and encountered no issues.
Can you share it here so I can see what the difference is between the raw PyTorch code and the Lightning-converted code you posted above?
Sure; I didn't keep the previous torch code, so this is just rewritten from the code I provided earlier:
# main.py
from multiprocessing import cpu_count

import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader


class LinearRegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 1)

    def forward(self, x):
        return self.l1(x)


def load_data(weight, bias):
    # Generate points on the line y = weight * x + bias.
    xx = np.linspace(0, 1, 100)
    yy = weight * xx + bias
    xx = torch.tensor(xx, dtype=torch.float32)
    yy = torch.tensor(yy, dtype=torch.float32)
    xx = xx.reshape(-1, 1)
    yy = yy.reshape(-1, 1)
    return xx, yy


def main():
    # device
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS.")
    else:
        device = torch.device("cpu")
        print("Using CPU.")

    learning_rate = 0.01
    num_epochs = 10
    a, b = 2, -1

    # model, criterion, optimizer
    model = LinearRegressionNet()
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    # start training
    model.to(device)
    for epoch in range(num_epochs):
        xx, yy = load_data(a, b)
        xx, yy = xx.to(device), yy.to(device)
        # reset
        optimizer.zero_grad()
        # inference
        yy_hat = model(xx)
        loss = criterion(yy_hat, yy)
        # grad
        loss.backward()
        optimizer.step()
        # # for debug
        # with torch.no_grad():
        #     print("loss: ", loss.cpu())


if __name__ == "__main__":
    main()
@lidj22 Are you aware that in your code you're indexing incorrectly into your batch? Your code only works by coincidence because the batch size is 2.
When you do
x, y = batch
you are splitting the tensor named "batch" of shape [2, 2] into two tensors x and y, but the splitting happens along dimension 0, which is the batch dimension and can vary. What you intend is to split along dimension 1, which has a fixed size of 2: one component for x and one for y.
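A quick sketch of what that unpacking actually does (the values here are illustrative, not your data):

import torch

batch = torch.tensor([[1., -1.],
                      [2., -2.]])  # shape [2, 2]: two samples, each an (x, y) row
x, y = batch   # unpacks along dim 0: x is the first *sample*, y is the second
print(x)       # tensor([ 1., -1.])  -- an (x, y) pair, not the x values
# With a batch of shape [3, 2], the same line raises
# ValueError: too many values to unpack (expected 2)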
You can easily see this mistake if you increase your batch size to a value other than 2. To fix this, you can for example insert this line in the setup method:
xyxy = torch.stack((xx, yy)).T
xyxy = [(data[0], data[1]) for data in xyxy] # <-- add this
self.xyxy = xyxy
This way, the "batch" you receive in training_step will be a tuple (x, y), where each of x and y has shape [batch_size, ...]. Then when you do
x, y = batch
the tuple unpacking will work.
With this modification, I get the example to run normally.
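For completeness, here is a minimal sketch of how the fixed pieces fit together. The class names, hyperparameters, and batch size below are illustrative (the original Lightning code isn't reproduced in this thread); the setup fix is the one shown above:

import pytorch_lightning as pl
import torch
from torch import nn, optim
from torch.utils.data import DataLoader


class LinearDataModule(pl.LightningDataModule):  # illustrative name
    def setup(self, stage=None):
        xx = torch.linspace(0, 1, 100)
        yy = 2 * xx - 1
        xyxy = torch.stack((xx, yy)).T
        xyxy = [(data[0], data[1]) for data in xyxy]  # the fix: one (x, y) tuple per sample
        self.xyxy = xyxy

    def train_dataloader(self):
        return DataLoader(self.xyxy, batch_size=4)  # any batch size works now


class LinearRegressionLit(pl.LightningModule):  # illustrative name
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 1)
        self.criterion = nn.MSELoss()

    def training_step(self, batch, batch_idx):
        x, y = batch  # batch is (x, y); each has shape [batch_size]
        y_hat = self.l1(x.unsqueeze(-1))
        return self.criterion(y_hat, y.unsqueeze(-1))

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=0.01)


trainer = pl.Trainer(accelerator="mps", devices=1, max_epochs=10)
trainer.fit(LinearRegressionLit(), datamodule=LinearDataModule())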
Thanks for clearing this up! Batching and indices have always been a pain point for me. I guess the part I missed was that the __getitem__ format must be a tuple, which I can see from inspecting the differences in the return outputs:
>>> xx = np.array([1, 2])
>>> yy = -xx
>>> xx, yy = torch.tensor(xx, dtype=torch.float32), torch.tensor(yy, dtype=torch.float32)
>>> xyxy = torch.stack((xx, yy)).T
>>> xyxy
tensor([[ 1., -1.],
[ 2., -2.]])
>>> [(data[0], data[1]) for data in xyxy]
[(tensor(1.), tensor(-1.)), (tensor(2.), tensor(-2.))]
>>> xyxy[0]
tensor([ 1., -1.])
>>> [(data[0], data[1]) for data in xyxy][0]
(tensor(1.), tensor(-1.))
The code works now, thanks again.
I'm glad it was useful. Cheers!
Bug description
I'm migrating a linear regression example to PyTorch Lightning on my M1 Mac. When I switch my device from cpu to mps, I get an error and a crash that sends me back to the terminal prompt, followed by another error that hangs and says "There appear to be X leaked semaphore objects...".
Note that this issue occurs only when I set accelerator = mps; when accelerator = cpu, everything runs normally. So this seems to me like an issue having to do with MPS. Switching to PyTorch nightly or restarting the computer did not resolve the issue. Hopefully this is not a trivial problem like a wrongly formatted tensor...
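For context, the device switch is just the Trainer accelerator flag; a minimal sketch (the epoch count and devices setting here are illustrative):

import pytorch_lightning as pl

trainer_cpu = pl.Trainer(accelerator="cpu", max_epochs=10)             # runs normally
trainer_mps = pl.Trainer(accelerator="mps", devices=1, max_epochs=10)  # crashes as described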
To reproduce this example, run python main.py with the requirements. It seems that this error message has occurred in at least one other project:
What version are you seeing the problem on?
v2.0
More info
After switching to PyTorch nightly, it seems the follow-up error (semaphore leak) has disappeared.
cc @justusschock