Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors at once; across one or thousands of GPUs.

autocast is incorrectly applied even if the requested device is different. #709

Open kshitij12345 opened 3 weeks ago

kshitij12345 commented 3 weeks ago

In the example below, autocast is requested only for device cuda, yet thunder.jit still applies it to the CPU inputs.

import thunder
import torch

def foo(x, w):
    return torch.nn.functional.linear(x, w)

device = torch.device("cpu")
with device:
    x, w = torch.randn(16, 16), torch.randn(16, 16)
    print(x.dtype, w.dtype)

jfoo = thunder.jit(foo)

# Autocast is requested for cuda, but thunder.jit still applies it to the CPU inputs.
with torch.autocast("cuda", torch.bfloat16):
    jit_out = jfoo(x, w)

print(thunder.last_traces(jfoo)[-1])

Output

# Constructed by Delete Last Used (took 0 milliseconds)
from torch import Tensor
import torch
import torch.nn.functional
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(x, w):
  # x: "cpu f32[16, 16]"
  # w: "cpu f32[16, 16]"
  t0 = Tensor.to(x, torch.bfloat16, copy=True)  # t0: "cpu bf16[16, 16]"
    # t0 = ltorch.to(x, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None)  # t0: "cpu bf16[16, 16]"
      # t0 = prims.convert_element_type(x, dtypes.thunder.dtypes.bfloat16)  # t0: "cpu bf16[16, 16]"
  del x
  t1 = Tensor.to(w, torch.bfloat16, copy=True)  # t1: "cpu bf16[16, 16]"
    # t1 = ltorch.to(w, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None)  # t1: "cpu bf16[16, 16]"
      # t1 = prims.convert_element_type(w, dtypes.thunder.dtypes.bfloat16)  # t1: "cpu bf16[16, 16]"
  del w
  t2 = torch.nn.functional.linear(t0, t1, None)  # t2: "cpu bf16[16, 16]"
    # t2 = ltorch.linear(t0, t1, None)  # t2: "cpu bf16[16, 16]"
      # t2 = prims.linear(t0, t1, None)  # t2: "cpu bf16[16, 16]"
  del t0, t1
  return t2
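
For comparison, running the same function eagerly should keep the output in float32, since eager autocast only affects ops on the requested device type (a small sketch; the expected dtype is noted as a comment):

import torch

def foo(x, w):
    return torch.nn.functional.linear(x, w)

# CPU float32 inputs, same as the repro above
x, w = torch.randn(16, 16), torch.randn(16, 16)

with torch.autocast("cuda", torch.bfloat16):
    eager_out = foo(x, w)

# Expected: torch.float32, since eager autocast for cuda does not touch CPU ops
print(eager_out.dtype)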

cc @crcrpar

lantiga commented 3 weeks ago

that's right, in the autocast transform we don't consider the device

https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/core/transforms.py#L3788
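
For reference, a device-aware guard could look roughly like the sketch below (maybe_downcast is a hypothetical helper for illustration, not Thunder's actual transform code):

import torch

def maybe_downcast(t: torch.Tensor, autocast_device: str, autocast_dtype: torch.dtype) -> torch.Tensor:
    # Hypothetical: only convert tensors that live on the device type the
    # autocast context was requested for; leave everything else untouched.
    if t.device.type == autocast_device:
        return t.to(autocast_dtype)
    return t

With a check like this, the CPU inputs in the repro above would stay in float32 inside a cuda autocast region.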

does this have practical impacts on target models?

kshitij12345 commented 3 weeks ago

AFAIK, NeMo does use autocast. With our current implementation, we may silently add conversions when the user asked to apply autocast only on a certain device and there are operations running on both CPU and GPU inside that context. Honestly, I don't think this happens in practice.

@tfogal do you know if NeMo does both CPU and GPU operations (which are affected by autocast ctx manager) within a single autocast context?

tfogal commented 2 weeks ago

@tfogal do you know if NeMo does both CPU and GPU operations (which are affected by autocast ctx manager) within a single autocast context?

I don't know, sorry :-( @athitten might.

But I agree with you that it is unlikely, so we could just not support it for now. I would ask, though, that we 'loudly' not support mixed-device autocast: can we check for this case and error out when it happens?
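
A loud failure could be a simple device check before the autocast transform is applied, along these lines (a sketch; check_autocast_devices is a hypothetical name, not an existing Thunder function):

import torch

def check_autocast_devices(autocast_device: str, *tensors: torch.Tensor) -> None:
    # Hypothetical check: refuse mixed-device autocast instead of silently
    # downcasting tensors on devices the user did not request autocast for.
    mismatched = {t.device.type for t in tensors if t.device.type != autocast_device}
    if mismatched:
        raise RuntimeError(
            f"autocast was requested for device '{autocast_device}', but inputs live on "
            f"{sorted(mismatched)}; mixed-device autocast is not supported"
        )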

lantiga commented 2 weeks ago

I’m 100% for failing loudly if it’s not a beaten path (and this one looks like it’s not)

tfogal commented 2 weeks ago

triage review: