fsdp(jit(...)) transform can use more memory compared to jit(fsdp(...))

kshitij12345 commented 1 month ago

Because fsdp(jit(...)) holds on to the original parameters as well as the sharded parameters, it can lead to higher memory usage. I think a workaround is to initialize the original model on the meta device. But if using meta is the only correct way, then we should emit a warning when the user does otherwise.

import os
import torch
import torch.distributed as tdist
import thunder
import thunder.distributed

if __name__ == "__main__":
    tdist.init_process_group(backend="nccl")
    LOCAL_RANK = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", LOCAL_RANK)
    torch.set_default_device(device)

    class Model(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()
            self.p1 = torch.nn.Parameter(torch.ones(1024, 1024))

        def forward(self, x):
            return self.p1 + x

    with device:
        model = Model()
        input_t = torch.randn(1)

    # jit(fsdp(...))
    # Memory Allocated - 6291968
    # model = thunder.distributed.fsdp(model)
    # model = thunder.jit(model, executors=["torch"])

    # fsdp(jit(...))
    # Memory Allocated - 10486272
    model = thunder.jit(model, executors=["torch"])
    model = thunder.distributed.fsdp(model)

    _ = model(input_t)

    if LOCAL_RANK == 0:
        print(torch.cuda.memory_allocated())
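
As a quick sanity check on the two numbers quoted in the comments of the script, the arithmetic below compares their difference with the size of one full copy of p1. It assumes p1 is float32 (the PyTorch default dtype for torch.ones), and it only shows that the gap is consistent with one extra unsharded copy being kept alive; it does not prove where that copy lives.

```python
# Sizes reported in the script comments (bytes, rank 0).
JIT_FSDP_BYTES = 6291968    # jit(fsdp(...))
FSDP_JIT_BYTES = 10486272   # fsdp(jit(...))

# One full, unsharded copy of p1: 1024 x 1024 elements.
# Assumption: float32, i.e. 4 bytes per element.
FULL_PARAM_BYTES = 1024 * 1024 * 4

extra = FSDP_JIT_BYTES - JIT_FSDP_BYTES
print(extra)                      # 4194304
print(extra == FULL_PARAM_BYTES)  # True
```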

cc: @t-vi

cc @carmocca @awaelchli @crcrpar

t-vi commented 1 month ago

Indeed, and this is tricky:

To my mind, using meta is a great way to keep memory in check, but meta does not help us when we want to keep the weights' values. Maybe we should offer to replace the original weights with meta equivalents as an option?
At any rate, we would probably want to do something sane for loading / saving state dicts.
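
The trade-off described here can be sketched with plain PyTorch, leaving thunder out entirely. This is a minimal illustration, not what thunder does internally: it builds the Model from the repro on the meta device, then materializes it with to_empty, which allocates uninitialized storage. The name was_meta is just for illustration.

```python
import torch


class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.p1 = torch.nn.Parameter(torch.ones(1024, 1024))

    def forward(self, x):
        return self.p1 + x


# On the meta device the parameter has a shape and dtype but no storage,
# so constructing the model does not allocate memory for p1.
with torch.device("meta"):
    meta_model = Model()
was_meta = meta_model.p1.is_meta
print(was_meta, tuple(meta_model.p1.shape))

# Materializing gives p1 real but uninitialized storage. The ones() values
# from __init__ are not recovered; they would have to be reloaded,
# for example from a state dict.
meta_model.to_empty(device="cpu")
print(meta_model.p1.device)
```

This is why meta alone does not cover the case where the caller already has meaningful weights in the module.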

mruberry commented 4 weeks ago

triage review:

let's mutate the module we're given (so we don't keep its original tensors alive and use more memory)
practitioners can preserve the current behavior by copying the module before giving it to jit
we should be careful that retracing has the information needed (possibly observing original values) to work as expected
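
A rough sketch of the "copy before handing it over" option, in plain PyTorch. The function shard_in_place is a hypothetical stand-in for a transform that mutates its input (it keeps only the first half of each parameter along dim 0, as a rank-0 shard with world size 2 might); it is not thunder's API.

```python
import copy

import torch


def shard_in_place(module: torch.nn.Module) -> torch.nn.Module:
    # Hypothetical stand-in for a mutating transform: replace every
    # parameter with the first half of its rows.
    for submodule in module.modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            half = param.shape[0] // 2
            shard = param.detach()[:half].clone()
            setattr(submodule, name, torch.nn.Parameter(shard))
    return module


original = torch.nn.Linear(4, 4)

# Caller-side copy: this preserves the full weights, at the cost of
# keeping a second set of tensors in memory.
kept = copy.deepcopy(original)

sharded = shard_in_place(original)
print(tuple(sharded.weight.shape), tuple(kept.weight.shape))
```

Without the deepcopy, the full weights are gone once the transform has run, which is the memory saving the proposal is after.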

crcrpar commented 2 weeks ago

@mruberry

"we should be careful that retracing has the information needed (possibly observing original values) to work as expected"

Could you elaborate on what this means?

t-vi commented 2 weeks ago

Two parts:

The tracing (= a new run through the Python interpreter, on a cache miss) needs to happen on the original module (it currently does, and we won't be changing that).
When we move the tensors we override to "meta", we need to be sure that this does not screw up the tracing (likely by keeping the original device somewhere and using that when constructing the TensorProxy).
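
One way to picture the second part, as a sketch in plain PyTorch only: record each parameter's device before swapping it for a meta tensor, so that later code can still look up where the tensor originally lived. The helper move_params_to_meta and the returned dictionary are hypothetical; how thunder would actually feed this into the TensorProxy is not shown.

```python
import torch


def move_params_to_meta(module: torch.nn.Module) -> dict:
    """Hypothetical helper: swap parameters for meta tensors and
    return a mapping from qualified name to the original device."""
    original_devices = {}
    for mod_name, submodule in module.named_modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            qualified = f"{mod_name}.{name}" if mod_name else name
            original_devices[qualified] = param.device
            meta_param = torch.nn.Parameter(
                torch.empty_like(param, device="meta"),
                requires_grad=param.requires_grad,
            )
            setattr(submodule, name, meta_param)
    return original_devices


model = torch.nn.Linear(8, 8)
devices = move_params_to_meta(model)

# The module now reports "meta", but the recorded device is still available.
print(model.weight.device, devices["weight"])
```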

I would like the solution to #483 / #564 to enable moving materialization out of the sharding, so that we do it before we run the model and propagate the data through what we have for #483 (which needs to deal with "has been moved to meta", too).

Lightning-AI / lightning-thunder

fsdp(jit(...)) transform can use more memory compared to jit(fsdp(...)) #478