achew010 closed this PR 1 month ago.
@achew010 can you update the top-level comment with the previous memory allocation, and verify that the new measurements were obtained after reversing the hack in https://github.com/foundation-model-stack/fms-acceleration/pull/26/commits/80d631e64cd78d97c079b0346d90079e56d9f5f7
Description
This PR addresses #18 with the following contributions:

1. Patch `make_sure_no_tensor_in_meta_device` to avoid raising an error when the model has no bias in low memory mode.
2. Set `device_map` to `cpu` when loading checkpoints, to avoid GPU memory consumption before trainer initialization. Note: this approach diverts consumption to CPU memory, which could still be a bottleneck; a better approach could be to load to the `meta` device. QLoRA currently loads quantized models to `cpu` in low memory mode as well. See here.

TODO:

- Load to the `meta` device.

Tests
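As a toy, self-contained illustration of the behavior the patch changes (these are stand-ins written for this note, not the actual library code; the real `make_sure_no_tensor_in_meta_device` inspects module parameters, which we approximate here with a dict of `{param_name: device_or_None}`):

```python
# Toy stand-ins, not the actual GPTQ loading-path utility.

def make_sure_no_tensor_in_meta_device(param_devices):
    # Original behavior (sketched): chokes when a parameter slot is empty,
    # e.g. a bias that does not exist in low memory mode.
    for name, device in param_devices.items():
        if device is None:
            raise AttributeError(f"{name} has no tensor to inspect")
        if device == "meta":
            raise ValueError(f"{name} is on the meta device")

def patched_make_sure_no_tensor_in_meta_device(param_devices):
    # Patched behavior (sketched): skip absent parameters, keep the
    # meta-device check for the ones that do exist.
    for name, device in param_devices.items():
        if device is None:
            continue
        if device == "meta":
            raise ValueError(f"{name} is on the meta device")

layer = {"weight": "cpu", "bias": None}  # a bias-less module in low memory mode
patched_make_sure_no_tensor_in_meta_device(layer)  # passes where the original raised
```

The patched variant still rejects genuine meta-device tensors; it only stops treating a missing bias as an error.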
Reproduction command
Comparison
Before Fix:
| name | config | gpus | train batch size | mem reserved (GiB) | mem alloc (GiB) | mem alloc (GiB) |
| --- | --- | --- | --- | --- | --- | --- |
After Fix:
| name | config | gpus | train batch size | mem reserved (GiB) | mem alloc (GiB) | mem alloc (GiB) |
| --- | --- | --- | --- | --- | --- | --- |
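The memory columns above are the kind of figures PyTorch's CUDA statistics report; a minimal sketch of how such measurements are typically taken (assuming PyTorch and a CUDA device are available; the helper name `bytes_to_gib` is ours):

```python
def bytes_to_gib(n: int) -> float:
    """Convert a raw byte count to GiB, the unit used in the tables."""
    return n / 2**30

try:
    import torch
    if torch.cuda.is_available():
        # Peak statistics accumulated since the last reset: "reserved"
        # covers the caching allocator's pool, "alloc" covers live tensors.
        print(f"mem reserved (GiB): {bytes_to_gib(torch.cuda.max_memory_reserved()):.2f}")
        print(f"mem alloc (GiB):    {bytes_to_gib(torch.cuda.max_memory_allocated()):.2f}")
except ImportError:
    pass  # torch not installed; bytes_to_gib is still usable on its own
```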