Closed — sgsdxzy closed this 3 weeks ago
The outer `model.named_parameters()` loop during loading still keeps a reference to the weights, so we have to force-release the memory by setting `.data = torch.empty(0, device="cpu")`.
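A minimal sketch of the release pattern, assuming a stand-in `model` (the real loader iterates the actual model's layers; the point is that the loop variable itself keeps the old storage alive until `.data` is reassigned):

```python
import torch
import torch.nn as nn

# Stand-in model; in the real loader this is the model being quantized/loaded.
model = nn.Linear(1024, 1024)

for name, param in model.named_parameters():
    # ... the weight would be processed and moved to the GPU here ...
    # `param` still references the original CPU tensor, so its storage
    # is not freed until we point .data at an empty placeholder:
    param.data = torch.empty(0, device="cpu")

# Every parameter's storage is now an empty placeholder.
assert all(p.data.numel() == 0 for p in model.parameters())
```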
`q_invperm` is not used later in the kernel, so we should probably remove it after loading as well.
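A hedged sketch of dropping a load-time-only buffer; `q_invperm` is registered here on a stand-in module, whereas in the real code it lives on each quantized layer:

```python
import torch
import torch.nn as nn

# Stand-in layer with an inverse-permutation buffer, as a quantized
# layer might carry after loading.
layer = nn.Module()
layer.register_buffer("q_invperm", torch.arange(4096))

# If the kernel never consults q_invperm after load, its storage can be
# released by clearing the buffer (nn.Module allows buffers to be None):
layer.q_invperm = None

assert layer.q_invperm is None
```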
Sharding on CPU and then copying the contiguous tensor reduces the GPU VRAM bubble during loading, which frees up more GPU blocks (36k -> 41k context for me).
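The idea above can be sketched as follows (shapes and the shard slice are illustrative assumptions, not the PR's actual code): slicing and compacting on CPU means only the shard is ever allocated on the device, instead of a transient full-size copy.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Full tensor stays on the CPU; only a contiguous shard is transferred.
full_weight = torch.randn(8192, 4096)

# Slice on CPU, then make it contiguous so the device copy is one
# compact allocation rather than a view of the full tensor:
shard = full_weight[:, 2048:4096].contiguous()
device_shard = shard.to(device)

# The device only ever holds the shard, never the full tensor,
# avoiding the transient full-size allocation ("VRAM bubble").
assert device_shard.shape == (8192, 2048)
assert shard.is_contiguous()
```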
Reduce VRAM usage during exl2 model loading.