PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

Split the exl2 weight ASAP. #423

Closed sgsdxzy closed 3 weeks ago

sgsdxzy commented 4 weeks ago

Reduce VRAM usage during exl2 model loading.

sgsdxzy commented 4 weeks ago

The outer model.named_parameters() loop during loading still keeps a reference to the weights, so we have to force-release the memory by setting .data = torch.empty(0, device="cpu"). q_invperm is not used by the kernel after loading, so we should probably remove it then as well.
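For illustration, a minimal sketch of that pattern (the loop shape, repack_for_kernel, and kernel_buffers are hypothetical stand-ins, not Aphrodite's actual loader):

```python
import torch

def repack_for_kernel(weight: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real exl2 repacking step (illustrative only):
    # here it just moves the tensor to the GPU.
    return weight.to("cuda")

def release_after_repack(model: torch.nn.Module) -> dict:
    kernel_buffers = {}
    for name, param in model.named_parameters():
        kernel_buffers[name] = repack_for_kernel(param.data)
        # The loop still holds a reference to the original tensor, so its
        # storage would stay allocated until iteration ends. Pointing .data
        # at an empty CPU tensor force-releases it right away. The same
        # trick would apply to setup-only tensors such as q_invperm.
        param.data = torch.empty(0, device="cpu")
    return kernel_buffers
```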

sgsdxzy commented 4 weeks ago

Sharding on the CPU and then copying the contiguous tensor to the GPU reduces the GPU VRAM bubble during loading, increasing the number of GPU blocks further (36k -> 41k ctx for me).
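A minimal sketch of the idea, with illustrative names (rank, world_size, and the row-wise split are assumptions, not the engine's actual sharding logic): because the slice and the contiguous copy both happen on the CPU, the GPU only ever allocates shard-sized memory, instead of briefly holding the full tensor in VRAM and slicing it there.

```python
import torch

def shard_on_cpu(full_weight_cpu: torch.Tensor, rank: int,
                 world_size: int) -> torch.Tensor:
    # Take this rank's row-wise shard as a CPU view, then materialize it
    # as a contiguous CPU tensor before the device copy. The full tensor
    # never touches the GPU, so the transient VRAM "bubble" disappears.
    shard_size = full_weight_cpu.shape[0] // world_size
    shard = full_weight_cpu.narrow(0, rank * shard_size, shard_size)
    return shard.contiguous().to("cuda")
```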