PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

Split the exl2 weight ASAP. #423

Closed sgsdxzy closed 3 weeks ago

sgsdxzy commented 4 weeks ago

Reduce VRAM usage during exl2 model loading.

sgsdxzy commented 4 weeks ago

The outer model.named_parameters() loop during loading still keeps a reference to the weights, so we have to force-release the memory by setting .data = torch.empty(0, device="cpu"). q_invperm is not used by the kernel after loading, so we should probably remove it then as well.
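For illustration, a minimal sketch of that pattern (the loop shape, repack_for_kernel, and kernel_buffers are hypothetical stand-ins, not Aphrodite's actual loader):

```python
import torch

def repack_for_kernel(weight: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real exl2 repacking step (illustrative only):
    # here it just moves the tensor to the GPU.
    return weight.to("cuda")

def release_after_repack(model: torch.nn.Module) -> dict:
    kernel_buffers = {}
    for name, param in model.named_parameters():
        kernel_buffers[name] = repack_for_kernel(param.data)
        # The loop still holds a reference to the original tensor, so its
        # storage would stay allocated until iteration ends. Pointing .data
        # at an empty CPU tensor force-releases it right away. The same
        # trick would apply to setup-only tensors such as q_invperm.
        param.data = torch.empty(0, device="cpu")
    return kernel_buffers
```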

sgsdxzy commented 4 weeks ago

Sharding on the CPU and then copying the contiguous tensor to the GPU reduces the GPU VRAM bubble during loading, increasing the number of GPU blocks further (36k -> 41k ctx for me).
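A minimal sketch of the idea, with illustrative names (rank, world_size, and the row-wise split are assumptions, not the engine's actual sharding logic): because the slice and the contiguous copy both happen on the CPU, the GPU only ever allocates shard-sized memory, instead of briefly holding the full tensor in VRAM and slicing it there.

```python
import torch

def shard_on_cpu(full_weight_cpu: torch.Tensor, rank: int,
                 world_size: int) -> torch.Tensor:
    # Take this rank's row-wise shard as a CPU view, then materialize it
    # as a contiguous CPU tensor before the device copy. The full tensor
    # never touches the GPU, so the transient VRAM "bubble" disappears.
    shard_size = full_weight_cpu.shape[0] // world_size
    shard = full_weight_cpu.narrow(0, rank * shard_size, shard_size)
    return shard.contiguous().to("cuda")
```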