PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0
606 stars, 78 forks

Fix/exl2 split #437

Closed. sgsdxzy closed this 3 weeks ago

sgsdxzy commented 3 weeks ago

Shard exl2 weights between ranks evenly by memory chunk sizes, not number of rows.

Model: turboderp_command-r-plus-103B-exl2_4.5bpw, tp=4
Before: 16.12 GiB on GPU0, 15.18 GiB on GPU1, 54048 max ctx
After: 15.21 GiB on GPU0, 15.20 GiB on GPU1, 67840 max ctx
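For readers unfamiliar with the idea, here is a rough sketch of why splitting by row count can leave ranks unbalanced when rows are stored at different bit widths, and how splitting by cumulative bytes evens things out. This is not the code from this PR: the function names (`split_by_rows`, `split_by_bytes`, `shard_sizes`) and the per-row byte list are hypothetical, and a real exl2 loader would presumably also have to keep cut points aligned to quantization group boundaries, which the sketch ignores.

```python
from bisect import bisect_left
from itertools import accumulate


def split_by_rows(row_bytes, n_ranks):
    """Baseline: cut points that give every rank the same number of rows."""
    n = len(row_bytes)
    return [n * r // n_ranks for r in range(n_ranks + 1)]


def split_by_bytes(row_bytes, n_ranks):
    """Cut points that give every rank roughly the same number of bytes.

    row_bytes[i] is the (hypothetical) storage size of row i. Each cut is
    placed at the row boundary whose cumulative size is closest to r/n_ranks
    of the total.
    """
    prefix = [0] + list(accumulate(row_bytes))  # prefix[i] = bytes in rows [0, i)
    total = prefix[-1]
    bounds = [0]
    for r in range(1, n_ranks):
        target = total * r / n_ranks
        i = bisect_left(prefix, target)
        # Step back one boundary if it is at least as close to the target.
        if i > 0 and target - prefix[i - 1] <= prefix[i] - target:
            i -= 1
        bounds.append(max(i, bounds[-1]))  # keep cut points non-decreasing
    bounds.append(len(row_bytes))
    return bounds


def shard_sizes(row_bytes, bounds):
    """Bytes held by each rank for the given cut points."""
    return [sum(row_bytes[a:b]) for a, b in zip(bounds, bounds[1:])]


# Toy layout: 4 "wide" rows of 8 bytes followed by 8 "narrow" rows of 4 bytes.
rows = [8] * 4 + [4] * 8
by_rows = shard_sizes(rows, split_by_rows(rows, 2))
by_bytes = shard_sizes(rows, split_by_bytes(rows, 2))
print(by_rows, by_bytes)
```

In the toy layout, the row-count split puts most of the wide rows on the first rank, while the byte-based split moves the cut earlier so both ranks hold the same amount. The per-GPU memory figures reported above are consistent with this kind of imbalance on rank 0 being removed.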