This PR adds `--load_in_8bit` and `--load_in_4bit` flags and supports 8-bit and 4-bit model inference. Addresses https://github.com/bigcode-project/bigcode-evaluation-harness/issues/91 (although SantaCoder has known issues with inference in fp16, and as a consequence in 8-bit as well, in particular with top-p sampling; greedy decoding seems to work fine. That is outside the scope of this PR).
Tested on StarCoder for HumanEval, and it seems to work properly on 4 GPUs. `load_in_4bit` gives `"pass@1": 0.35243902439024394` for the same parameters (users need to have `bitsandbytes` installed).
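For reference, a minimal sketch of how flags like these are typically wired into `from_pretrained` keyword arguments. This is not the harness's actual implementation; the `build_model_kwargs` helper and the `device_map="auto"` default are assumptions for illustration:

```python
# Hypothetical sketch: translate --load_in_8bit / --load_in_4bit CLI flags
# into kwargs for transformers.AutoModelForCausalLM.from_pretrained.
# (Helper name and device_map default are assumptions, not the PR's code.)
import argparse

def build_model_kwargs(args):
    """Map quantization flags to from_pretrained keyword arguments."""
    kwargs = {"device_map": "auto"}  # assumed default for multi-GPU dispatch
    if args.load_in_8bit:
        kwargs["load_in_8bit"] = True  # requires bitsandbytes
    elif args.load_in_4bit:
        kwargs["load_in_4bit"] = True  # requires bitsandbytes
    return kwargs

parser = argparse.ArgumentParser()
parser.add_argument("--load_in_8bit", action="store_true")
parser.add_argument("--load_in_4bit", action="store_true")

args = parser.parse_args(["--load_in_4bit"])
print(build_model_kwargs(args))  # {'device_map': 'auto', 'load_in_4bit': True}
```

The flags are mutually exclusive here (8-bit takes precedence); the kwargs dict would then be splatted into `AutoModelForCausalLM.from_pretrained(model_name, **kwargs)`.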