This PR adds `--load_in_8bit` and `--load_in_4bit` flags and supports 8-bit and 4-bit model inference. Addresses https://github.com/bigcode-project/bigcode-evaluation-harness/issues/91 (although SantaCoder has known issues with inference in fp16, and as a consequence in 8-bit as well, in particular with top-p sampling; greedy decoding seems to work fine. That is outside the scope of this PR).
Tested on StarCoder for HumanEval, and it seems to work properly on 4 GPUs. `load_in_4bit` gives `"pass@1": 0.35243902439024394` for the same parameters (users need to have `bitsandbytes` installed).
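For reference, a minimal sketch of how flags like these are typically wired into `from_pretrained` keyword arguments. This is not the harness's actual implementation; the `build_model_kwargs` helper and the `device_map="auto"` default are assumptions for illustration:

```python
# Hypothetical sketch: translate --load_in_8bit / --load_in_4bit CLI flags
# into kwargs for transformers.AutoModelForCausalLM.from_pretrained.
# (Helper name and device_map default are assumptions, not the PR's code.)
import argparse

def build_model_kwargs(args):
    """Map quantization flags to from_pretrained keyword arguments."""
    kwargs = {"device_map": "auto"}  # assumed default for multi-GPU dispatch
    if args.load_in_8bit:
        kwargs["load_in_8bit"] = True  # requires bitsandbytes
    elif args.load_in_4bit:
        kwargs["load_in_4bit"] = True  # requires bitsandbytes
    return kwargs

parser = argparse.ArgumentParser()
parser.add_argument("--load_in_8bit", action="store_true")
parser.add_argument("--load_in_4bit", action="store_true")

args = parser.parse_args(["--load_in_4bit"])
print(build_model_kwargs(args))  # {'device_map': 'auto', 'load_in_4bit': True}
```

The flags are mutually exclusive here (8-bit takes precedence); the kwargs dict would then be splatted into `AutoModelForCausalLM.from_pretrained(model_name, **kwargs)`.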