bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Potentially extra slow inference when using LoRA adapter #192

Open sadaisystems opened 5 months ago

sadaisystems commented 5 months ago

Hello, everybody. I tried the HumanEval benchmark on my custom Mistral fine-tune today, but I'm getting a strange warning:

UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(f'Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.')

I don't know how to fix this. Any ideas?

My command to run the benchmark:

accelerate launch  main.py \
  --model {model_name} \
  --peft_model {peft_model_path} \
  --load_in_4bit \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --precision bf16 \
  --n_samples 200 \
  --batch_size 32 \
  --allow_code_execution \
  --limit 25 
sadaisystems commented 5 months ago

This seems to occur only when --load_in_4bit is passed.
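For reference, the warning comes from bitsandbytes: with 4-bit quantization, the Linear4bit layers default to bnb_4bit_compute_dtype=torch.float32, so bf16/fp16 activations are upcast on every forward pass. Below is a minimal sketch of how the compute dtype is typically set explicitly via BitsAndBytesConfig when loading a base model plus a LoRA adapter directly with transformers/peft; the model name and adapter path are placeholders, and this is not a claim about how the harness's own --load_in_4bit flag builds its config.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Match the compute dtype to the activation dtype so Linear4bit
    # does not fall back to its float32 default (the warning above).
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder base model
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "path/to/peft_adapter")  # placeholder adapter path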