FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

question about quantization #119

Open xinhaoc opened 1 year ago

xinhaoc commented 1 year ago

Hi FlexGen team! I have a question about your quantization algorithm. Are you using the function run_float_quantize for int4/int8 compression? When I run the test (test_float_quantize), it fails because the quantized params differ from the DeepSpeed version (the ref_out_tensor is the same). The DeepSpeed params can recover the float16 tensor, but the run_float_quantize params cannot. Thanks!
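For context, here is a minimal sketch of the kind of round trip I would expect: group-wise asymmetric min-max quantization where dequantizing the params approximately recovers the original float16 tensor. Function names and the group size are illustrative, not taken from the FlexGen or DeepSpeed code.

```python
import numpy as np

def quantize_groupwise(x, num_bits=4, group_size=64):
    """Illustrative group-wise asymmetric min-max quantization (not FlexGen's actual code)."""
    orig_shape = x.shape
    groups = x.astype(np.float32).reshape(-1, group_size)
    mn = groups.min(axis=1, keepdims=True)
    mx = groups.max(axis=1, keepdims=True)
    scale = (mx - mn) / (2 ** num_bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero for constant groups
    q = np.round((groups - mn) / scale).astype(np.uint8)
    return q, mn, scale, orig_shape

def dequantize_groupwise(q, mn, scale, orig_shape):
    """Recover an approximate float16 tensor from the quantized groups."""
    return (q.astype(np.float32) * scale + mn).reshape(orig_shape).astype(np.float16)

# Round trip: the dequantized tensor should closely match the original fp16 input.
x = np.random.randn(4, 64).astype(np.float16)
q, mn, scale, shape = quantize_groupwise(x, num_bits=4, group_size=64)
x_rec = dequantize_groupwise(q, mn, scale, shape)
print(np.abs(x.astype(np.float32) - x_rec.astype(np.float32)).max())
```

With DeepSpeed's params this kind of reconstruction works for me, but with the params produced by run_float_quantize it does not.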