deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

[Neo] Fix Neo Quantization properties output. Add some additional configuration. #2077

Closed a-ys closed 2 weeks ago

a-ys commented 2 weeks ago

Description

Neo serving.properties output

Currently, the Neo Quantization script always quantizes at tensor_parallel_degree=8 and writes tensor_parallel_degree=8 to serving.properties. That value is often incompatible with serving, so we will stop outputting it.

Specifically, small AWQ-quantized models like Llama-2-7b cannot be served with tp=8, because intermediate_size / tp_degree must be divisible by the quantization group size (128). In this case, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4; tp=8 would give 5632 / 8 = 704, which is not a multiple of 128.
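The divisibility rule above can be checked with a few lines of Python. This is only an illustrative sketch, not code from this PR: the function name `valid_tp_degrees` and the candidate list are assumptions made for the example.

```python
def valid_tp_degrees(intermediate_size, group_size=128, candidates=(1, 2, 4, 8)):
    """Return the candidate TP degrees whose per-shard width is a whole
    number of quantization groups.

    Hypothetical helper for illustration; not part of the Neo script.
    """
    valid = []
    for tp in candidates:
        # The layer must split evenly across the shards first.
        if intermediate_size % tp != 0:
            continue
        shard_width = intermediate_size // tp
        # Each shard must then contain only complete quantization groups.
        if shard_width % group_size == 0:
            valid.append(tp)
    return valid


# Values taken from the description: intermediate_size 5632, group size 128.
print(valid_tp_degrees(5632))  # [1, 2, 4]
```

With tp=8 each shard would be 704 wide, which is 5.5 groups of 128, so 8 is rejected.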

New behavior: Neo still quantizes with tensor_parallel_degree=8, but the tensor_parallel_degree written to serving.properties will depend on customer input to Neo.
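One way to picture the new behavior is the sketch below. It is an assumption about the shape of the logic, not the actual implementation in this PR: the function names are hypothetical, and it assumes that "depends on customer input" means the customer's value is passed through when present and the key is omitted otherwise.

```python
# Degree used only while quantizing, per the description above.
QUANTIZATION_TP_DEGREE = 8


def build_output_properties(customer_properties):
    """Build the properties to write after quantization.

    Hypothetical helper: the quantization-time degree is never copied
    into the output; only a customer-supplied value is passed through.
    """
    output = {"option.quantize": "awq"}
    tp = customer_properties.get("option.tensor_parallel_degree")
    if tp is not None:
        output["option.tensor_parallel_degree"] = str(tp)
    return output


def render_properties(props):
    """Render a dict as key=value lines, sorted for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


print(render_properties(build_output_properties({"option.tensor_parallel_degree": 4})))
```

If the customer supplies no degree, the sketch leaves tensor_parallel_degree out of the output entirely, so the serving side can pick a compatible value.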

Neo environment variables updates

We will accept SM_NEO_HF_CACHE_DIR as the quantization dataset cache directory for forward compatibility, in case future containers have both a compilation cache directory and an HF/datasets cache directory.
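A minimal sketch of how such a lookup could work is shown here. Only SM_NEO_HF_CACHE_DIR comes from this description; the fallback variable name `SM_NEO_CACHE_DIR`, the function name, and the lookup order are assumptions made for illustration.

```python
import os


def resolve_dataset_cache_dir(env=None):
    """Pick the dataset cache directory from environment variables.

    Prefers SM_NEO_HF_CACHE_DIR; falls back to a compilation cache
    variable (name assumed here) when the HF-specific one is unset.
    Returns None when neither is set.
    """
    env = os.environ if env is None else env
    # "SM_NEO_CACHE_DIR" is a hypothetical fallback name for this sketch.
    for name in ("SM_NEO_HF_CACHE_DIR", "SM_NEO_CACHE_DIR"):
        value = env.get(name)
        if value:
            return value
    return None


print(resolve_dataset_cache_dir({"SM_NEO_HF_CACHE_DIR": "/opt/ml/hf-cache"}))
```

Passing the environment as a mapping keeps the function easy to test without modifying the real process environment.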