Currently, the Neo Quantization script always quantizes at tensor_parallel_degree=8 and writes tensor_parallel_degree=8 to the output serving.properties. This value is often incompatible with serving, so we will stop outputting it.
Specifically, small AWQ-quantized models such as Llama-2-7b cannot be served with tp=8, because intermediate_size / tp_degree must be divisible by the quantization group size (128). In this case, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4 (5632 / 8 = 704, which is not a multiple of 128).
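The divisibility constraint above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `valid_tp_degrees` and the candidate list `(1, 2, 4, 8)` are assumptions for this example and are not part of the Neo script.

```python
def valid_tp_degrees(intermediate_size, group_size=128, candidates=(1, 2, 4, 8)):
    """Return the candidate tp degrees whose per-shard size is a multiple of group_size."""
    valid = []
    for tp in candidates:
        # The dimension must split evenly across the tensor-parallel ranks.
        if intermediate_size % tp != 0:
            continue
        shard = intermediate_size // tp
        # Each shard must contain a whole number of quantization groups.
        if shard % group_size == 0:
            valid.append(tp)
    return valid


# intermediate_size and group size taken from the description above.
print(valid_tp_degrees(5632))  # [1, 2, 4]
```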
New behavior: Neo still quantizes with tensor_parallel_degree=8, but the value written to the output serving.properties depends on customer input to Neo.
If a customer passes tensor_parallel_degree in serving.properties or through the environment variable (but not both):
The supplied tensor_parallel_degree will be passed through to the output.
If a customer passes tensor_parallel_degree in serving.properties AND through the environment variable:
The environment variable's tensor_parallel_degree takes precedence and will be passed through to the output.
If a customer does not pass either:
tensor_parallel_degree will not be included in the output serving.properties. The customer can update serving.properties manually, or pass an environment variable during serving.
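The three cases above amount to a simple precedence rule, sketched below. The key names `option.tensor_parallel_degree` and `OPTION_TENSOR_PARALLEL_DEGREE`, and both helper functions, are assumptions made for illustration; the actual names used by the Neo script may differ.

```python
# Assumed names for illustration only; the real script may use different keys.
PROP_KEY = "option.tensor_parallel_degree"
ENV_KEY = "OPTION_TENSOR_PARALLEL_DEGREE"


def resolve_output_tp_degree(properties, env):
    """Return the tp degree to write to the output, or None to omit it."""
    env_value = env.get(ENV_KEY)
    if env_value is not None:
        # The environment variable wins when both sources are set.
        return env_value
    # Otherwise fall back to serving.properties (None if it is absent there too).
    return properties.get(PROP_KEY)


def build_output_properties(properties, env):
    """Copy the input properties, writing tp degree only if the customer supplied one."""
    output = {k: v for k, v in properties.items() if k != PROP_KEY}
    tp_degree = resolve_output_tp_degree(properties, env)
    if tp_degree is not None:
        output[PROP_KEY] = tp_degree
    return output


# Both sources set: the environment variable value is passed through.
print(build_output_properties({PROP_KEY: "2"}, {ENV_KEY: "4"}))
# Neither source set: the key is omitted from the output.
print(build_output_properties({}, {}))
```

Note that the quantization-time value of 8 never appears in this logic: it is used internally by Neo and is not a default for the output.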
Neo environment variable updates
We will accept SM_NEO_HF_CACHE_DIR as the quantization dataset cache directory for forward compatibility, in case future containers have both a compilation cache directory and an HF/datasets cache directory.
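A minimal sketch of how the script might consume this variable is shown below. The helper `quantization_dataset_kwargs` is hypothetical, and the behavior when the variable is unset (returning no override so the library default applies) is an assumption rather than something this PR specifies.

```python
import os


def quantization_dataset_kwargs(env=None):
    """Build keyword arguments for loading the quantization dataset."""
    env = os.environ if env is None else env
    cache_dir = env.get("SM_NEO_HF_CACHE_DIR")
    # Assumption: when the variable is unset or empty, pass no override and
    # let the dataset library fall back to its own default cache location.
    return {"cache_dir": cache_dir} if cache_dir else {}


# The result could be forwarded to a loader that accepts cache_dir,
# e.g. datasets.load_dataset(name, **kwargs).
print(quantization_dataset_kwargs({"SM_NEO_HF_CACHE_DIR": "/tmp/hf-cache"}))
```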