[Neo] Fix Neo Quantization properties output. Add some additional configuration.

Description

Neo serving.properties output

Currently, the Neo Quantization script always quantizes at tensor_parallel_degree=8 and writes tensor_parallel_degree=8 to serving.properties. This value is often incompatible with serving, so Neo will no longer write it unconditionally.

Specifically, small AWQ-quantized models like Llama-2-7b cannot be served with tp=8. This is because intermediate_size / tp_degree must be divisible by the quantization group size (128). In this case, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4 (5632 / 8 = 704, which is not a multiple of 128).
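
The constraint above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `valid_tp_degrees` and the candidate list are hypothetical and are not part of the Neo script.

```python
def valid_tp_degrees(intermediate_size, group_size=128, candidates=(1, 2, 4, 8)):
    """Return the tp degrees whose per-shard size is a multiple of group_size.

    Hypothetical helper for illustration; not part of the Neo script.
    """
    valid = []
    for tp in candidates:
        # The dimension must split evenly across the tensor-parallel shards.
        if intermediate_size % tp != 0:
            continue
        shard = intermediate_size // tp
        # Each shard must contain a whole number of quantization groups.
        if shard % group_size == 0:
            valid.append(tp)
    return valid


print(valid_tp_degrees(5632))  # [1, 2, 4]
```

With intermediate_size=5632 and a group size of 128, tp=8 is rejected because each shard would hold 704 values, which is 5.5 groups.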

New behavior: Neo still quantizes with tensor_parallel_degree=8, but the value written to the output serving.properties depends on customer input to Neo.

If a customer passes tensor_parallel_degree in serving.properties or through the environment variable (but not both):
- The supplied tensor_parallel_degree is passed through to the output.
If a customer passes tensor_parallel_degree in both serving.properties and the environment variable:
- The environment variable's tensor_parallel_degree takes precedence and is passed through to the output.
If a customer passes neither:
- tensor_parallel_degree is omitted from the output serving.properties. The customer can update serving.properties manually or pass an environment variable during serving.
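
The three cases above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, and the keys `option.tensor_parallel_degree` and `OPTION_TENSOR_PARALLEL_DEGREE` are assumed here to be the property and environment variable names.

```python
# Assumed key names, for illustration only.
PROP_KEY = "option.tensor_parallel_degree"
ENV_KEY = "OPTION_TENSOR_PARALLEL_DEGREE"


def resolve_output_tp(input_props, env):
    """Return the tp degree to write to the output, or None to omit it."""
    env_value = env.get(ENV_KEY)
    if env_value is not None:
        # The environment variable wins, whether or not the property is set.
        return env_value
    # Falls back to the property; None if the customer passed neither.
    return input_props.get(PROP_KEY)


def build_output_properties(input_props, env):
    """Copy the input properties, setting or omitting the tp degree."""
    out = {k: v for k, v in input_props.items() if k != PROP_KEY}
    tp = resolve_output_tp(input_props, env)
    if tp is not None:
        out[PROP_KEY] = tp
    return out


# Both set: the environment variable value is written.
print(build_output_properties({PROP_KEY: "4"}, {ENV_KEY: "2"}))
# Neither set: the key is absent from the output.
print(build_output_properties({"option.quantize": "awq"}, {}))
```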

Neo environment variables updates

We will accept SM_NEO_HF_CACHE_DIR as the quantization dataset cache directory for forward compatibility, in case future containers provide both a compilation cache directory and an HF/datasets cache directory.
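
As a rough sketch of how such a variable might be read, assuming a caller-supplied fallback when it is unset (the helper name and the fallback behavior are hypothetical and not confirmed by this PR):

```python
import os

HF_CACHE_ENV = "SM_NEO_HF_CACHE_DIR"


def resolve_dataset_cache_dir(env=None, default=None):
    """Return the dataset cache directory from the environment, else default.

    Hypothetical helper; the fallback is an assumption for illustration.
    """
    env = os.environ if env is None else env
    value = env.get(HF_CACHE_ENV)
    # Treat an empty string the same as an unset variable.
    return value if value else default
```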

deepjavalibrary / djl-serving