hseok-oh closed this 2 months ago
Below is a schema draft to represent ggml's quantization type (block quantization). Please give your opinion about this. @seanshpark @mhs4670go @jinevening @chunseoklee @glistening @jyoungyun @ragmani
// Block quantization: from ggml quantization (https://github.com/ggerganov/ggml)
table CircleBlockQuantization {
  name:string;
}

// Represents a specific quantization technique's parameters.
union QuantizationDetails {
  CustomQuantization,
  CircleBlockQuantization
}

// Parameters for converting a quantized tensor back to float.
table QuantizationParameters {
  // These four parameters are the asymmetric linear quantization parameters.
  // Given a quantized value q, the corresponding float value f should be:
  //   f = scale * (q - zero_point)
  // For other quantization types, the QuantizationDetails below is used.
  // NOTE min/max values are valid if
  //   1. length of min/max == 0, or
  //   2. length of min/max == length of scale/zero_point
  // Otherwise, min/max are not valid (undefined behavior).
  min:[float];
  max:[float];
  scale:[float];  // For dequantizing the tensor's values.
  zero_point:[long];

  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;

  // Specifies the dimension of the Tensor's shape that the scales and
  // zero_points correspond to. For example, a tensor t, with dims=[4, 3, 2, 1]
  // with quantization params:
  //   scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantization_dimension=1
  // will be quantized across the second dimension of t.
  //   t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
  //   t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
  //   t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
  quantized_dimension:int;
}
This introduces a new QuantizationDetails union table, CircleBlockQuantization, for the details field. The details field has never been used before, so this will be its first use. CircleBlockQuantization has a name field that holds ggml's quantization type name (e.g. Q4_0, Q4_1, Q8_0, etc.). If the details field has any value, the other QuantizationParameters fields will not be used to decide the quantization type.
Quantization parameters such as scales are stored in the buffer together with the quantized values, so there is no separate field to save the deltas (scales) for each block. This is the same policy as ggml's quantization/dequantization.
Below is the Q4_0 block structure in the buffer.
https://github.com/ggerganov/ggml/blob/2438d62cb9290b5b5dc6228dec76fe81cf64238e/src/ggml-common.h#L144-L149
#define QK4_0 32
typedef struct {
    ggml_half d;           // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2, "wrong q4_0 block size/padding");
(ggml_half: fp16)
Addition: @glistening's comment https://github.com/Samsung/ONE/pull/13693#discussion_r1732497335
Is the prefix Circle necessary to avoid a name conflict in the flatbuffers-generated files? I guess GGMLBlockQuantization may be better, as @jinevening suggested offline. It makes it clear what CircleBlockQuantization means.
How about adding a new dtype (QK4_0, etc.) rather than extending CircleQuantParam? If the parameters are saved with the weights, we may not need an additional data structure for qparam.
Why?
- CircleQuantParam will have a single responsibility: it is only used for affine quantization.
- CircleQuantParam is used in many places, so I'd like to minimize side effects.

@jinevening I've updated the circle schema based on your comment.
enum TensorType : byte {
UINT4 = -1,
FLOAT32 = 0,
FLOAT16 = 1,
INT32 = 2,
UINT8 = 3,
INT64 = 4,
STRING = 5,
BOOL = 6,
INT16 = 7,
COMPLEX64 = 8,
INT8 = 9,
FLOAT64 = 10,
COMPLEX128 = 11,
UINT64 = 12,
// Experimental: Resource and variant types are experimental, that are subject
// to change. Do not implement custom kernels using resource & variant types
// now.
RESOURCE = 13,
VARIANT = 14,
UINT32 = 15,
UINT16 = 16,
INT4 = 17,
// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
Q4_0 = 18,
Q4_1 = 19,
Q8_0 = 20,
Q8_1 = 21,
}
There is no issue in the runtime with using this spec. @seanshpark @mhs4670go Is it OK to use this type in the compiler?
UINT4 = -1 was added but does not exist in tflite. So, does the new Qx_y exist in tflite?
No. I'll update to use negative values.
Updated
// The type of data stored in a tensor.
// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {
FLOAT32 = 0,
FLOAT16 = 1,
INT32 = 2,
UINT8 = 3,
INT64 = 4,
STRING = 5,
BOOL = 6,
INT16 = 7,
COMPLEX64 = 8,
INT8 = 9,
FLOAT64 = 10,
COMPLEX128 = 11,
UINT64 = 12,
// Experimental: Resource and variant types are experimental, that are subject
// to change. Do not implement custom kernels using resource & variant types
// now.
RESOURCE = 13,
VARIANT = 14,
UINT32 = 15,
UINT16 = 16,
INT4 = 17,
// The following use negative values to represent TensorTypes that do not exist in the TensorFlow Lite schema
UINT4 = -1,
Q4_0 = -2,
Q4_1 = -3,
Q8_0 = -4,
Q8_1 = -5,
}
Negative-value items are placed at the back... does the generated header code have any problem?
No problem. I checked generated header code.
If there are no more opinions, I'll first update the generated header file for the runtime (runtime/libs/circle-schema/include/circle_schema_generated.h) based on this schema.
IMO, we can update the schema file with a schema version bump after the 1.29.0 release (https://github.com/Samsung/ONE/issues/13796) is finished.
I've found @jinevening's suggestion just now. I think we need a prefix before Q4_0 (e.g. BLK_Q4_0 or GGML_Q4_0). Without a prefix, it may be mistaken for simple affine quantization.
(ADD) I've found the comment on Q4_0, ... at the top:

// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {

It would be better to move that comment immediately before Q4_0, ...
Still, I personally prefer more specific names over a comment. However, if others are OK with it, I don't oppose.
@jinevening

- Reduce side effects: CircleQuantParam is used in many places, so I'd like to minimize side effects.

What do you mean by CircleQuantParam?
Assuming you mean QuantizationParameters in the circle schema: if something goes wrong because of QuantizationDetails, that means there is a bug. The code should check whether QuantizationDetails is null or not.
// If this is not none, the other quantization parameters (i.e. min, max,
// scale, zero_point fields above) are ignored and the value of the
// QuantizationDetails union should be used.
details:QuantizationDetails;
I think using QuantizationDetails is fine. But as @jinevening suggested, if it only has a name, we don't need to extend it. I agree to add only a TensorType.
I think we agree to use a new TensorType for ggml block quantization, so I'll update the runtime's generated header file as the next step.
We can still change the type name until the circle schema version is bumped; as long as we don't change the enum's actual value, there is no implementation issue, because the flatbuffers file does not store the enum's name string. It will probably even be OK to change the enum name after release, because the name is used only for printing.
What do you mean by CircleQuantParam?

It's an existing C++ class in luci. It has been used for affine quantization only.
Schema is updated.
What?
Let's support a block quantization data type in the circle format to support LLM models.
Why?
To support LLM models, we need to support small-size weight quantization with small precision loss, so we need to introduce chunk quantization such as ggml's (llama.cpp's) quantization types.
To represent this, we need to expand the circle schema's QuantizationParameters table and/or QuantizationDetails union.
Related issue: #13742