Samsung / ONE

On-device Neural Engine

[nnpackage] Define block quantization type on circle format #13743

Closed hseok-oh closed 2 months ago

hseok-oh commented 3 months ago

What?

Let's support a block quantization data type in the circle format to support LLM models.

Why?

To support LLM models, we need to support small-size weight quantization with small precision loss. So we need to introduce block quantization such as ggml's (llama.cpp's) quantization types.

To represent this, we need to expand the circle schema's QuantizationParameters table and/or QuantizationDetails union.

Related issue: #13742

hseok-oh commented 3 months ago

Below is a schema draft to represent ggml's quantization type (block quantization). Please give your opinion about this. @seanshpark @mhs4670go @jinevening @chunseoklee @glistening @jyoungyun @ragmani

https://github.com/Samsung/ONE/pull/13758/files#diff-8b2942eef0fd7474ef49ec5245f9d288bd6d62c94ef20689c24edf07ce77c095

// Block quantization: from ggml quantization (https://github.com/ggerganov/ggml)
table CircleBlockQuantization {
  name:string;
}

// Represents a specific quantization technique's parameters.
union QuantizationDetails {
  CustomQuantization,
  CircleBlockQuantization
}

// Parameters for converting a quantized tensor back to float.
table QuantizationParameters {
  // These four parameters are the asymmetric linear quantization parameters.
  // Given a quantized value q, the corresponding float value f should be:
  //   f = scale * (q - zero_point)
  // For other quantization types, the QuantizationDetails below is used.
  // NOTE min/max values are valid if
  // 1. length of min/max == 0 or
  // 2. length of min/max == length of scale/zero_point
  // Otherwise, min/max are not valid (undefined behavior).
  min:[float];
  max:[float];
  scale:[float];  // For dequantizing the tensor's values.
  zero_point:[long];
  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;
  // Specifies the dimension of the Tensor's shape that the scales and
  // zero_points correspond to. For example, a tensor t, with dims=[4, 3, 2, 1]
  // with quantization params:
  //   scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantization_dimension=1
  // will be quantized across the second dimension of t.
  //   t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
  //   t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
  //   t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
  quantized_dimension:int;
}
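As a concrete reading of the quantized_dimension comment, here is a minimal sketch (illustrative names, not ONE API) that dequantizes a small [2, 3] tensor along quantized_dimension = 1, so column c uses scale[c] and zero_point[c]:

```c
#include <assert.h>
#include <stdint.h>

// f = scale * (q - zero_point), applied per channel along axis 1.
// For dims [2, 3] with quantized_dimension = 1, element q[r][c] uses
// scale[c] and zero_point[c].
static void dequantize_axis1(const int8_t q[2][3], const float scale[3],
                             const int64_t zero_point[3], float out[2][3]) {
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 3; ++c)
            out[r][c] = scale[c] * (float)(q[r][c] - zero_point[c]);
}
```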

It introduces a new QuantizationDetails union table, CircleBlockQuantization, for the details field. The details field has never been used before, so this will be its first usage. CircleBlockQuantization has a name field that holds ggml's quantization type name (e.g. Q4_0, Q4_1, Q8_0, etc.). If the details field has any value, the other QuantizationParameters fields are not used to decide the quantization type. Quantization parameters such as scales are stored in the buffer together with the quantized values, so there is no field to save the deltas (scales) for each block; this is the same policy as ggml's quantization/dequantization.

Below is the Q4_0 block structure in the buffer. https://github.com/ggerganov/ggml/blob/2438d62cb9290b5b5dc6228dec76fe81cf64238e/src/ggml-common.h#L144-L149

#define QK4_0 32
typedef struct {
    ggml_half d;           // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2, "wrong q4_0 block size/padding");

(ggml_half: fp16)
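For reference, here is a self-contained sketch of how such a block decodes, following ggml's reference dequantization: low nibbles hold the first 16 values, high nibbles the last 16, and each nibble q maps to the signed value (q - 8). It uses a float delta instead of ggml_half so the sketch needs no fp16 support; the struct name is made up:

```c
#include <assert.h>
#include <stdint.h>

#define QK4_0 32

// Same layout as block_q4_0 above, but with a float delta instead of
// ggml_half so the sketch needs no fp16 support.
typedef struct {
    float d;               // delta (scale)
    uint8_t qs[QK4_0 / 2]; // 32 4-bit quants packed two per byte
} block_q4_0_f32;

// Decode one block into 32 floats: f = d * (q - 8), where q is the stored
// nibble in [0, 15]. Low nibbles give elements 0..15, high nibbles 16..31.
static void dequantize_q4_0(const block_q4_0_f32 *b, float *out) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int x0 = (b->qs[j] & 0x0F) - 8; // low nibble
        const int x1 = (b->qs[j] >> 4) - 8;   // high nibble
        out[j] = b->d * (float)x0;
        out[j + QK4_0 / 2] = b->d * (float)x1;
    }
}
```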


Addition: @glistening 's comment https://github.com/Samsung/ONE/pull/13693#discussion_r1732497335

Is the prefix Circle necessary to avoid name conflicts in flatbuffers-generated files? I guess GGMLBlockQuantization may be better, as @jinevening suggested offline. It makes it clear what CircleBlockQuantization means.

jinevening commented 3 months ago

How about adding a new dtype (Q4_0, etc.) rather than extending CircleQuantParam? If the parameters are saved with the weights, we may not need an additional data structure for the qparam.

Why?

  1. Easy interpretation: We can identify the new quantized tensors simply by their dtype (no need to look at the quantparam). Also, it is a bit difficult to know that Q4_0 is U4 (not S4) and Q8_0 is S8 (not U8).
  2. Better SW design: CircleQuantParam will have a single responsibility: it is only used for affine quantization.
  3. Reduce side effect: CircleQuantParam is used in many places, so I'd like to minimize side effects.

hseok-oh commented 3 months ago

@jinevening I've updated the circle schema based on your comment:

enum TensorType : byte {
  UINT4 = -1,
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
  Q4_0 = 18,
  Q4_1 = 19,
  Q8_0 = 20,
  Q8_1 = 21,
}

There is no issue with using this spec in the runtime. @seanshpark @mhs4670go Is it OK to use this type in the compiler?
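As a side note on "parameters are saved with weights": with the scales inline in the buffer, a tensor's byte size follows directly from the dtype and element count. A hypothetical size helper (not ONE code); the 18 bytes per block are 2 bytes of fp16 delta plus 16 bytes of packed nibbles:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not ONE code: byte size of a Q4_0-quantized buffer.
// Each 32-element block stores a 2-byte fp16 delta plus 32 4-bit quants
// (16 bytes), i.e. 18 bytes per block, or 4.5 bits per weight.
#define QK4_0 32
#define Q4_0_BLOCK_BYTES (2 + QK4_0 / 2)

static size_t q4_0_buffer_size(size_t n_elements) {
    // ggml requires n_elements to be a multiple of the block size
    return (n_elements / QK4_0) * Q4_0_BLOCK_BYTES;
}
```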

seanshpark commented 3 months ago

UINT4 = -1 was added because it does not exist in tflite; so, do the new Qx_y types exist in tflite?

hseok-oh commented 3 months ago

UINT4 = -1 was added because it does not exist in tflite; so, do the new Qx_y types exist in tflite?

No. I'll update them to use negative values.

hseok-oh commented 3 months ago

Updated

// The type of data stored in a tensor.
// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {
  FLOAT32 = 0,
  FLOAT16 = 1,
  INT32 = 2,
  UINT8 = 3,
  INT64 = 4,
  STRING = 5,
  BOOL = 6,
  INT16 = 7,
  COMPLEX64 = 8,
  INT8 = 9,
  FLOAT64 = 10,
  COMPLEX128 = 11,
  UINT64 = 12,
  // Experimental: Resource and variant types are experimental, that are subject
  // to change. Do not implement custom kernels using resource & variant types
  // now.
  RESOURCE = 13,
  VARIANT = 14,
  UINT32 = 15,
  UINT16 = 16,
  INT4 = 17,
  // The types below use negative values because they do not exist in the TensorFlow Lite schema
  UINT4 = -1,
  Q4_0 = -2,
  Q4_1 = -3,
  Q8_0 = -4,
  Q8_1 = -5,
}
seanshpark commented 3 months ago

Negative-value items are placed at the back.. does the generated header code have any problem?

hseok-oh commented 3 months ago

Negative-value items are placed at the back.. does the generated header code have any problem?

No problem. I checked the generated header code.

hseok-oh commented 3 months ago

If there is no more opinion, I'll update the generated header file for the runtime first (runtime/libs/circle-schema/include/circle_schema_generated.h) based on this schema. IMO, we can update the schema file with a schema version bump after the 1.29.0 release (https://github.com/Samsung/ONE/issues/13796) is finished.

glistening commented 3 months ago

I've just found @jinevening's suggestion. I think we need a prefix before Q4_0 (e.g. BLK_Q4_0 or GGML_Q4_0). Without a prefix, it may be mistaken for simple affine quantization.

(ADD)

I've found the comment on Q4_0, ... at the top.

// Q4_0, Q4_1, Q8_0, Q8_1 follow the ggml quantization spec (https://github.com/ggerganov/ggml)
enum TensorType : byte {

It would be better to move the comment immediately before Q4_0, .... Still, I personally prefer more specific names instead of a comment.

However, if others are OK with it, I don't oppose.

glistening commented 3 months ago

@jinevening

  3. Reduce side effect: CircleQuantParam is used in many places, so I'd like to minimize side effects.

What do you mean by CircleQuantParam?

Assuming you mean QuantizationParameters in the circle schema: if something goes wrong because of QuantizationDetails, that means there is a bug. The code should check whether QuantizationDetails is null or not.

  // If this is not none, the other quantization parameters (i.e. min, max,
  // scale, zero_point fields above) are ignored and the value of the
  // QuantizationDetails union should be used.
  details:QuantizationDetails;

I think using QuantizationDetails has no problem. But as @jinevening suggested, if it only carries a name, we don't need to extend the schema. I agree to add a TensorType only.

hseok-oh commented 3 months ago

I think we agree to use a new TensorType for ggml block quantization, so I'll update the runtime's generated header file as the next step. We can still change the type names until the circle schema version bump; this causes no implementation issue as long as we don't change the enums' actual values, because a flatbuffers file does not store the enum's name string.

And it will probably be OK to change the enum names even after the release, because the names are used for printing only.

jinevening commented 3 months ago

What do you mean by CircleQuantParam?

It's about an existing C++ class in luci. It has been used for affine quantization only.

hseok-oh commented 2 months ago

Schema is updated.