ClickHouse / clickhouse-go

Golang driver for ClickHouse
Apache License 2.0

Performance of Inserting to Object('JSON') is bad on variable JSON #756

Open yuzhichang opened 2 years ago

yuzhichang commented 2 years ago
CREATE TABLE default.prom_extend_json ON CLUSTER abc (
    `timestamp` DateTime,
    `value` Float64,
    `__name__` String,
    `labels` JSON
) ENGINE=MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (__name__, timestamp);

CREATE TABLE default.dist_prom_extend_json ON CLUSTER abc AS prom_extend_json ENGINE = Distributed(abc, default, prom_extend_json);

Inserting a batch of 16384 records takes 48.275 seconds on average. CPU usage on the clickhouse-server side is negligible, while the client side is at 100%. The pprof profile is attached (profile001).

gingerwizard commented 2 years ago

Are your JSON objects consistent, i.e. mostly the same fields? insertEmptyColumn should only be called when a row has a new field that isn't already present in the batch.

yuzhichang commented 2 years ago

Each of my JSON objects has 3 fields, but there are 1000 distinct fields across all JSON objects.

The following code converts a fastjson.Value to a Go object, which is passed to (*batch).Append later.

import "github.com/valyala/fastjson"

// val2map converts a fastjson.Value holding a JSON object into a
// map[string]interface{} suitable for (*batch).Append. Only string and
// number fields are copied; a field that fails to convert is skipped
// (the error is recorded but otherwise ignored).
func val2map(v *fastjson.Value) (m map[string]interface{}) {
    var err error
    var obj *fastjson.Object
    if obj, err = v.Object(); err != nil {
        return
    }
    m = make(map[string]interface{}, obj.Len())
    obj.Visit(func(key []byte, v *fastjson.Value) {
        strKey := string(key)
        switch v.Type() {
        case fastjson.TypeString:
            var vb []byte
            if vb, err = v.StringBytes(); err != nil {
                return // skips only this field; Visit continues
            }
            m[strKey] = string(vb)
        case fastjson.TypeNumber:
            var f float64
            if f, err = v.Float64(); err != nil {
                return // skips only this field; Visit continues
            }
            m[strKey] = f
        }
    })
    return
}
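
For context, a minimal sketch of how such a map typically ends up in a batch with the clickhouse-go v2 API — the insert loop, the fastjson field accessors, and the connection handling here are my assumptions for illustration, not the reporter's actual code:

import (
    "context"
    "time"

    "github.com/ClickHouse/clickhouse-go/v2/lib/driver"
    "github.com/valyala/fastjson"
)

func insertBatch(ctx context.Context, conn driver.Conn, rows []*fastjson.Value) error {
    batch, err := conn.PrepareBatch(ctx, "INSERT INTO default.dist_prom_extend_json")
    if err != nil {
        return err
    }
    for _, row := range rows {
        if err := batch.Append(
            // Column order follows the table definition above; the
            // field names on the fastjson side are hypothetical.
            time.Unix(row.GetInt64("timestamp"), 0),
            row.GetFloat64("value"),
            string(row.GetStringBytes("__name__")),
            val2map(row.Get("labels")), // map[string]interface{} for the JSON column
        ); err != nil {
            return err
        }
    }
    return batch.Send()
}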

For comparison, I also evaluated using 1000 Nullable(String) columns here.

gingerwizard commented 2 years ago

So the challenge is that each append can require us to update the batch retrospectively: suppose we add rows 1, 2 and 3 with fields A, B and C. Adding row 4 with a new column D requires the previous rows to be updated, since the protocol requires all columns to be of the same length. This cost grows as the batch gets bigger.
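
To make that concrete, here is a toy model of the eager behaviour (illustrative only, not the actual clickhouse-go column code): each new key costs O(rows-so-far) for the backfill, and each row costs O(known-keys) for the forward fill, so a batch with many distinct keys degrades badly.

// Toy model of eager back/forward filling; not the real driver internals.
type jsonBatch struct {
    numRows int
    columns map[string][]interface{} // key -> one value per row
}

func newJSONBatch() *jsonBatch {
    return &jsonBatch{columns: make(map[string][]interface{})}
}

func (b *jsonBatch) appendRow(row map[string]interface{}) {
    for key, val := range row {
        col, ok := b.columns[key]
        if !ok {
            // New key: backfill one nil per row already in the batch -
            // the retrospective update described above. O(numRows).
            col = make([]interface{}, b.numRows)
        }
        b.columns[key] = append(col, val)
    }
    // Forward fill: keys seen in earlier rows but absent from this one
    // get a nil so every column keeps the same length.
    for key, col := range b.columns {
        if len(col) == b.numRows {
            b.columns[key] = append(col, nil)
        }
    }
    b.numRows++
}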

You could insert as string, deferring the work to ClickHouse and not placing the load on the client. This code definitely needs to be optimized, and I suspected non-uniform JSON would be slow. I'll see if we can do this faster though.
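
One way to act on the insert-as-string suggestion (the schema change is my assumption, not something the driver does for you): declare labels as String, pass the raw JSON bytes through unchanged, and parse on the server at query time with the JSON* functions. A minimal sketch, using the same imports as the sketch above:

func insertAsString(ctx context.Context, conn driver.Conn, rows []*fastjson.Value) error {
    // Assumes labels has been re-declared as String in the table.
    batch, err := conn.PrepareBatch(ctx, "INSERT INTO default.dist_prom_extend_json")
    if err != nil {
        return err
    }
    for _, row := range rows {
        labels := row.Get("labels") // hypothetical accessor, as above
        if labels == nil {
            continue
        }
        if err := batch.Append(
            time.Unix(row.GetInt64("timestamp"), 0),
            row.GetFloat64("value"),
            string(row.GetStringBytes("__name__")),
            labels.String(), // raw JSON text; no per-key column work client-side
        ); err != nil {
            return err
        }
    }
    return batch.Send()
}

Queries then pay the parse cost server-side, e.g. SELECT JSONExtractString(labels, 'pod') ..., which matches the observation above that the server CPU is currently idle while the client is saturated.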

gingerwizard commented 2 years ago

The way to do this is probably to enhance the column format so it can store backfill and forward-fill positions; we could then use these during flush and encoding to avoid the retrospective updates.
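
A rough sketch of what storing fill positions could look like (names illustrative, not the actual column implementation): appendRow only records (row, value) pairs, and the dense column with its nils is materialized once, at flush/encoding time, in a single pass per key.

// Illustrative sketch of deferred filling; not the real driver code.
type sparseColumn struct {
    rows   []int         // row index of each stored value
    values []interface{} // the values themselves
}

type lazyJSONBatch struct {
    numRows int
    columns map[string]*sparseColumn
}

func newLazyJSONBatch() *lazyJSONBatch {
    return &lazyJSONBatch{columns: make(map[string]*sparseColumn)}
}

// appendRow is O(len(row)) regardless of how many rows or keys the
// batch already holds - no retrospective updates.
func (b *lazyJSONBatch) appendRow(row map[string]interface{}) {
    for key, val := range row {
        col := b.columns[key]
        if col == nil {
            col = &sparseColumn{}
            b.columns[key] = col
        }
        col.rows = append(col.rows, b.numRows)
        col.values = append(col.values, val)
    }
    b.numRows++
}

// materialize performs all back and forward filling in one O(numRows)
// pass per column, at flush/encoding time.
func (b *lazyJSONBatch) materialize(col *sparseColumn) []interface{} {
    dense := make([]interface{}, b.numRows) // gaps stay nil
    for i, r := range col.rows {
        dense[r] = col.values[i]
    }
    return dense
}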