yuzhichang opened this issue 2 years ago
Are your JSON objects consistent, with mostly the same fields? insertEmptyColumn should only be called when a row has a new field that is not already present in the batch.
Each of my JSON objects has 3 fields, but there are 1000 different fields across all JSON objects.
The following code converts a fastjson.Value to a Go map, which is later passed to (*batch).Append.
```go
// val2map converts a fastjson.Value holding a JSON object into a
// map[string]interface{} suitable for (*batch).Append. Only string and
// numeric members are copied; members of other types are skipped.
func val2map(v *fastjson.Value) (m map[string]interface{}) {
	var err error
	var obj *fastjson.Object
	if obj, err = v.Object(); err != nil {
		return
	}
	m = make(map[string]interface{}, obj.Len())
	obj.Visit(func(key []byte, v *fastjson.Value) {
		strKey := string(key)
		switch v.Type() {
		case fastjson.TypeString:
			var vb []byte
			if vb, err = v.StringBytes(); err != nil {
				return
			}
			m[strKey] = string(vb)
		case fastjson.TypeNumber:
			var f float64
			if f, err = v.Float64(); err != nil {
				return
			}
			m[strKey] = f
		}
	})
	return
}
```
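For completeness, here is a minimal sketch of how the resulting map might then be appended with clickhouse-go v2. The driver.Conn, the table name events, and the []*fastjson.Value slice are assumptions for illustration, not from the actual workload:

```go
import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
	"github.com/valyala/fastjson"
)

// Sketch only: appends converted maps to a batch and sends it. The table
// name events and the vals slice are hypothetical; val2map is shown above.
func appendBatch(ctx context.Context, conn driver.Conn, vals []*fastjson.Value) error {
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO events")
	if err != nil {
		return err
	}
	for _, v := range vals {
		if err := batch.Append(val2map(v)); err != nil {
			return err
		}
	}
	return batch.Send()
}
```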
I evaluated this using 1000 Nullable(String) columns here.
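As a concrete (hypothetical) illustration of what such a wide table could look like, the DDL might be generated along these lines; the table name wide_events and column names f0..f999 are made up:

```go
import (
	"fmt"
	"strings"
)

// Sketch only: builds a CREATE TABLE statement with 1000 Nullable(String)
// columns. Table and column names are hypothetical.
func wideTableDDL() string {
	cols := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		cols = append(cols, fmt.Sprintf("f%d Nullable(String)", i))
	}
	return "CREATE TABLE wide_events (" + strings.Join(cols, ", ") +
		") ENGINE = MergeTree ORDER BY tuple()"
}
```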
So the challenge is that each append can require us to update the batch retrospectively. Suppose we add rows 1, 2 and 3 with fields A, B and C; adding row 4 with a new field D requires the previous rows to be updated, since the protocol requires all columns to be the same length. This gets worse as the batch grows (see the sketch below).
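To make that cost concrete, here is an illustration only, not clickhouse-go's actual implementation: a column-oriented batch in which a previously unseen field forces a back-fill of every existing row:

```go
// batchSketch keeps one slice per field; every slice must stay the same
// length as rows are appended (as the native protocol requires).
type batchSketch struct {
	rows    int
	columns map[string][]interface{} // field name -> one value per row
}

func newBatchSketch() *batchSketch {
	return &batchSketch{columns: make(map[string][]interface{})}
}

func (b *batchSketch) appendRow(row map[string]interface{}) {
	for name, value := range row {
		col, ok := b.columns[name]
		if !ok {
			// New field: back-fill a nil for every row already in the batch.
			// This retrospective update grows with the batch size.
			col = make([]interface{}, b.rows)
		}
		b.columns[name] = append(col, value)
	}
	b.rows++
	// Forward-fill: columns the new row did not mention also get a nil,
	// so all columns keep the same length.
	for name, col := range b.columns {
		if len(col) < b.rows {
			b.columns[name] = append(col, nil)
		}
	}
}
```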
You could insert as a string, deferring the work to ClickHouse and not placing the load on the client. This code definitely needs to be optimized, and I suspected non-consistent/non-uniform JSON would be slow. I'll see if we can make this faster, though.
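A rough sketch of the insert-as-string suggestion, reusing the imports and driver.Conn from the earlier sketch and assuming a hypothetical table events_raw with a single String column, with parsing left to the server:

```go
// Sketch only: append each document as raw JSON text into events_raw (doc String).
func appendAsString(ctx context.Context, conn driver.Conn, vals []*fastjson.Value) error {
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO events_raw")
	if err != nil {
		return err
	}
	for _, v := range vals {
		// MarshalTo serializes the fastjson.Value back to JSON bytes.
		if err := batch.Append(string(v.MarshalTo(nil))); err != nil {
			return err
		}
	}
	return batch.Send()
}
```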
The way to do this is probably to enhance the column format so it can store back-fill and forward-fill positions; we could then use these during flush and encoding to save the work.
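A speculative sketch of that idea (not an existing clickhouse-go type): record where each column's values start and materialize the NULL padding once at flush time rather than on every Append:

```go
// sparseColumn defers padding: rows before firstRow are implicitly NULL.
type sparseColumn struct {
	firstRow int           // rows [0, firstRow) are back-filled NULLs
	values   []interface{} // explicit values from firstRow onward; nil = NULL
}

// materialize expands the column to totalRows entries at flush/encode time,
// so the O(rows) padding work happens once per column instead of per Append.
func (c *sparseColumn) materialize(totalRows int) []interface{} {
	out := make([]interface{}, 0, totalRows)
	for i := 0; i < c.firstRow; i++ {
		out = append(out, nil) // back-fill
	}
	out = append(out, c.values...)
	for len(out) < totalRows {
		out = append(out, nil) // forward-fill trailing rows without this field
	}
	return out
}
```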
Inserting a batch of 16384 records takes 48.275 seconds on average. CPU usage on the clickhouse-server side is negligible, while the client side sits at 100%. Its pprof profile is attached.