ClickHouse / clickhouse-connect

Python driver/sqlalchemy/superset connectors
Apache License 2.0

Insert into JSON column does not work with `async_insert` setting enabled #409

Open denisov-vlad opened 1 week ago

denisov-vlad commented 1 week ago

Hello!

I'm running into an insert issue with a JSON column. Could you clarify how to insert into it correctly, please?

Here is the error string from ClickHouse:

2024.10.08 09:33:48.759669 [ 756 ] {722d0aba-b937-4826-a7e4-191b66cbddca} <Error> AsynchronousInsertQueue: Failed parsing for query 'INSERT INTO snowplow.json (app_id, page, event_type, session_id, device_id, time, page_data) FORMAT Native' with query id 76ecfe6e-00c6-47d9-be7f-96e936a73b07. DB::Exception: Cannot convert: String to JSON

I have this table:

CREATE TABLE snowplow.json
(
    `app_id` LowCardinality(String),
    `page` String DEFAULT '',
    `event_type` Enum8('pv' = 1, 'pp' = 2, 'ue' = 3, 'se' = 4, 'tr' = 5, 'ti' = 6, 's' = 7),
    `session_id` UUID,
    `device_id` UUID,
    `time` DateTime64(3, 'UTC'),
    `page_data` JSON
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(time), event_type)
ORDER BY (app_id, toDate(time), event_type, page, device_id, cityHash64(device_id), session_id, time)
SAMPLE BY cityHash64(device_id)
SETTINGS index_granularity = 8192

I'm trying to insert data into it:

data =  ['example', 'http://127.0.0.1:8000/demo/', 'pp', '94e0566f-5a03-4179-93c9-3acf28af95c6', 'b53c099a-45aa-4006-ae35-5716bcbecddc', datetime.datetime(2024, 10, 8, 9, 33, 48, 751039), {'id': 'qwe123', 'type': 'main', 'section': None}]
names =  ['app_id', 'page', 'event_type', 'session_id', 'device_id', 'time', 'page_data']
types =  [<clickhouse_connect.datatypes.string.String object at 0x7f5f3271f340>, <clickhouse_connect.datatypes.string.String object at 0x7f5f361d21c0>, <clickhouse_connect.datatypes.numeric.Enum8 object at 0x7f5f326be7b0>, <clickhouse_connect.datatypes.special.UUID object at 0x7f5f3271f160>, <clickhouse_connect.datatypes.special.UUID object at 0x7f5f3271f160>, <clickhouse_connect.datatypes.temporal.DateTime64 object at 0x7f5f32647150>, <clickhouse_connect.datatypes.dynamic.JSON object at 0x7f5f3271f2f0>]

conn.insert(
    table,
    data=[data],
    column_names=names,
    column_types=types,
)

The JSON column type is defined as:

from clickhouse_connect.datatypes.dynamic import JSON
from clickhouse_connect.datatypes.base import TypeDef

JSON(type_def=TypeDef())
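(Editor's note, an assumption about the client API worth checking against the clickhouse-connect docs: when `column_types` is omitted, the client looks up the column types from the table definition itself, so the type objects above don't have to be built by hand. A minimal sketch; the helper name is illustrative, not part of the library:)

```python
# Sketch: rely on clickhouse-connect's schema lookup instead of passing
# column_types explicitly. Assumes `conn` is a connected Client and the
# snowplow.json table from this thread exists.
def insert_rows(conn, rows):
    names = ['app_id', 'page', 'event_type', 'session_id',
             'device_id', 'time', 'page_data']
    # Without column_types, the client queries the table definition
    # and derives the type for each named column.
    return conn.insert('snowplow.json', data=rows, column_names=names)
```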
genzgd commented 1 week ago

That example works for me. What ClickHouse and clickhouse_connect versions are you using?

denisov-vlad commented 1 week ago

@genzgd

clickhouse-connect==0.8.3
clickhouse/clickhouse-server:24.9.2.42-alpine

genzgd commented 1 week ago

Here's my exact code, on ClickHouse version 24.9.2.42 (latest docker image):

import datetime

from clickhouse_connect import get_client
from clickhouse_connect.datatypes.registry import get_from_name

create_table = '''
CREATE TABLE IF NOT EXISTS snowplow_json
(
    `app_id` LowCardinality(String),
    `page` String DEFAULT '',
    `event_type` Enum8('pv' = 1, 'pp' = 2, 'ue' = 3, 'se' = 4, 'tr' = 5, 'ti' = 6, 's' = 7),
    `session_id` UUID,
    `device_id` UUID,
    `time` DateTime64(3, 'UTC'),
    `page_data` JSON
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(time), event_type)
ORDER BY (app_id, toDate(time), event_type, page, device_id, cityHash64(device_id), session_id, time)
SAMPLE BY cityHash64(device_id)
SETTINGS index_granularity = 8192'''

client = get_client(settings={'allow_experimental_json_type': 1})
client.command(create_table)

data = ['example', 'http://127.0.0.1:8000/demo/', 'pp', '94e0566f-5a03-4179-93c9-3acf28af95c6',
        'b53c099a-45aa-4006-ae35-5716bcbecddc', datetime.datetime(2024, 10, 8, 9, 33, 48, 751039),
        {'id': 'qwe123', 'type': 'main', 'section': None}]
names = ['app_id', 'page', 'event_type', 'session_id', 'device_id', 'time', 'page_data']
types = [get_from_name('String'),
         get_from_name('String'),
         get_from_name("Enum8('pv' = 1, 'pp' = 2, 'ue' = 3, 'se' = 4, 'tr' = 5, 'ti' = 6, 's' = 7)"),
         get_from_name('UUID'),
         get_from_name('UUID'),
         get_from_name("DateTime64(3, 'UTC')"),
         get_from_name('JSON')]
client.insert(
    'snowplow_json',
    data=[data],
    column_names=names,
    column_types=types,
)

SELECT *
FROM snowplow_json
FORMAT Vertical

Query id: c9389b42-5fb9-4101-9304-c6d91395449e

Row 1:
──────
app_id:     example
page:       http://127.0.0.1:8000/demo/
event_type: pp
session_id: 94e0566f-5a03-4179-93c9-3acf28af95c6
device_id:  b53c099a-45aa-4006-ae35-5716bcbecddc
time:       2024-10-08 15:33:48.751
page_data:  {"id":"qwe123","type":"main"}

genzgd commented 1 week ago

What OS is your alpine container running on? I can't think of any other differences.

denisov-vlad commented 1 week ago

@genzgd I've found the cause. The insert fails with the async_insert=1 setting enabled and works without it.
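(Editor's note: until the server-side bug is fixed, a practical workaround is to force synchronous inserts for tables with JSON columns. `Client.insert` accepts a per-call `settings` dict, so `async_insert` can stay enabled globally and be overridden only where needed. A sketch; the helper name is mine, not part of the library:)

```python
def insert_json_sync(client, table, rows, names, types):
    """Insert rows with async_insert disabled for this call only,
    sidestepping the 'Cannot convert: String to JSON' failure seen
    when the server parses Native-format async inserts."""
    return client.insert(
        table,
        data=rows,
        column_names=names,
        column_types=types,
        settings={'async_insert': 0},  # per-query override
    )
```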

genzgd commented 1 week ago

Wow, thanks. That's a major problem. I'll open an issue in the main ClickHouse project.

genzgd commented 1 week ago

New ClickHouse server issue here: https://github.com/ClickHouse/ClickHouse/issues/70490