ClickHouse / clickhouse-js

Official JS client for ClickHouse DB
https://clickhouse.com
Apache License 2.0
205 stars 25 forks source link

Allow to specify a list of columns for `insert` #217

Closed cjroebuck closed 7 months ago

cjroebuck commented 8 months ago

Describe the bug

Not sure if this is a clickhouse-js specific bug or just clickhouse in general.

Whenever I try to insert JSON as a raw String with EPHEMERAL column and then have derived columns using various JSONExtract* functions, I can't seem to get it to work. I'm using the example from the JSON docs here.

Steps to reproduce

Run the following code to setup tables and insert test rows.

import { createClient } from "@clickhouse/client"; 

void (async () => {
  const client = createClient({
    clickhouse_settings: {
      date_time_input_format: "best_effort",
      allow_experimental_object_type: 1,
    },
  });

  await client.command({
    query: `DROP TABLE IF EXISTS github_json_ephemeral`,
  });

  await client.command({
    query: `CREATE table github_json_ephemeral
  (
     event_type LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'type'),
     repo_name LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'repo.name'),
     message JSON DEFAULT message_raw,
     message_raw String EPHEMERAL
  ) ENGINE = MergeTree ORDER BY (event_type, repo_name)`,
  });

  await client.insert({
    table: "github_json_ephemeral",
    format: "JSON",
    values: [
      {
        message_raw: {
          type: "PushEvent",
          created_at: "2022-01-04 07:00:00",
          actor: {
            avatar_url: "https://avatars.githubusercontent.com/u/41898282?",
            display_login: "github-actions",
            gravatar_id: "",
            id: 41898282,
            login: "github-actions[bot]",
            url: "https://api.github.com/users/github-actions[bot]",
          },
          repo: {
            id: 410071248,
            name: "pioug/yield-data",
            url: "https://api.github.com/repos/pioug/yield-data",
          },
        },
      },
    ],
  });
})

Expected behaviour

Output:

rows [
  {
    event_type: 'PushEvent',
    repo_name: 'pioug/yield-data',
    message: {
      actor: [Object],
      created_at: '2022-01-04 07:00:00',
      repo: [Object],
      type: 'PushEvent'
    },
  }

Actual behaviour

The derived columns are blank and the JSON column is empty:

rows [
  { event_type: '', repo_name: '', message: { _dummy: 0 } }
]

Removing the EPHEMERAL directive from the column, I get the following output:

rows [
  {
    event_type: 'PushEvent',
    repo_name: '',
    message: {
      actor: [Object],
      created_at: '2022-01-04 07:00:00',
      repo: [Object],
      type: 'PushEvent'
    },
    message_raw: '{"type":"PushEvent","created_at":"2022-01-04 07:00:00","actor":{"avatar_url":"https://avatars.githubusercontent.com/u/41898282?","display_login":"github-actions","gravatar_id":"","id":41898282,"login":"github-actions[bot]","url":"https://api.github.com/users/github-actions[bot]"},"repo":{"id":410071248,"name":"pioug/yield-data","url":"https://api.github.com/repos/pioug/yield-data"}}'
  }

Notice that repo_name field is empty but event_type has been successfully extracted from the JSON. So it seems the extract from repo.name is also not working. This is a secondary issue.

Configuration

Environment

ClickHouse server

slvrtrn commented 8 months ago

I think this is because the right syntax for an insert operation here will be:

insert into github_json_ephemeral (message_raw) values ('...')

^ see only message_raw in the insert clause, which triggers default values for the rest of the fields (https://clickhouse.com/docs/en/sql-reference/statements/create/table#ephemeral).

We can add an option to restrict the insert columns, as it is useful.

The only option I see as a workaround before we add this feature is to use command with VALUES instead. It's not the optimal solution, as you might encounter request size limits (since the values are not streamed):

import { createClient } from '@clickhouse/client'

void (async () => {
  const table = 'github_json_ephemeral'
  const client = createClient({
    clickhouse_settings: {
      date_time_input_format: 'best_effort',
      allow_experimental_object_type: 1,
    },
  })

  await client.command({
    query: `
      CREATE OR REPLACE TABLE ${table}
      (
        event_type  LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'type'),
        repo_name   LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'repo', 'name'),
        message     JSON                   DEFAULT message_raw,
        message_raw String                 EPHEMERAL
      ) ENGINE = MergeTree ORDER BY (event_type, repo_name)`,
  })

  const messagesRaw = [
    {
      type: 'MyEventType',
      repo: {
        name: 'foo',
      },
    },
    {
      type: 'SomeOtherType',
      repo: {
        name: 'bar',
      },
    },
  ]

  // Generates ('val1'),('val2'), ...
  const messagesRawValues = messagesRaw
    .map((msg) => {
      return `('${JSON.stringify(msg)}')`
    })
    .join(',')

  const insertQuery = `INSERT INTO ${table} (message_raw) VALUES ${messagesRawValues}`
  await client.command({
    query: insertQuery,
  })

  const result = await client.query({
    query: `SELECT * FROM ${table}`,
    format: 'JSONCompactEachRow',
  })
  console.log('Result:', JSON.stringify(await result.json()))
})()

Prints:

Result: [["MyEventType","foo",{"repo":{"name":"foo"},"type":"MyEventType"}],["SomeOtherType","bar",{"repo":{"name":"bar"},"type":"SomeOtherType"}]]

NB: see JSONExtractString(message_raw, 'repo', 'name') - it should correctly extract the repo name.

slvrtrn commented 7 months ago

@cjroebuck, added in 0.2.8. See the release notes.