ClickHouse / clickhouse-js

Official JS client for ClickHouse DB
https://clickhouse.com
Apache License 2.0
189 stars 22 forks source link

Cannot read bytes from String or FixedString(N) #282

Open c0mm4nd opened 1 month ago

c0mm4nd commented 1 month ago

Describe the bug

Directly select a String or FixedString(N) column which stores raw bytes rather than UTF-8 string, the result will contain

e.g. image

the correct value of this should be the hash unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b')

Steps to reproduce

  1. Store raw hash or any other raw bytes in Clickhouse DB
  2. Directly select it through clickhouse-js, formatting as any CSV or JSON
  3. convert the string to bytes
                                    function stringToBytes(str) {
                                      const bytes = [];
                                      for (let i = 0; i < str.length; i++) {
                                        const codeUnit = str.charCodeAt(i);
                                        bytes.push(codeUnit);
                                      }
                                      return bytes;
                                    }
  4. the value is incorrect and ruined by 65533

Expected behaviour

return a correct string or bytes

Code example

as shown in Steps to reproduce

Error log

Configuration

Environment

ClickHouse server

slvrtrn commented 1 month ago

This is because the client's ResultSet text and JSON decoders assume a UTF-8 string inside that field (but it is not a UTF-8 string in this specific case).

For now, a passable workaround could be to just convert it back to bytes:

import { createClient } from '@clickhouse/client' // or '@clickhouse/client-web'

void (async () => {
  const client = createClient()
  const resultSet = await client.query({
    query: `SELECT unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b') AS x`,
    format: 'JSONEachRow',
  })
  const result = await resultSet.json<{ x: string }>()
  console.info('Result:', new TextEncoder().encode(result[0].x))
  await client.close()
})()
Result: Uint8Array(58) [
    0,   0,   0, 112, 239, 191, 189,  80, 239, 191,
  189,  64, 239, 191, 189,  41, 126, 239, 191, 189,
  239, 191, 189,  33, 239, 191, 189, 239, 191, 189,
   99,  29, 239, 191, 189, 239, 191, 189, 239, 191,
  189,  90,  52,  98, 239, 191, 189, 239, 191, 189,
  239, 191, 189,  26, 121, 114, 111, 107
]
slvrtrn commented 1 month ago

Another solution (which looks like a more correct one in this case, cause the previous result set looks a bit off) is just to ask ClickHouse to return a byte array instead so that conversion happens on the server side:

WITH '000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b' AS str
SELECT arrayMap(x -> reinterpretAsUInt8(x), splitByString('', unhex(str))) AS bytes

Result:

   ┌─bytes─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ [0,0,0,112,133,80,243,64,161,41,126,239,231,33,163,176,99,29,141,196,204,90,52,98,171,174,239,26,121,114,111,107] │
   └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

And that will be decoded with JSON formats as just an array of numbers, which can be then used to create a UInt8Array in Node.

@c0mm4nd, can you please check if it works in your scenario? Then I can add an example and close this.

c0mm4nd commented 1 month ago

In the first passable solution, the hash value length is 58 rather than 32, due to the invalid unicode 65533 has already ruined the result. I cannot convert the 65533 back to the correct byte. The second works, but it requires rewriting the SQL, and the bytes-parse work is not efficient when handling tons of hash results. So these two are not what I want.

Python client clickhouse-connect also works on HTTP port, but it doesn't have such a problem.

slvrtrn commented 1 month ago

Another option is to return the fixed string encoded as base64, which will require rewriting the queries a bit.

For example:

import { createClient } from '@clickhouse/client'

void (async () => {
  const client = createClient()
  const resultSet = await client.query({
    query: `SELECT base64Encode(unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b')) AS x`,
    format: 'JSONEachRow',
  })
  const result = await resultSet.json<{ x: string }>()
  const buffer = Buffer.from(result[0].x, 'base64')
  console.info('Base64:', result[0].x)
  console.info('Buffer:', buffer, 'with length:', buffer.length)
  await client.close()
})()
Base64: AAAAcIVQ80ChKX7v5yGjsGMdjcTMWjRiq67vGnlyb2s=
Buffer: <Buffer 00 00 00 70 85 50 f3 40 a1 29 7e ef e7 21 a3 b0 63 1d 8d c4 cc 5a 34 62 ab ae ef 1a 79 72 6f 6b> with length: 32

Python client clickhouse-connect also works on HTTP port, but it doesn't have such a problem.

I'd assume it works there because it uses RowBinary (so that custom type and column mappers can be provided), which has not yet been implemented in this client. See this issue for tracking; I had to put it on hold due to other duties, as this feature is massive.