ClickHouse / clickhouse-js

Official JS client for ClickHouse DB
https://clickhouse.com
Apache License 2.0

What is the best way to trigger a command and poll until it is done by query_id? #244

Closed: francescorivola closed this issue 4 months ago

francescorivola commented 4 months ago

Hi ClickHouse Js Team,

I am writing to ask for your recommendation on how to achieve the following use case:

This command takes quite a large amount of time (possibly several minutes or more). We do not want to keep the client waiting for the response; instead, we just take the query_id and periodically check system.query_log to see when the query has completed.

A possible solution could be to set a small request_timeout, but that would likely require creating a Client instance for each query run by this process, and I am not sure that is a good approach (although we are doing something similar in production right now with the lib https://github.com/TimonKK/clickhouse). The volume of these operations in production is around 2K commands every hour.

Thank you in advance for your support and guidance.

slvrtrn commented 4 months ago

Can you please share the query?

slvrtrn commented 4 months ago

If it's a simple INSERT INTO query with an array, then maybe you could try async inserts (with disabled waiting on the client side).

const client = createClient({
  clickhouse_settings: {
    async_insert: 1,
    wait_for_async_insert: 0, // <- don't wait for ACK
  },
})

await client.insert({
  table,
  format: 'JSONEachRow', // or other, depending on your data
  values: [{ ... }],
})

francescorivola commented 4 months ago

Hi @slvrtrn , the use case is an INSERT INTO SELECT. We use the query result, joined with other tables, to populate a target table. Do you need more info?

Thank you so much

slvrtrn commented 4 months ago

It looks like suspending an INSERT FROM SELECT statement might not be currently possible; see this issue in the main repo.

I'd be curious to learn more about your use case, because maybe materialized views would do.

francescorivola commented 4 months ago

The use case is quite tricky, probably not the most common one for ClickHouse users, hehehe. Unfortunately, I cannot go into too much detail, I am sorry. In essence, we use ClickHouse to compute a dynamic result set (our customers can define dynamic filters and queries). This result is stored in a ClickHouse table and is then consumed as a stream by our transactional database. We store the result in a ClickHouse table instead of consuming the query result directly so that the process can resume after an application graceful shutdown: the result set is stored in a well-defined order and can be consumed at any time, resuming from the last processed row.

Do you see any potential issue in creating a client and using request_timeout for this process? We can create the client and close it once the query is done, so it should not leak resources. Of course, we would lose the benefit of the keep-alive feature by recreating a client every time.

slvrtrn commented 4 months ago

If materialized views are not an option, then yes, a fairly low request_timeout with a proper timeout error handler might be an option. You'd have to follow up by checking system.processes or system.query_log. The insert will keep running.
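That follow-up check can be sketched as a small polling helper. This is a minimal sketch, not clickhouse-js API: the `checkQueryFinished` callback is hypothetical and stands in for a real `SELECT ... FROM system.query_log WHERE query_id = ...` round trip through the client.

```javascript
// Poll until `checkQueryFinished` reports true, or give up after
// `maxAttempts` tries. The callback is a stand-in for a real
// system.query_log lookup via the client (hypothetical here).
async function pollUntilFinished(checkQueryFinished, intervalMs, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkQueryFinished()) return true
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  return false // the query did not finish within the polling budget
}
```

With a low request_timeout, the flow would be: catch the client-side timeout error, then run something like `pollUntilFinished(...)` until the query_id shows up as finished in system.query_log.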

slvrtrn commented 4 months ago

I think a simpler approach is just aborting the insert HTTP request using AbortController, to see if that also works. No need to recreate the client every time. See this example.
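A rough sketch of that idea follows. With clickhouse-js the signal would be passed via the request params (e.g. an `abort_signal` option), but here the client call is stubbed with a hypothetical `fakeInsert` so the control flow stands alone:

```javascript
const controller = new AbortController()

// Stand-in for the real insert call: a long-running HTTP request that
// rejects when the signal fires (which is what Node's HTTP client does).
function fakeInsert(abort_signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('insert acked'), 60_000)
    abort_signal.addEventListener('abort', () => {
      clearTimeout(timer)
      reject(new Error('aborted'))
    })
  })
}

async function run() {
  const pending = fakeInsert(controller.signal)
  // ...here you would confirm via system.processes that the query is
  // already running server-side before cutting the connection...
  controller.abort() // drops only the client-side HTTP request
  try {
    await pending
    return 'completed'
  } catch (err) {
    return `request ${err.message}, query keeps running server-side`
  }
}
```

The key point (confirmed later in the thread) is that aborting cancels only the outgoing request; the INSERT itself keeps running on the server.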

francescorivola commented 4 months ago

Yes, a materialized view is not an option, unfortunately; we explored that in the past. Right now, this is what we do in production: the library I mentioned above is based on request.js, so we use its timeout option and handle the ESOCKETTIMEDOUT error. Then we poll system.query_log periodically to know when the query is done, and start the streaming phase to consume the result of the query from the staging table.

I tried the AbortController before, but let me try it one more time, as I had some challenges setting up a proper integration test for this. Now I think I have something more solid to test it against. I will let you know the result in a bit. Thank you.

slvrtrn commented 4 months ago

You can wrap the cancellation into a setInterval that checks system.processes or system.query_log with your query_id (which you should generate in advance as a UUID and provide to the insert params) to see if it has landed there properly; if it is there, then both interval and the insert operation are "kinda safe" to cancel.

francescorivola commented 4 months ago

Awesome, I think it is working like a charm.

Here is my proof of concept:

(proof-of-concept screenshots omitted)

The query in the test does not look like a real one (in the real one, the table is already created), but it allows me to exercise the mechanism quite well.

I thought that the AbortController was also aborting the query on the ClickHouse server, preventing the insert from completing properly, but it looks like I was wrong. Can you confirm that abort is never going to cancel the query itself?

Reading the driver docs, it sounds like abort could sometimes cancel the operation, but I guess that is only if the request is aborted before it reaches the server (https://clickhouse.com/docs/en/integrations/language-clients/javascript#insert-streaming-in-nodejs).

slvrtrn commented 4 months ago

FWIW, the signal from the controller is only attached to the Node.js HTTP request; that's it. It should only cancel the outgoing HTTP request. We do not send anything extra to ClickHouse in that case.

Regarding the info note, "non-read-only" queries should not be canceled (like your INSERT FROM SELECT), and "read-only" queries might be canceled if you enable a certain setting. IIRC, it should be this one.

slvrtrn commented 4 months ago

In your code, to overcome potential network delays (and maybe make it a bit faster, if you choose a small poll interval instead of a single large-enough timeout), I'd recommend checking on an interval anyway, clearing the interval handle once the query_id is available in system.query_log. I think it will be more stable than a single setTimeout.

francescorivola commented 4 months ago

That is a very good idea. We can abort once the query_id is in system.query_log.

Thank you very much @slvrtrn , you are helping a lot. We are looking forward to completing the migration to this driver and landing it in production :). I think we are very close now.

Thank you again, I am closing the issue.

slvrtrn commented 4 months ago

Side note: in your tests, you can simulate long-running queries with just the sleep(N) and sleepEachRow(N) functions, where N is how many seconds of delay you need (up to 3 seconds). It will allow your tests to consume fewer resources and not rely on the potentially flaky execution time of the data insertion queries.
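For example, a hypothetical helper that just builds such a test query string (not part of clickhouse-js); sleepEachRow(N) caps N at 3 seconds per row, so the total server-side delay is roughly N * rows:

```javascript
// Build a query that runs for roughly perRowSeconds * rows seconds on the
// server, without inserting any data. sleepEachRow caps its argument at 3.
function slowQuery(perRowSeconds, rows) {
  if (perRowSeconds > 3) throw new Error('sleepEachRow caps N at 3 seconds')
  return `SELECT sleepEachRow(${perRowSeconds}) FROM numbers(${rows})`
}

// e.g. slowQuery(0.5, 5) yields a query that takes about 2.5 seconds
```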

francescorivola commented 4 months ago

Hi @slvrtrn , I was not aware of the sleep function. It is indeed much better than my current test approach (in fact, I was not sure whether to commit this test to the repo at all). Thank you so much.