ClickHouse / clickhouse-connect

Python driver/sqlalchemy/superset connectors
Apache License 2.0

Concurrent queries being limited #407

Closed: gabrielmcg44 closed this issue 1 month ago

gabrielmcg44 commented 1 month ago

Hey folks, not sure if this is an issue or if there is a configuration I should be using.

I was testing concurrent queries with clickhouse-connect by running multiple queries that each sleep for 3 seconds.

I ran 100 of these in parallel, so I was expecting a total time close to 3 seconds. But after running the code I noticed that while the queries all start together, they are executed in blocks of 12 for some reason, and the whole run takes around 30 seconds.

I used to use another client, and when I ran the same test with it, all of the queries seemed to run in parallel.

Here is the test I am running:

import asyncio
import textwrap
import time

from clickhouse_connect import get_async_client

async def run_sleep_query(ch_client, sleep_time, id):
    print(f"starting query {id} at {time.time() % 1000}")

    statement = textwrap.dedent(
        f"""
        SELECT sleep({sleep_time}) as sleep
        """
    ).strip()

    result = await ch_client.query(statement)
    print(f"Query {id} finished at {time.time() % 1000}")
    return result

async def main():

    sleep_times = []
    for i in range(100):
        sleep_times.append((i, 3))

    # ClickHouseConnectConnection is an application-specific helper that
    # loads connection settings from environment variables
    connection = ClickHouseConnectConnection.from_environment()

    ch_client = await get_async_client(
        host=connection.host,
        user=connection.username,
        password=connection.password,
        port=connection.port_http,
        database=connection.database,
        session_id=connection.session_id,
        connect_timeout=5,
    )
    start = time.time()
    # Run queries in parallel
    results = await asyncio.gather(*[run_sleep_query(ch_client, sleep_time, id) for id, sleep_time in sleep_times])

    # Print results
    for i, sleep_result in enumerate(results):
        print(f"Query {i+1} finished with sleep result {sleep_result}")

    end = time.time()
    print(f"Total time: {end - start:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())

I also tried to work around this by creating a class that holds multiple clients and randomly choosing one of them for each query, but it made no difference at all.

genzgd commented 1 month ago

Under the covers, the AsyncClient uses a ThreadPoolExecutor to actually run the async requests. That means each client query in fact runs in a separate thread. In cases where the thread is mostly waiting around for I/O (like your example where each thread will sleep for 3 seconds), this is just fine.

However, in cases where the thread is doing real CPU work, such as preparing data for insert, converting incoming data from Native format to Python objects, or processing the data in the application after it is retrieved from ClickHouse, having many threads doing heavy CPU operations could in theory bring the whole machine/container to a crawl. (This is unlikely to happen unless you are running multiple processes, since Python is essentially restricted to a single thread by the famous GIL, the global interpreter lock.)

Accordingly, there is always a maximum number of threads in the default ThreadPoolExecutor. That number is set to the number of detected cores on your machine + 4 (capped at 32). So if you have an 8-core machine, Python will only assign async work to 12 threads at a time. One of those threads must free up before Python will assign another thread to one of your queries.
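For reference, that sizing rule can be checked directly against the standard library. Since Python 3.8, `ThreadPoolExecutor` defaults `max_workers` to `min(32, os.cpu_count() + 4)`:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Since Python 3.8, the default max_workers is min(32, cpu_count + 4):
# "cores + 4" threads, capped at 32 on machines with many cores.
expected = min(32, (os.cpu_count() or 1) + 4)

pool = ThreadPoolExecutor()  # no max_workers given, so the default applies
print(f"cpu_count={os.cpu_count()}, default max_workers={pool._max_workers}")
pool.shutdown()
```

On an 8-core machine this prints a default of 12, which matches the blocks of 12 queries observed above.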

In the next patch release we'll add an option to make the number of worker threads configurable, but in general it's not a great idea to increase it to a very large number, since that limit is what keeps a clickhouse-connect application from starving the machine. However, in a use case where clickhouse-connect is doing very little real work (just orchestrating long-running ClickHouse queries, for example), it will be safe to increase that number once the option is available.
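Until that option ships, one possible workaround, assuming the client submits its blocking work through the event loop's default executor (which matches the shared-executor behavior described above), is to install a larger default executor on the loop before issuing queries. The sketch below simulates 20 blocking "queries" with `time.sleep` so it needs no ClickHouse server:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_query(i):
    # Stand-in for a query that mostly waits on I/O
    time.sleep(0.2)
    return i

async def main():
    loop = asyncio.get_running_loop()
    # Replace the loop's default executor (normally cores + 4 threads,
    # capped at 32) with one large enough to run every task at once.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=20))

    start = time.time()
    results = await asyncio.gather(
        *(loop.run_in_executor(None, blocking_query, i) for i in range(20))
    )
    elapsed = time.time() - start
    # With 20 workers all tasks overlap, so this is close to 0.2s
    print(f"{len(results)} tasks in {elapsed:.2f}s")

asyncio.run(main())
```

Whether this helps a real clickhouse-connect workload depends on whether the AsyncClient routes its work through the loop's default executor; if it builds its own pool internally, only the upcoming option will change the limit.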

gabrielmcg44 commented 1 month ago

This makes perfect sense. Is this thread pool shared between multiple clients? I'm trying to understand why creating multiple clients and randomly assigning tasks to them did not help.

genzgd commented 1 month ago

Yes, the current code uses the shared default Executor for all AsyncClients. In the next release each client will have its own executor. That's slightly risky, as then multiple clients could end up with the same thread starvation issue, but I think it's reasonable to let the user tune each client separately.
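The difference between one shared pool and genuinely separate pools can be seen with plain `concurrent.futures`, without clickhouse-connect at all. In the sketch below, any number of "clients" funneling into one shared 4-worker pool are still bound by its limit, while two separate 4-worker pools let the same tasks fully overlap:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fake_query(_):
    # Stand-in for a long-running query
    time.sleep(0.2)

# One shared pool with 4 workers: 8 tasks run in two waves of 4,
# taking roughly 0.4s no matter how many "clients" submit to it.
shared = ThreadPoolExecutor(max_workers=4)
start = time.time()
wait([shared.submit(fake_query, i) for i in range(8)])
print(f"shared pool:    {time.time() - start:.2f}s")

# Two separate pools with 4 workers each: the same 8 tasks split
# between them overlap fully and finish in roughly 0.2s.
pool_a = ThreadPoolExecutor(max_workers=4)
pool_b = ThreadPoolExecutor(max_workers=4)
start = time.time()
wait([(pool_a if i % 2 else pool_b).submit(fake_query, i) for i in range(8)])
print(f"separate pools: {time.time() - start:.2f}s")
```

This mirrors why the multi-client workaround had no effect in the current release (all clients share one default executor) and why per-client executors in the next release will change that.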

gabrielmcg44 commented 1 month ago

cool! thanks for the clarifications!