A database client should be based on asyncio

elrik75 commented 1 year ago

Major DBs have their asyncio client nowdays:

postgresql: asyncpg
redis: aioredis
kafka: aiokafka
etc

It's logical for any I/O access for python to be async. Note that there is an unmaintained async client for clickhouse: https://github.com/maximdanilchenko/aiochclient Having an official async client for clickhouse with cython and full-featured will be great !

genzgd commented 1 year ago

Agreed, this should have had its own enhancement request/issue before now.

pkit commented 1 year ago

@elrik75 No, it shouldn't yet. asyncio is so brain-dead right now that it would need a couple of years and a lot of changes to become usable.

vsbaldeev commented 1 year ago

Hello everyone! As I understand official aio client for python is not developed yet. So, which programming language and library should I use to make asynchronous requests to clickhouse ?

genzgd commented 1 year ago

The await/async keywords are part of the Python standard asyncio library and work fine with clickhouse-connect. The main caveat is that ClickHouse doesn't allow concurrent queries in the same session, so you should disable the autogenerate_session_id common setting before calling get_client. See the explanation here: https://clickhouse.com/docs/en/integrations/python#managing-clickhouse-session-ids

genzgd commented 1 year ago

To clarify, I see this issue as an enhancement request to either create a clean async wrapper around the library that feels more "async native" or even better, use a true async HTTP client. The urllib3 library doesn't use any async io so in that sense clickhouse-connect doesn't really do things in a real event driven way. But you can still use clickhouse-connect with async code.

araa47 commented 1 year ago

+1 , feel like this is a required feature for a modern python library

chrisprobst commented 7 months ago

@elrik75 No, it shouldn't yet. asyncio is so brain-dead right now that it would need a couple of years and a lot of changes to become usable.

Mind elaborating what you mean with brain-dead? It's very usable and used everywhere. What aspect of it is brain-dead and better than thread-blocking urllib3 (which means much more brain-dead to me)?

zerafachris commented 6 months ago

Hi @elrik75 @araa47 @genzgd @vsbaldeev , there seems to be a package which wraps on the old clickhouse driver here https://github.com/long2ice/asynch . Maybe this could help?

pkit commented 5 months ago

Mind elaborating what you mean with brain-dead? It's very usable and used everywhere. What aspect of it is brain-dead and better than thread-blocking urllib3 (which means much more brain-dead to me)?

See here: https://stackoverflow.com/questions/46827007/runtimeerror-this-event-loop-is-already-running-in-python

Essentially brain-dead is the "all or nothing" implementation. No project can be converted to async ever without a complete rewrite of that project and all of the libraries it uses. And the libraries that use these libraries. No nested tasks are allowed, no gradual I/O conversion, full rewrite only.

elrik75 commented 5 months ago

Essentially brain-dead is the "all or nothing" implementation.

That's correct. The point is more: what kind of DB client should CH support then? sync or async? What kind makes more sense? My point of view (because I work on backend servers) is that a sync IO lib is not very helpful.

genzgd commented 5 months ago

@alexted We in the process of defining the roadmap for clickhouse-connect and other integrations, but for the moment this work is not planned for the immediate future. The current thinking is that true async support would involve swapping out to the httpx HTTP client with integrated async support (instead of the venerable urllib3), but that is of course a fairly major change.

To be clear, this is definitely work we want to do, but it may take a while to come to the top of the priority list given all of the work on other projects. As always, community contributions to help us get moving in the right direction are always appreciated. :)

samuelcolvin commented 5 months ago

In case anyone is struggling with this, if the postgres connection port is open, you can connect to clickhouse with psycopg3:

    conn = await psycopg.AsyncConnection.connect(
        dbname='default',
        user='default',
        password='...',
        host='localhost',
        port='9005',
        cursor_factory=psycopg.AsyncClientCursor,
        autocommit=True,
    )

You need cursor_factory=psycopg.AsyncClientCursor, autocommit=True to get psycopg to use the simple postgres protocol since clickhouse doesn't support the extended protocol, for the same reason you can't use asyncpg.

All data types seem to be returned as strings, but it's better than nothing (and pydantic does a pretty good job of coercing data to an expected type)

genzgd commented 5 months ago

I've added an example of how to run clickhouse-connect queries asynchronously (it includes a semaphore just for fun as well). Based on very limited testing, this solution still significantly outperforms other ClickHouse "async" libraries. You're still running on a single Python thread at a time, but the HTTP requests are obviously very much I/O bound and the Cython transform code is extremely fast, so the GIL isn't as big a problem as you might think.

If you do experiment with similar solutions, please report your results (good or bad) here.

slvrtrn commented 4 months ago

Since 0.7.16, ClickHouse-Connect provides a convenience AsyncClient wrapper over the standard Client, so that it is no longer required to write your own. See this new entry in the docs. The async usage example was also updated.

samuelcolvin commented 4 months ago

Since 0.7.16, ClickHouse-Connect provides a convenience AsyncClient wrapper over the standard Client, so that it is no longer required to write your own. See this new entry in the docs. The async usage example was also updated.

This is very unfortunate. As I think I said to @tbragin, pretending to have any async client by just running the sync code in a thread pool is an own goal for ClickHouse. It'll often be slower than just using the sync client. You'd be better off either recommending loop.run_in_executor or the HTTP API and httpx.

https://github.com/ClickHouse/clickhouse-connect/blob/ee3af795dcbfdc372facd944cf2212afb962ef16/clickhouse_connect/driver/asyncclient.py#L79-L88

To actually solve this properly, you need to either:

implement a real async client from the ground up
or (preferable) support Arrow Flight SQL in ClickHouse, and then rely on a real high quality Arrow Flight SQL client library in Python, there's already a Rust client for Arrow Flight SQL, so building a Python one should be pretty easy

mshustov commented 4 months ago

@samuelcolvin, As you can see, the issue is not closed. We will support asyncio natively, however the implementation will take time. AsyncClient wrapper is created to close the gap already today

genzgd commented 4 months ago

@samuelcolvin Putting aside the Arrow Flight question for the moment, I'm not sure what problem you are trying to "solve". Queries run using the wrapper yield the asyncio event loop when waiting for the HTTP I/O. So your main thread will spend its time in the CPU bound parts of the query (generally, transforming data). The fact that urllib3 itself doesn't have an async API doesn't change how the asyncio event loop behaves while waiting for network data.

I can assure you that using clickhouse-connect with the async wrapper will outperform any code using a client based on httpx and a text format like JSON or CSV. So if you are getting an async API that yields the main event loop thread while waiting for I/O, and has high performance, what exactly is missing?

ClickHouse / clickhouse-connect

A database client should be based on asyncio #141