Open elrik75 opened 1 year ago
Agreed, this should have had its own enhancement request/issue before now.
@elrik75 No, it shouldn't yet. asyncio is so brain-dead right now that it would need a couple of years and a lot of changes to become usable.
Hello everyone! As I understand it, an official async client for Python has not been developed yet. So which programming language and library should I use to make asynchronous requests to ClickHouse?
The async/await keywords are built into Python, work with the standard asyncio library, and work fine with clickhouse-connect. The main caveat is that ClickHouse doesn't allow concurrent queries in the same session, so you should disable the autogenerate_session_id common setting before calling get_client. See the explanation here: https://clickhouse.com/docs/en/integrations/python#managing-clickhouse-session-ids
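A minimal sketch of the session-id advice above, assuming clickhouse-connect is installed and a server is reachable. The helper name make_shared_client, the host, and the credentials are placeholders for illustration and are not part of the library:

```python
def make_shared_client():
    """Create one client intended to be shared by concurrent queries."""
    # Imported inside the function so this sketch loads even where
    # clickhouse-connect is not installed.
    import clickhouse_connect
    from clickhouse_connect import common

    # Set this before get_client(): with no auto-generated session id,
    # concurrent queries on the client are not tied to a single session.
    common.set_setting("autogenerate_session_id", False)

    # Host and credentials are placeholders.
    return clickhouse_connect.get_client(
        host="localhost", username="default", password=""
    )
```

The setting is process-wide, so it only needs to be applied once, before the first client is created.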
To clarify, I see this issue as an enhancement request to either create a clean async wrapper around the library that feels more "async native" or, even better, use a true async HTTP client. The urllib3 library doesn't use any async I/O, so in that sense clickhouse-connect doesn't really do things in a real event-driven way. But you can still use clickhouse-connect with async code.
+1, this feels like a required feature for a modern Python library.
Mind elaborating on what you mean by brain-dead? It's very usable and used everywhere. What aspect of it is brain-dead, and how is it worse than thread-blocking urllib3 (which seems far more brain-dead to me)?
Hi @elrik75 @araa47 @genzgd @vsbaldeev, there seems to be a package that wraps the old clickhouse driver here: https://github.com/long2ice/asynch. Maybe this could help?
Essentially brain-dead is the "all or nothing" implementation. No project can be converted to async ever without a complete rewrite of that project and all of the libraries it uses. And the libraries that use these libraries. No nested tasks are allowed, no gradual I/O conversion, full rewrite only.
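One concrete form of the "no nested tasks" complaint can be reproduced with the standard library alone. In this sketch the names fetch_value and sync_helper are hypothetical: a synchronous helper that tries to drive a coroutine with asyncio.run() works from plain sync code but fails once it is called from inside a running event loop.

```python
import asyncio


async def fetch_value() -> int:
    """Hypothetical async operation."""
    await asyncio.sleep(0)
    return 42


def sync_helper() -> int:
    """Synchronous code that tries to reuse an async function."""
    coro = fetch_value()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def main() -> str:
    try:
        sync_helper()  # called while an event loop is already running
    except RuntimeError as exc:
        return str(exc)
    return "no error"


message = asyncio.run(main())
print(message)
```

Called outside an event loop, sync_helper() returns normally, which is why the problem only appears once part of the call stack has become async.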
That's correct. The point is more: what kind of DB client should CH support then? sync or async? What kind makes more sense? My point of view (because I work on backend servers) is that a sync IO lib is not very helpful.
@alexted We are in the process of defining the roadmap for clickhouse-connect and other integrations, but for the moment this work is not planned for the immediate future. The current thinking is that true async support would involve swapping out to the httpx HTTP client with integrated async support (instead of the venerable urllib3), but that is of course a fairly major change.
To be clear, this is definitely work we want to do, but it may take a while to come to the top of the priority list given all of the work on other projects. As always, community contributions to help us get moving in the right direction are appreciated. :)
In case anyone is struggling with this, if the postgres connection port is open, you can connect to clickhouse with psycopg3:
    import psycopg

    # Run this inside a coroutine (it uses await).
    conn = await psycopg.AsyncConnection.connect(
        dbname='default',
        user='default',
        password='...',
        host='localhost',
        port='9005',
        cursor_factory=psycopg.AsyncClientCursor,
        autocommit=True,
    )
You need cursor_factory=psycopg.AsyncClientCursor, autocommit=True to get psycopg to use the simple postgres protocol, since clickhouse doesn't support the extended protocol. That is also the reason you can't use asyncpg.
All data types seem to be returned as strings, but it's better than nothing (and pydantic does a pretty good job of coercing data to an expected type)
I've added an example of how to run clickhouse-connect queries asynchronously (it includes a semaphore just for fun as well). Based on very limited testing, this solution still significantly outperforms other ClickHouse "async" libraries. You're still running on a single Python thread at a time, but the HTTP requests are obviously very much I/O bound and the Cython transform code is extremely fast, so the GIL isn't as big a problem as you might think.
If you do experiment with similar solutions, please report your results (good or bad) here.
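The general pattern described above (blocking calls pushed to a thread pool, with a semaphore capping concurrency) can be sketched with the standard library. This is not the example from the repository: blocking_query is a hypothetical stand-in for a synchronous client call, and the limit of 4 and the 0.1 second delay are arbitrary.

```python
import asyncio
import time


def blocking_query(sql: str) -> str:
    """Hypothetical stand-in for a synchronous call such as client.query()."""
    time.sleep(0.1)  # simulates waiting on HTTP I/O
    return f"result of {sql}"


async def run_query(semaphore: asyncio.Semaphore, sql: str) -> str:
    # The semaphore caps how many queries are in flight at once.
    async with semaphore:
        loop = asyncio.get_running_loop()
        # None selects the loop's default ThreadPoolExecutor.
        return await loop.run_in_executor(None, blocking_query, sql)


async def main() -> tuple[list[str], float]:
    semaphore = asyncio.Semaphore(4)
    started = time.perf_counter()
    results = await asyncio.gather(
        *(run_query(semaphore, f"SELECT {i}") for i in range(8))
    )
    return list(results), time.perf_counter() - started


results, elapsed = asyncio.run(main())
print(len(results), f"{elapsed:.2f}s")
```

Eight simulated queries with at most four in flight should finish in roughly two rounds of the simulated delay rather than eight, because the worker threads wait in parallel while the event loop stays free.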
Since 0.7.16, ClickHouse-Connect provides a convenience AsyncClient wrapper over the standard Client, so that it is no longer required to write your own. See this new entry in the docs. The async usage example was also updated.
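A hedged usage sketch of the wrapper, assuming clickhouse-connect 0.7.16 or later and a reachable server. The function name list_databases, the host, and the credentials are placeholders, and the query is only an illustration:

```python
async def list_databases() -> list:
    """Return database names using the AsyncClient wrapper."""
    # Imported inside the function so this sketch loads even where
    # clickhouse-connect is not installed.
    import clickhouse_connect

    # get_async_client() returns the AsyncClient wrapper; host and
    # credentials are placeholders.
    client = await clickhouse_connect.get_async_client(
        host="localhost", username="default", password=""
    )
    result = await client.query("SELECT name FROM system.databases")
    return [row[0] for row in result.result_rows]
```

To try it, run asyncio.run(list_databases()) from a script with a server available.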
This is very unfortunate. As I think I said to @tbragin, pretending to have an async client by just running the sync code in a thread pool is an own goal for ClickHouse. It'll often be slower than just using the sync client. You'd be better off recommending either loop.run_in_executor or the HTTP API and httpx.
To actually solve this properly, you need to either:
@samuelcolvin, as you can see, the issue is not closed. We will support asyncio natively, but the implementation will take time. The AsyncClient wrapper was created to close the gap in the meantime.
@samuelcolvin Putting aside the Arrow Flight question for the moment, I'm not sure what problem you are trying to "solve". Queries run using the wrapper yield the asyncio event loop when waiting for the HTTP I/O. So your main thread will spend its time in the CPU-bound parts of the query (generally, transforming data). The fact that urllib3 itself doesn't have an async API doesn't change how the asyncio event loop behaves while waiting for network data.
I can assure you that using clickhouse-connect with the async wrapper will outperform any code using a client based on httpx and a text format like JSON or CSV. So if you are getting an async API that yields the main event loop thread while waiting for I/O, and has high performance, what exactly is missing?
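The claim that the event loop keeps running while a worker thread waits on blocking I/O can be checked with a small standard-library sketch. Here blocking_io is a hypothetical stand-in for a urllib3-style blocking request, and the timings are arbitrary. It says nothing about relative performance, only about whether the loop is blocked.

```python
import asyncio
import time


def blocking_io() -> str:
    """Hypothetical stand-in for a blocking HTTP request."""
    time.sleep(0.3)
    return "done"


async def heartbeat(stop: asyncio.Event) -> int:
    """Count how often the event loop gets to run this task."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def main() -> tuple[str, int]:
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    # The blocking call runs in a worker thread; awaiting it yields the loop.
    result = await asyncio.to_thread(blocking_io)
    stop.set()
    return result, await beat


result, ticks = asyncio.run(main())
print(result, ticks)
```

If the blocking call were made directly on the event loop thread instead, the heartbeat task would not advance until the call returned.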
Major DBs have their own asyncio clients nowadays:
It's logical for any I/O access in Python to be async. Note that there is an unmaintained async client for ClickHouse: https://github.com/maximdanilchenko/aiochclient. Having an official, full-featured async client for ClickHouse built with Cython would be great!