ClickHouse / clickhouse-jdbc-bridge

A JDBC proxy from ClickHouse to external databases

Cache #110

Closed - bralbral closed 2 years ago

bralbral commented 3 years ago

Hello!

According to the documentation: "By default, any adhoc query passed to JDBC bridge will be executed twice. The first run is for type inferring, while the second for retrieving results. Although metadata will be cached (for up to 5 minutes by default), executing same query twice could be a problem - that's where schema comes into play."

I do not understand which parameter I need to increase.

In my case, the schema never changes. Can I periodically run "cache-warming" queries like select * from table limit 10 to keep the schema in the cache?

Did I understand correctly that if select * selects all columns, then this cache entry will also be valid for select col1, col2?

Would this reduce the overhead?

Best regards!

zhicwu commented 3 years ago

Sorry, the cache is not as sophisticated as you expected. It's just a mapping between SQL (a case-sensitive String) and column types (a list of objects), meaning select * from table and select col1, col2 from table are two separate entries in the cache, even though they overlap.
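
To illustrate, each of the following would be a separate cache entry, because the key is the exact, case-sensitive query string (these statements only show how the cache keys behave, they are not something you need to run):

  select * from table            -- one entry
  select col1, col2 from table   -- a separate entry, even though the columns overlap
  SELECT * FROM table            -- yet another entry, since the key is case-sensitive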

As for the type-inferring overhead, I think it's a limitation of the current XDBC bridge design. Let me walk you through the query process:

  1. You send the query select * from jdbc('mydb', 'select * from table') to ClickHouse
  2. ClickHouse first checks JDBC bridge availability by sending a request to <jdbc-bridge>/ping
  3. JDBC bridge responds Ok. to proceed
  4. ClickHouse then sends another request to <jdbc-bridge>/columns to find out the column types of select * from table
  5. JDBC bridge issues the query (with both fetchSize and maxRows set to 1) and returns the serialized column types
  6. ClickHouse sends a separate request, with the query and column types together, to <jdbc-bridge>/ in order to retrieve the query results
  7. JDBC bridge issues the same query again (without fetchSize and maxRows set to 1) and returns the results in RowBinary format
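
So from the external database's point of view, the same inner query simply arrives twice, roughly like this (a sketch using the example query from step 1):

  -- run 1 (step 5): executed only for type inferring;
  -- fetchSize and maxRows are set to 1, so at most one row is fetched
  select * from table

  -- run 2 (step 7): executed again without those limits;
  -- the full result set is returned to ClickHouse in RowBinary format
  select * from table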

Step 5 is the overhead we're talking about. Caching column types helps prevent executing the same query over and over again within a short period of time, but to be honest it's not ideal. You should consider a named query in a production-like environment, for example:
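
(my_query below is a hypothetical named query; it would be defined once in the bridge's configuration, together with the query text and its column types, see the named query examples in the project README for the exact format)

  -- ad hoc query: the inner SQL ends up being executed twice, as described above
  select * from jdbc('mydb', 'select * from table')

  -- named query: the bridge already knows the query and its declared column types,
  -- so the extra type-inferring execution can be skipped
  select * from jdbc('mydb', 'my_query')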

bralbral commented 3 years ago

I got it. Thank you!