ClickHouse / clickhouse-jdbc-bridge

A JDBC proxy from ClickHouse to external databases

Cache #110

Closed - bralbral closed 2 years ago

bralbral commented 3 years ago

Hello!

According to the documentation: "By default, any adhoc query passed to JDBC bridge will be executed twice. The first run is for type inferring, while the second for retrieving results. Although metadata will be cached (for up to 5 minutes by default), executing same query twice could be a problem - that's where schema comes into play."

I do not understand which parameter I need to increase.

In my case, the schema never changes. Can I periodically run "cache-warming" queries like select * from table limit 10 to keep the schema in the cache?

Did I understand correctly that if select * selects all columns, then this cache entry will also be valid for select col1, col2?

Would this reduce the overhead?

Best regards!

zhicwu commented 3 years ago

Sorry, the cache is not as sophisticated as you expected. It's just a mapping between SQL (a case-sensitive String) and column types (a list of objects), meaning select * from table and select col1, col2 from table are two separate entries in the cache, even though they overlap.
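
To illustrate, each of the following would be a separate cache entry, because the key is the exact, case-sensitive query string (these statements only show how the cache keys behave, they are not something you need to run):

  select * from table            -- one entry
  select col1, col2 from table   -- a separate entry, even though the columns overlap
  SELECT * FROM table            -- yet another entry, since the key is case-sensitive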

As for the type-inferring overhead, I think it's a limitation of the current XDBC bridge design. Let me walk you through the query process:

  1. You send the query select * from jdbc('mydb', 'select * from table') to ClickHouse
  2. ClickHouse first checks JDBC bridge availability by sending a request to <jdbc-bridge>/ping
  3. JDBC bridge responds Ok. to proceed
  4. ClickHouse then sends another request to <jdbc-bridge>/columns to find out the column types of select * from table
  5. JDBC bridge issues the query (with both fetchSize and maxRows set to 1) and returns the serialized column types
  6. ClickHouse sends a separate request, with the query and column types together, to <jdbc-bridge>/ in order to retrieve the query results
  7. JDBC bridge issues the same query again (without fetchSize and maxRows set to 1) and returns the results in RowBinary format
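
So from the external database's point of view, the same inner query simply arrives twice, roughly like this (a sketch using the example query from step 1):

  -- run 1 (step 5): executed only for type inferring;
  -- fetchSize and maxRows are set to 1, so at most one row is fetched
  select * from table

  -- run 2 (step 7): executed again without those limits;
  -- the full result set is returned to ClickHouse in RowBinary format
  select * from table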

Step 5 is the overhead we're talking about. Caching column types helps prevent executing the same query over and over again within a short period of time, but to be honest it's not ideal. You should consider a named query in a production-like environment, for example:
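
(my_query below is a hypothetical named query; it would be defined once in the bridge's configuration, together with the query text and its column types, see the named query examples in the project README for the exact format)

  -- ad hoc query: the inner SQL ends up being executed twice, as described above
  select * from jdbc('mydb', 'select * from table')

  -- named query: the bridge already knows the query and its declared column types,
  -- so the extra type-inferring execution can be skipped
  select * from jdbc('mydb', 'my_query')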

bralbral commented 3 years ago

I got it. Thank you!