GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

Caching Frequently used Parquet sections #2671

Open reisepass opened 7 months ago

reisepass commented 7 months ago

Description

On the topic of caching: has GlareDB already implemented caching of frequently queried Parquet blocks, either in memory or on fast disk near the compute? In the Java world, external blob-storage caching systems such as https://github.com/Alluxio/alluxio exist, and they can provide in-memory access directly to Spark or Trino processes.

Starburst also has good caching built into their cloud version of Trino. It is almost fast enough to use as a back end for REST APIs pulling data from Parquet, but the Java overhead is still hard to swallow when you can accomplish this so simply with pyarrow.

Context: We are looking for a solution that enables efficient small queries from large numbers of concurrent read-only users, without having to copy the data yet again into Postgres/ClickHouse.
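
To make the request concrete, here is a minimal sketch of the kind of in-memory block cache described above. It is not GlareDB's design or API. `BlockCache`, `loader`, and `max_blocks` are hypothetical names chosen for illustration, and the sketch assumes a "block" is one Parquet row group identified by file path and row-group index, with least-recently-used eviction.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class BlockCache:
    """Small LRU cache for decoded Parquet blocks (hypothetical helper)."""

    def __init__(self, loader: Callable[[str, int], Any], max_blocks: int = 128):
        # loader(path, row_group) performs the slow read from blob storage.
        self._loader = loader
        self._max_blocks = max_blocks
        self._blocks: "OrderedDict[Tuple[str, int], Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path: str, row_group: int) -> Any:
        key = (path, row_group)
        if key in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(key)
            self.hits += 1
            return self._blocks[key]
        # Cache miss: fetch the block and remember it.
        self.misses += 1
        block = self._loader(path, row_group)
        self._blocks[key] = block
        if len(self._blocks) > self._max_blocks:
            # Evict the least recently used block.
            self._blocks.popitem(last=False)
        return block


def fake_loader(path: str, row_group: int) -> str:
    # Stand-in for a real read, for example
    # pyarrow.parquet.ParquetFile(path).read_row_group(row_group)
    return f"{path}#rg{row_group}"


cache = BlockCache(fake_loader, max_blocks=2)
cache.get("events.parquet", 0)  # miss, loads the block
cache.get("events.parquet", 0)  # hit, served from memory
```

With pyarrow, the loader could wrap `ParquetFile.read_row_group`. A real deployment serving many concurrent readers would also need locking around the cache and a way to invalidate blocks when the underlying files change, both of which this sketch leaves out.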

universalmind303 commented 7 months ago

Related to https://github.com/GlareDB/glaredb/issues/1791

tychoish commented 7 months ago

I've been talking with the team about using caching for this kind of thing, but we haven't had a user doing this kind of workload. I think there are (or could be) a lot of options here, depending on what you're trying to do, at varying levels of complexity/work. Would love to hear more about what you're building! Feel free to drop some time on my calendar if you like.