Currently, IIUC, all the segments need to be on local disk before they can be queried.
Usually, query volume on older data tends to be lower, or clients would be okay with
incurring extra latency for queries on older data.
Based on this assumption, we could keep segments older than X days
on deep storage only and not load them onto the servers.
A couple of ways to deal with this, based on discussions on Slack:
1) Use Presto on top of Pinot and make presto-pinot able to query
segments on S3 directly.
2) When a query arrives, lazily load the segment and then return the results (see the sketch below).
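A minimal sketch of option 2 (illustrative Java only, not actual Pinot code; `DeepStoreClient`, `LazySegmentLoader`, and the method names are assumptions): the server keeps only hot segments locally and pulls a cold segment from deep storage the first time a query touches it, which is exactly the extra latency clients would be accepting for old data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface standing in for the deep store client (e.g. S3); not a Pinot API.
interface DeepStoreClient {
  // Downloads the named segment into destDir and returns its local path.
  Path download(String segmentName, Path destDir) throws IOException;
}

// Option 2 sketch: lazily pull cold segments from deep storage on first query access.
public class LazySegmentLoader {
  private final DeepStoreClient deepStore;
  private final Path localSegmentDir;
  // Segments already present on local disk, keyed by segment name.
  private final Map<String, Path> localSegments = new ConcurrentHashMap<>();

  public LazySegmentLoader(DeepStoreClient deepStore, Path localSegmentDir) {
    this.deepStore = deepStore;
    this.localSegmentDir = localSegmentDir;
  }

  // Called on the query path: returns a local path for the segment,
  // downloading it from deep storage only on first access.
  public Path getOrLoad(String segmentName) {
    return localSegments.computeIfAbsent(segmentName, name -> {
      try {
        Files.createDirectories(localSegmentDir);
        return deepStore.download(name, localSegmentDir);
      } catch (IOException e) {
        throw new RuntimeException("Failed to lazy-load segment " + name, e);
      }
    });
  }
}
```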
We can potentially build a storage hierarchy as DRAM -> SSD/HDD -> Remote store (Deep Store).
Segments are adaptively (based on the query pattern) brought onto the compute nodes (servers) and stored on local storage (SSD or HDD).
Furthermore, heavily queried segments (size permitting) can be cached completely in DRAM on the servers. This would be helpful for systems that don't have SSDs, where the paging overhead for memory-mapped segments is non-trivial compared to SSD.
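A rough illustration of the DRAM tier (again illustrative Java, not Pinot code; `DramSegmentCache` and the `byte[]` stand-in for a fully loaded segment are assumptions): an LRU cache of hot segments that sits in front of the SSD/HDD tier and evicts the least recently used segments when capacity is exceeded, with the on-disk copy remaining the fallback.

```java
import java.util.LinkedHashMap;

// Sketch of the DRAM tier: an LRU cache of fully loaded segment bytes in front
// of the local-disk tier. Capacities, names, and the byte[] representation are
// illustrative only.
public class DramSegmentCache {
  private final long capacityBytes;
  private long usedBytes = 0;
  // LinkedHashMap in access order gives a simple LRU eviction policy.
  private final LinkedHashMap<String, byte[]> cache =
      new LinkedHashMap<>(16, 0.75f, true);

  public DramSegmentCache(long capacityBytes) {
    this.capacityBytes = capacityBytes;
  }

  // Returns the cached segment bytes, or null if the segment must be read from
  // the SSD/HDD tier (or lazily fetched from deep storage).
  public synchronized byte[] get(String segmentName) {
    return cache.get(segmentName);
  }

  // Called when query-pattern heuristics decide a segment is hot enough to keep
  // in DRAM; evicts least-recently-used segments until the new one fits.
  public synchronized void put(String segmentName, byte[] segmentBytes) {
    if (segmentBytes.length > capacityBytes) {
      return; // too large for the DRAM tier; serve it from local disk instead
    }
    byte[] previous = cache.remove(segmentName);
    if (previous != null) {
      usedBytes -= previous.length;
    }
    while (usedBytes + segmentBytes.length > capacityBytes && !cache.isEmpty()) {
      String eldest = cache.keySet().iterator().next();
      usedBytes -= cache.remove(eldest).length;
    }
    cache.put(segmentName, segmentBytes);
    usedBytes += segmentBytes.length;
  }
}
```

LRU is just the simplest placement heuristic to show here; a real policy would presumably weigh query frequency and segment size when deciding what to promote into DRAM versus keep on SSD/HDD.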