Introduce a source data cache layer

yahoNanJing commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In a cloud-native architecture with completely stateless executors, each executor has to fetch its source data from remote storage. When the source data is very large, network throughput easily becomes the bottleneck and fetching the data takes too much time. For example, a compute layer of 20 nodes, each with 10 Gb/s of network bandwidth, needs at least 4s to fetch 100GB of source data (20 × 10 Gb/s is 25 GB/s in aggregate), while the other steps of processing those 100GB often take less than 1s. It would therefore be better to introduce a cache layer into the cloud-native architecture, making the executors weakly stateful so that they cache hot data on local disk, as Snowflake does.
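
The 4s figure above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes "10Gb" means 10 gigabits per second per node, that all nodes download in parallel at full line rate, and that protocol overhead is zero. The function name is made up for illustration.

```python
def min_fetch_seconds(data_gb: float, nodes: int, node_gbit_per_s: float) -> float:
    """Lower bound on fetch time when every node downloads in parallel at line rate."""
    # Convert per-node gigabits/s to aggregate gigabytes/s (8 bits per byte).
    aggregate_gbyte_per_s = nodes * node_gbit_per_s / 8
    return data_gb / aggregate_gbyte_per_s


# 20 nodes x 10 Gb/s = 25 GB/s in aggregate, so 100 GB needs at least 4 seconds.
print(min_fetch_seconds(100, 20, 10))  # 4.0
```

Real fetches will be slower than this bound, which only strengthens the case for caching hot data locally.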

Describe the solution you'd like

https://docs.google.com/document/d/1iMFv3S-TuiwBoTzp4KX0Ltrrenm86ULr0q_PwIKdW6g/edit?usp=sharing

To achieve this goal, we need to finish the following tasks:

- [x] (Executor) Introduce a 2-tiered cache manager for caching data on remote storage based on local memory and disk
  - [x] #652
  - [x] #825
  - [x] #827
- [x] (Scheduler) #830
- [ ] (Scheduler) #833
- [x] (Scheduler) #648, #649
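
To make the executor-side task more concrete, here is a minimal sketch of what a two-tier (memory, then local disk) read-through cache could look like. It is not the Ballista implementation: the class `TwoTierCache`, the `fetch_remote` callable, and the entry-count-based LRU eviction are all assumptions chosen for illustration, and the disk tier here never evicts.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Hypothetical read-through cache: in-memory LRU tier in front of a local-disk tier."""

    def __init__(self, fetch_remote, disk_dir, max_memory_entries=2):
        self._fetch_remote = fetch_remote  # assumed callable: remote path -> bytes
        self._disk_dir = disk_dir
        self._max_memory_entries = max_memory_entries
        self._memory = OrderedDict()  # remote path -> bytes, oldest entry first
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, remote_path):
        # Hash the remote path so it is safe to use as a local file name.
        name = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self._disk_dir, name)

    def _remember(self, remote_path, data):
        self._memory[remote_path] = data
        self._memory.move_to_end(remote_path)
        while len(self._memory) > self._max_memory_entries:
            # Evict the least recently used entry; its disk copy stays in place.
            self._memory.popitem(last=False)

    def get(self, remote_path):
        # Tier 1: memory.
        if remote_path in self._memory:
            self._memory.move_to_end(remote_path)
            return self._memory[remote_path]
        # Tier 2: local disk, falling back to remote storage on a miss.
        disk_path = self._disk_path(remote_path)
        if os.path.exists(disk_path):
            with open(disk_path, "rb") as f:
                data = f.read()
        else:
            data = self._fetch_remote(remote_path)
            with open(disk_path, "wb") as f:
                f.write(data)
        self._remember(remote_path, data)
        return data


# Usage with a fake remote store that records every fetch.
remote_calls = []


def fake_remote(path):
    remote_calls.append(path)
    return path.encode("utf-8")


cache = TwoTierCache(fake_remote, tempfile.mkdtemp(), max_memory_entries=1)
cache.get("s3://bucket/a.parquet")  # remote fetch
cache.get("s3://bucket/b.parquet")  # remote fetch, evicts "a" from memory
cache.get("s3://bucket/a.parquet")  # served from local disk, no remote fetch
print(len(remote_calls))  # 2
```

A production cache would also need size-based limits, disk eviction, and concurrency control, none of which are shown here.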

Describe alternatives you've considered

Additional context

collimarco commented 1 year ago

I am definitely interested in this feature, thanks for posting this. I came to exactly the same conclusions while testing on large datasets stored on S3: the approach suggested here is definitely the best. Bandwidth is the bottleneck, and using a hashing algorithm to send requests for the same data to the same executors, which cache it on disk, is definitely the best solution.
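
For readers unfamiliar with the idea, hash-based routing of this kind can be sketched with a consistent hash ring. This is an illustrative sketch only, not the scheduler's actual algorithm: the `HashRing` class, the executor names, the use of SHA-256, and the number of virtual nodes are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Take the first 8 bytes of SHA-256 as a 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring mapping source files to executors."""

    def __init__(self, executors, vnodes=64):
        if not executors:
            raise ValueError("at least one executor is required")
        # Each executor owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{executor}#{i}"), executor)
            for executor in executors
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def executor_for(self, file_path: str) -> str:
        # Pick the first ring point after the file's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(file_path)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["exec-1", "exec-2", "exec-3"])
path = "s3://bucket/part-0.parquet"
# The same file is always routed to the same executor, so its disk cache gets reused.
print(ring.executor_for(path) == ring.executor_for(path))  # True
```

The appeal of a ring over a plain `hash(path) % n` is that when an executor leaves, only the files it owned are remapped, so the other executors keep their warm caches.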

BertHartm commented 4 months ago

I see that the only unchecked box (#833) has been merged. Does that mean this work is complete, or is there more to be done to achieve the goal?

apache / datafusion-ballista

Introduce a source data cache layer #645