Most likely, consumers of this service wouldn't be retrieving the entire PB of data for each request, right? They'd be requesting some subset of the data, correct?
Thus the Flight service would simply be a go-between to centralize access and logic for retrieving the data from S3, while potentially providing a cache. Each service would never load the entire dataset in memory; it would just stream portions of it, possibly caching bits and pieces to avoid calling all the way out to S3.
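For example, here's a minimal sketch of what a consumer might look like, assuming the Java Flight client API (the host, port, and the S3 key encoded in the ticket are just placeholders):

import org.apache.arrow.flight.{FlightClient, Location, Ticket}
import org.apache.arrow.memory.RootAllocator

object FlightSubsetClientExample extends App {
  val allocator = new RootAllocator()
  // host/port are placeholders for wherever the Flight service runs
  val client = FlightClient.builder(allocator, Location.forGrpcInsecure("flight-svc", 8815)).build()

  // the ticket names only the slice the caller needs, not the whole dataset
  val ticket = new Ticket("s3://my-bucket/dataset/part-0001.arrow".getBytes("UTF-8"))
  val stream = client.getStream(ticket)
  while (stream.next()) {
    val root = stream.getRoot            // one record batch at a time, never the full PB
    println(s"received ${root.getRowCount} rows")
  }
  stream.close()
  client.close()
  allocator.close()
}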
Secondly, when we have incremental updates or new data arrives, how do we refresh the data?
You'd have to perform cache invalidation for any caches in the services (the simplest way might be to do a Stat/HEAD check on the relevant object in S3 and only use the cache if the "last modified" timestamp matches), and pull directly from S3 whenever the object has been updated and the local cache is out of date.
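A minimal sketch of that freshness check, assuming the AWS SDK v2 for Java and a simple in-memory map as the cache (the class names and byte-array payload are just illustrative):

import java.time.Instant
import scala.collection.mutable
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

final case class CachedObject(lastModified: Instant, bytes: Array[Byte])

// cache keyed by (bucket, key); the payload type is only illustrative
class S3BackedCache(s3: S3Client, cache: mutable.Map[(String, String), CachedObject]) {
  // Return the cached bytes only if S3 still reports the same last-modified timestamp;
  // otherwise the caller re-fetches from S3 and refreshes the cache entry.
  def getIfFresh(bucket: String, key: String): Option[Array[Byte]] = {
    val head = s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build())
    cache.get((bucket, key)).collect {
      case entry if entry.lastModified == head.lastModified() => entry.bytes
    }
  }
}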
Thirdly, we are planning to deploy onto multiple nodes in a k8s cluster, where every node would maintain a copy of the data. I am concerned about loading such huge data in memory and about the ideal way to sync the data.
Same comment as before: you wouldn't pull the entire dataset into each service instance as a copy. That's incredibly inefficient and unnecessary. The ideal way here is, as I mentioned before, to stream directly from S3 while optionally caching the data locally if needed.
One option could be on-demand data retrieval in the Apache Arrow Flight server: to handle scenarios where the returned data is potentially very large, fetch the data from S3 and stream it back batch by batch so memory is never overwhelmed. A rough sketch (assuming the Java Flight FlightProducer API, with fetchDataFromS3 and allocator as placeholders):
// Rough sketch using the Java Flight FlightProducer API: stream the requested
// data from S3 batch by batch instead of materializing the whole dataset.
override def getStream(context: FlightProducer.CallContext,
                       ticket: Ticket,
                       listener: FlightProducer.ServerStreamListener): Unit = {
  val s3Input = fetchDataFromS3(ticket)        // placeholder: open an InputStream over the relevant S3 object(s)
  val reader  = new ArrowStreamReader(s3Input, allocator)  // `allocator` is a BufferAllocator field on the server
  val root    = reader.getVectorSchemaRoot     // schema is read from the stream itself
  listener.start(root)
  while (reader.loadNextBatch()) {             // loads one record batch into `root`
    listener.putNext()                         // forward the current batch to the client
  }
  listener.completed()
  reader.close()
}
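With this pattern, loadNextBatch() only materializes a single record batch at a time, so the server's memory footprint is bounded by the batch size rather than by the size of the dataset sitting in S3.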
One downside we see is that it makes an I/O call to read from S3 every time, but the data will always be read fresh.
Do you see any other issues?
One downside we see is that it makes an I/O call to read from S3 every time, but the data will always be read fresh.
That's why I suggested potentially having an LRU cache or something similar to cache some datasets (not all!) somewhere (Redis, etcd, local memory, whatever) so that your most frequently requested data could avoid the I/O call to S3.
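A minimal sketch of such an in-process LRU cache, keyed by the ticket and holding illustrative byte-array payloads (the class name and sizing are hypothetical):

import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// access-ordered LinkedHashMap: evicts the least recently used entry once maxEntries is exceeded
class LruBatchCache(maxEntries: Int) {
  private val cache = new JLinkedHashMap[String, Array[Byte]](16, 0.75f, true) {
    override def removeEldestEntry(eldest: Entry[String, Array[Byte]]): Boolean =
      size() > maxEntries
  }

  def get(ticketKey: String): Option[Array[Byte]] = Option(cache.get(ticketKey))
  def put(ticketKey: String, payload: Array[Byte]): Unit = { cache.put(ticketKey, payload); () }
}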
Thank you @zeroshade
@Susmit07 any unanswered questions here, or can we close this out?
Just wanted to mention that In-Memory Analytics with Apache Arrow by you is super good.
Hope it reaches a wider audience.
Thanks for the praise! If you haven't checked it out yet, the 2nd edition of the book launched at the beginning of October! :smile:
I will surely look into it.
@zeroshade I am a bit confused about how to use the Apache Arrow Flight server for our use case
I tagged you in this discussion, and I have mentioned my use case too
https://github.com/apache/arrow-rs/discussions/6557#discussioncomment-10946777
But I feel writing my own Flight SQL server will involve a lot of effort, plus I would need to write my own SQL parser.
Is querying data using Flight SQL with S3 as an object store the right use case, or should I explore other, better options?
I am planning to build an Arrow Flight server on top of data lying in S3; the S3 storage holds a petabyte of data.
I have a few concerns. First, if the Flight server loads all 1 PB of data in memory, it will lead to out-of-memory errors.
Secondly, when we have incremental updates or new data arrives, how do we refresh the data?
Thirdly, we are planning to deploy onto multiple nodes in a k8s cluster, where every node would maintain a copy of the data. I am concerned about loading such huge data in memory and about the ideal way to sync the data.
Component(s)
FlightRPC