apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Apache Arrow Flight Server (Data as a Service) #44135

Closed Susmit07 closed 3 weeks ago

Susmit07 commented 1 month ago

I am planning to build an Arrow Flight server on top of data lying in S3; the S3 storage holds petabytes of data.

I have a few concerns. First, if the Flight server loads all 1 PB of data in memory, it will lead to out-of-memory errors.

Secondly, when we have incremental updates or new data arrives, how do we refresh the data?

Thirdly, we are planning to deploy onto multiple nodes in a k8s cluster, so every node will maintain a copy of the data. I am concerned about such a huge dataset being loaded in memory, and about the ideal way to sync the data.

Component(s)

FlightRPC

zeroshade commented 1 month ago

Most likely consumers of this service wouldn't be retrieving the entire PB of data for each request right? They'd be requesting some subset of the data, correct?

Thus the flight service would simply be a go-between to centralize access and logic for retrieving the data from S3, while potentially providing a cache. Each service would never load the entire dataset in memory, it would just stream portions of it, possibly caching bits and pieces to avoid calling all the way out to S3.

Secondly, when we have incremental updates or new data arrives, how do we refresh the data?

You'd have to perform cache invalidation for any caches in the services. The simplest approach might be a Stat/HEAD check on the relevant object in S3: use the cache if the object's last-modified timestamp matches the cached one, and pull the data directly from S3 when it has been updated and the local cache is out of date.
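A minimal sketch of that last-modified check, with the S3 calls abstracted behind callables (in practice `head_fn` could wrap boto3's `head_object` and `fetch_fn` its `get_object`; all names here are hypothetical):

```python
# Hypothetical last-modified-based cache invalidation sketch.
# head_fn(key) returns the object's last-modified timestamp (a cheap
# metadata-only request); fetch_fn(key) downloads the object itself.

def get_with_cache(key, head_fn, fetch_fn, cache):
    last_modified = head_fn(key)
    entry = cache.get(key)
    if entry is not None and entry[0] == last_modified:
        return entry[1]                 # cache hit: timestamps match
    data = fetch_fn(key)                # cache miss or stale: re-fetch
    cache[key] = (last_modified, data)
    return data
```

Every read costs at least one metadata request, but the full download only happens when the object has actually changed.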

Thirdly, we are planning to deploy onto multiple nodes in a k8s cluster, so every node will maintain a copy of the data. I am concerned about such a huge dataset being loaded in memory, and about the ideal way to sync the data.

Same comment as before, you wouldn't pull the entire dataset as copies into each service instance. That's incredibly inefficient and unnecessary. The ideal way here is as I mentioned before, stream directly from S3 while optionally caching the data locally if needed.

Susmit07 commented 1 month ago

One option could be on-demand data retrieval in the Apache Arrow Flight server: to handle scenarios where the data returned is potentially very large, fetch and stream it incrementally so we never overwhelm memory.

import org.apache.arrow.flight.{FlightProducer, Ticket}
import org.apache.arrow.vector.ipc.ArrowStreamReader

// Flight's Java API exposes reads via FlightProducer.getStream rather than
// a doGet method that returns a stream object.
override def getStream(context: FlightProducer.CallContext, ticket: Ticket,
                       listener: FlightProducer.ServerStreamListener): Unit = {
  val data = fetchDataFromS3(ticket) // user-defined: opens the Arrow IPC stream in S3
  val reader = new ArrowStreamReader(data, allocator)
  try {
    listener.start(reader.getVectorSchemaRoot) // sends the schema first
    while (reader.loadNextBatch()) {
      listener.putNext() // streams the batch currently loaded into the root
    }
    listener.completed()
  } finally {
    reader.close()
  }
}

One downside we see is that it makes an I/O call to S3 on every request, but the upside is that the data is always read fresh.

Do you see any other issue ?

zeroshade commented 1 month ago

one downside we see is that everytime it makes an IO call to read from S3 .. but data will be read fresh

That's why I suggested potentially having an LRU cache (or similar) to cache some datasets (not all!) somewhere (Redis, etcd, local memory, whatever) so that your most frequently requested data can avoid the I/O call to S3.
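For the in-process variant, a capacity-bounded LRU cache is only a few lines (purely illustrative; a Redis- or etcd-backed cache would replace this class):

```python
from collections import OrderedDict


class LRUCache:
    """Tiny capacity-bounded cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the LRU entry
```

Sizing the capacity by bytes rather than entry count would be the natural next step when the cached datasets vary widely in size.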

Susmit07 commented 3 weeks ago

Thank you @zeroshade

zeroshade commented 3 weeks ago

@Susmit07 any unanswered questions here, or can we close this out?

Susmit07 commented 3 weeks ago

Just wanted to mention that In-Memory Analytics with Apache Arrow by you is really good.

Hope it reaches a wider audience

zeroshade commented 3 weeks ago

Thanks for the praise! If you haven't checked it out yet, the 2nd edition of the book launched at the beginning of October! :smile:

Susmit07 commented 3 weeks ago

I will surely look into it.

@zeroshade I am a bit confused about how to use the Apache Arrow Flight server for our use case

I tagged you in this discussion, and I have mentioned my use case too

https://github.com/apache/arrow-rs/discussions/6557#discussioncomment-10946777

But I feel writing my own Flight SQL server will involve a lot of effort, plus I need to write my own SQL parser.

Is querying data using Flight SQL with S3 as an object store the right use case, or should I explore other, better options?