apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/

Arrow Flight Endpoint for Pinot #6921

Open · siddharthteotia opened this issue 3 years ago

siddharthteotia commented 3 years ago

In addition to being used as the in-memory and wire columnar format in a few compute engines, Arrow is also commonly used for data sharing between JVM and non-JVM systems without SerDe overhead. Python users working with pandas and other analytical libraries can therefore consume the Arrow in-memory format generated by a JVM-based engine.
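
To make the no-SerDe point concrete, here is a minimal sketch in Python with pyarrow (the table contents are made up): the Arrow IPC stream bytes have the same layout as the in-memory format, so the reader references them in place instead of decoding row by row.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Stand-in for data produced by a JVM engine; the IPC bytes a Java
# writer emits would have the identical layout.
table = pa.table({"ts": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Serialize to the Arrow IPC stream format.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# "Deserialization" is just mapping the buffer: record batches reference
# the bytes in place rather than being parsed value by value.
result = ipc.open_stream(buf).read_all()
df = result.to_pandas()  # continue in pandas
```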

See this example of how PySpark uses Arrow: https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow

Arrow Flight is a wire protocol optimized for network transfer of columnar record batches (think of it as an alternative to the JDBC and ODBC protocols). The wire format is the same as the in-memory format, so when both endpoints use Arrow, the Flight protocol can efficiently send result data from Pinot as Arrow record batches to, say, a Python client, which can continue to do additional processing on it.
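
To illustrate, a minimal sketch of what such an endpoint could look like, using pyarrow.flight. Everything here (port, SQL-in-ticket convention, the hard-coded result table) is a made-up placeholder for illustration, not an existing Pinot API:

```python
import pyarrow as pa
import pyarrow.flight as flight

class PinotFlightServer(flight.FlightServerBase):
    """Hypothetical Flight endpoint in front of the query engine."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)

    def do_get(self, context, ticket):
        # In a real endpoint the ticket would carry the SQL (or a query id)
        # and the table below would be the actual query result.
        sql = ticket.ticket.decode()
        table = pa.table({"city": ["SF", "NY"], "clicks": [120, 340]})
        return flight.RecordBatchStream(table)

if __name__ == "__main__":
    PinotFlightServer().serve()
```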

lfeagan commented 3 years ago

Agreed. I am looking for any binary format that provides more efficient transfer than JSON over HTTP/1.1. Arrow is a good choice; Avro over gRPC would also do well, based on my performance evaluations. Even simply adding support for a content type other than "application/json", such as Avro, over the existing HTTP socket would be nice and not particularly difficult to achieve.
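
For instance, content negotiation on the existing broker query endpoint might look like the sketch below. The /query/sql endpoint is real; honoring an Avro Accept header is exactly the missing piece being requested:

```python
import requests

# Assumed behavior: today the broker always returns JSON, so the Accept
# header below is the requested feature, not a working call.
resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT city, COUNT(*) FROM clicks GROUP BY city"},
    headers={"Accept": "application/avro"},  # hypothetical content type
)
# If supported, resp.content would be an Avro payload to decode with an
# Avro library, instead of parsing a large JSON document.
```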

kishoreg commented 3 years ago

Is Arrow a specification or an implementation? Pinot already has a lot of the techniques available in Arrow. I am wondering whether it is a better idea to implement the Arrow spec on top of the Pinot DataTable.

@lfeagan Pinot servers already support a gRPC streaming endpoint that uses the Pinot DataTable; it is used in the Trino-Pinot connector.

lfeagan commented 3 years ago

In this case, I am referring to Arrow as a specification for a binary data format, rather than as an implementation. I am looking for support for reading data as a client. In our use case, we have a large number of industrial sensors against which we regularly run complex analytics jobs via Spark, pulling out days or months of data: for example, analyzing how wind direction affects leaks at manufacturing plants, making inferences about the source, or extrapolating whether EPA rules will be violated in any area in the future.

mayankshriv commented 3 years ago

We have seen JSON responses cause ser/de-related latencies when handling large responses. Within Pinot, we already have a compact binary format for data transfer between the Pinot broker and servers. We could also have Pinot return the response in a compact binary format (with Apache Arrow as the specification, or otherwise), along with a client-side library to deserialize the response; a sketch of what that could look like is below.
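
A sketch of such a client helper, assuming a hypothetical Arrow IPC response body from the broker (Pinot returns JSON today; pyarrow does the decode):

```python
import pyarrow as pa
import requests

def query_arrow(broker: str, sql: str) -> pa.Table:
    # Hypothetical: POST a query and decode a binary (Arrow IPC stream)
    # response body instead of JSON. Pinot does not support this today.
    resp = requests.post(
        f"http://{broker}/query/sql",
        json={"sql": sql},
        headers={"Accept": "application/vnd.apache.arrow.stream"},
    )
    resp.raise_for_status()
    return pa.ipc.open_stream(resp.content).read_all()
```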

@siddharthteotia were you planning to take this up?

siddharthteotia commented 3 years ago

Sorry, I somehow missed this discussion thread.

@kishoreg @mayankshriv @lfeagan - Arrow is a specification rather than an implementation (it has multiple programming-language implementations).

My main intention behind this issue was database connectivity. Like JDBC and ODBC, this would be another connector to Pinot: the binary protocol would be based on the Arrow spec and the transport would be gRPC. This was not necessarily for server-to-broker communication, but for client-to-broker transport, where clients can pull the PinotResultSet off the wire into memory as Arrow buffers without copies or SerDe and continue further processing on the Arrow data, a common scenario in the Python / data-science world from what I can see. The client side is sketched below.
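
The client side of that would be only a few lines with pyarrow.flight; the host/port and the SQL-in-ticket convention below are assumptions for illustration:

```python
import pyarrow.flight as flight

# Hypothetical client for the proposed endpoint: fetch the result set as
# Arrow record batches and keep working on it in pandas, with no copy/SerDe.
client = flight.FlightClient("grpc://pinot-broker:8815")
ticket = flight.Ticket(b"SELECT city, COUNT(*) FROM clicks GROUP BY city")
reader = client.do_get(ticket)
df = reader.read_pandas()  # continue processing in Python
```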

Moving the server-to-broker DataTable format to the Arrow spec could also be considered.

PS - this is also not about basing the underlying disk/memory format on Arrow. As you mentioned, our format is already well optimized for the engine.