apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

Open nathanielc opened 6 days ago

nathanielc commented 6 days ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am implementing a Flight SQL server using DataFusion. See this logic, which simply reports the query's schema as the flight_info schema.

The FlightDataEncoder has two modes for dictionary handling. In one mode it hydrates dictionaries, which changes the schema of the data during transport. The Flight SQL server needs to report the hydrated schema; otherwise clients will be confused, because the data they receive will not match the reported schema.
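To make the mismatch concrete, here is a minimal sketch of what hydration does to a schema. It uses a toy `DataType` and `Field` written for this illustration only; these are assumptions, not the arrow-rs types, and `hydrate_schema` is a hypothetical name, not the function in the encoder.

```rust
// Simplified stand-ins for Arrow's type system. These are NOT the arrow-rs
// types; they exist only to illustrate the schema change.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
    // List of a nested field
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

impl Field {
    pub fn new(name: &str, data_type: DataType) -> Self {
        Field {
            name: name.to_string(),
            data_type,
        }
    }
}

/// Replace a dictionary type with its value type, recursing into nested types.
pub fn hydrate_type(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Dictionary(_key, value) => hydrate_type(value),
        DataType::List(item) => DataType::List(Box::new(hydrate_field(item))),
        other => other.clone(),
    }
}

/// Keep the field name, hydrate its type.
pub fn hydrate_field(field: &Field) -> Field {
    Field {
        name: field.name.clone(),
        data_type: hydrate_type(&field.data_type),
    }
}

/// Hypothetical helper: the schema a client would actually observe on the
/// wire when the encoder hydrates dictionaries.
pub fn hydrate_schema(fields: &[Field]) -> Vec<Field> {
    fields.iter().map(hydrate_field).collect()
}
```

With this model, a query schema containing a `Dictionary(Int32, Utf8)` column hydrates to a plain `Utf8` column, so a server that reports the unhydrated query schema is describing data the client never receives.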

Describe the solution you'd like

A simple solution would be to make this function public API so it can be reused.

Describe alternatives you've considered

I have seen, but not followed closely, work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This, however, requires much more coordination between clients and servers. Additionally, it's not clear that flight_info requests should actually deal in logical instead of physical schemas.

Additional context

My solution for now is to copy the logic into the server implementation. I'd be happy to submit a PR to make the function public if we think that is a good solution.

alamb commented 3 days ago

How about creating a FlightDataEncoder to encode an empty stream and then reading the schema off the stream?

let empty_stream = FlightDataEncoderBuilder::new()
  .with_schema(pre_encoded_schema)
  .build(futures::stream::iter(vec![]));
let schema = empty_stream.schema();

If that works, perhaps we can add an example to the documentation

I would be hesitant to just make prepare_schema_for_flight public, as it seems somewhat brittle: its arguments would need to remain in sync with however the FlightDataEncoder is constructed, yet it uses different types.

> I have seen, but not followed closely, work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This, however, requires much more coordination between clients and servers. Additionally, it's not clear that flight_info requests should actually deal in logical instead of physical schemas.

FWIW the logical type idea will likely remain in DataFusion as there is no concept of LogicalType in the Arrow type system (for better / worse)

nathanielc commented 2 days ago

@alamb Agreed, exposing the API is a fragile solution.

I like your proposed approach; however, the FlightDataEncoder type does not expose a method to access the schema. That would be a small addition to its API, though. Should we add the following function?

 pub fn schema(&self) -> Option<SchemaRef> {
     self.schema.clone()
 }

In cases where the schema is known upfront, it will already have been hydrated; in cases where it is not known upfront, None is returned. Thoughts? Maybe we call the function known_schema to make it clear it's only available when the schema is known upfront?
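The intended semantics of such an accessor can be sketched with a toy model. Everything below (`EncoderBuilder`, `Encoder`, `ColumnType`, and this `Schema`) is a hypothetical stand-in written for this illustration, not the arrow-flight API; it assumes the builder hydrates the schema at build time whenever one is supplied.

```rust
use std::sync::Arc;

// Toy column types: just enough to show one dictionary-encoded case.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int32,
    Utf8,
    DictionaryUtf8,
}

// Toy schema: (column name, column type) pairs. Not arrow's Schema.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub columns: Vec<(String, ColumnType)>,
}

pub type SchemaRef = Arc<Schema>;

// Assumed hydration rule for this model: dictionary columns become plain ones.
fn hydrate(schema: &Schema) -> Schema {
    let columns = schema
        .columns
        .iter()
        .map(|(name, ty)| {
            let ty = match ty {
                ColumnType::DictionaryUtf8 => ColumnType::Utf8,
                other => *other,
            };
            (name.clone(), ty)
        })
        .collect();
    Schema { columns }
}

#[derive(Default)]
pub struct EncoderBuilder {
    schema: Option<SchemaRef>,
}

impl EncoderBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Supply the schema upfront (optional).
    pub fn with_schema(mut self, schema: SchemaRef) -> Self {
        self.schema = Some(schema);
        self
    }

    /// If a schema was supplied, hydrate it once at build time.
    pub fn build(self) -> Encoder {
        Encoder {
            schema: self.schema.map(|s| Arc::new(hydrate(&s))),
        }
    }
}

pub struct Encoder {
    schema: Option<SchemaRef>,
}

impl Encoder {
    /// Some(hydrated schema) when it was known upfront, None otherwise.
    pub fn known_schema(&self) -> Option<SchemaRef> {
        self.schema.clone()
    }
}
```

In this model a server would build an encoder with the query schema and report `known_schema()` from its flight_info handler, treating `None` as "schema not available until the first batch is encoded".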

alamb commented 2 days ago

Makes sense to me!