Closed. oberalppaolof closed this issue 2 months ago.
Hey @aaronsteers, can you assign me this one?
@avirajsingh7 - It's yours! Thanks for volunteering!
@aaronsteers is this the correct idea: convert the data into pandas and then into Arrow?

```python
import pandas as pd
import pyarrow as pa

def get_arrow_table(self, stream_name: str) -> pa.Table:
    table_name = self._read_processor.get_sql_table_name(stream_name)
    engine = self.get_sql_engine()
    df = pd.read_sql_table(table_name, engine, schema=self.schema_name)
    return pa.Table.from_pandas(df)
```

Or should I process the data from the table using _read_processor?
@avirajsingh7 - I think that is a decent default implementation. However, this approach will require that the entire dataset fits into memory.
I've done some research on my side, and I think what we want here is to return a PyArrow Dataset object instead of a PyArrow Table object.
For large datasets that will not fit into memory, I believe the Dataset construct will allow us to break the table's data into chunks when necessary, avoiding crashes in scenarios where RAM is limited or the data size is just huge.
In terms of the actual implementation, I don't think we need very complex handling at this time. As long as the return type allows us to refactor in the future, we should be good with an implementation that just puts the whole thing in memory, similar to your example above.
The Python example here shows a "zero copy" DuckDB version that returns a dataset object:
But for the generic implementation, we might just have to go through pandas.
@aaronsteers pd.read_sql_table has a chunksize option. We can fetch the data as smaller pandas DataFrames, create PyArrow tables from them, and then return a PyArrow Dataset to users:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

def get_arrow_dataset(self, stream_name: str) -> ds.Dataset:
    table_name = self._read_processor.get_sql_table_name(stream_name)
    engine = self.get_sql_engine()

    # Read the table in chunks so no single DataFrame holds the full dataset
    chunks = pd.read_sql_table(table_name, engine, schema=self.schema_name, chunksize=500000)

    # Use the first chunk to fix a common schema for all chunks
    first_chunk = next(chunks)
    combined_schema = pa.Schema.from_pandas(first_chunk)

    arrow_tables = [pa.Table.from_pandas(first_chunk, schema=combined_schema)]
    for chunk in chunks:
        # Convert each remaining chunk to an Arrow table with the same schema
        arrow_tables.append(pa.Table.from_pandas(chunk, schema=combined_schema))

    return ds.dataset(arrow_tables)
```

Once we have the dataset, users can access and manipulate it as needed.
What are your views on this approach?
@aaronsteers Raised a PR for this one.
Arrow is a crucial library, and supporting it natively is a big opportunity. That way we can avoid using to_pandas().