Closed danthegoodman1 closed 1 year ago
Seems like they are available, and my completion was not working?
# return pyarrow table
def to_arrowTable(res):
"""convert res to arrow table"""
# try import pyarrow and pandas, if failed, raise ImportError with suggestion
try:
import pyarrow as pa
import pandas
except ImportError as e:
print(f'ImportError: {e}')
print('Please install pyarrow and pandas via "pip install pyarrow pandas"')
raise ImportError('Failed to import pyarrow or pandas') from None
if len(res) == 0:
return pa.Table.from_batches([], schema=pa.schema([]))
return pa.RecordBatchFileReader(res.bytes()).read_all()
# return pandas dataframe
def to_df(r):
""""convert arrow table to Dataframe"""
t = to_arrowTable(r)
return t.to_pandas(use_threads=True)
as well as methods for getting right from query
:
def query(sql, output_format="CSV"):
lower_output_format = output_format.lower()
if lower_output_format == "dataframe":
return to_df(_chdb.query(sql, "Arrow"))
elif lower_output_format == 'arrowtable':
return to_arrowTable(_chdb.query(sql, "Arrow"))
else:
return _chdb.query(sql, output_format)
Althought I get an error using the to_arrowTable
method:
In [1]: import chdb
In [3]: res = chdb.query("select count() from s3('https://s3.us-west-2.amazonaws.com/ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-0
...: 1-01_performance_mobile_tiles.parquet')")
In [5]: chdb.to_arrowTable(res)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[5], line 1
----> 1 chdb.to_arrowTable(res)
File ~/code/icedb/venv/lib/python3.10/site-packages/chdb/__init__.py:40, in to_arrowTable(res)
38 if len(res) == 0:
39 return pa.Table.from_batches([], schema=pa.schema([]))
---> 40 return pa.RecordBatchFileReader(res.bytes()).read_all()
File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/ipc.py:110, in RecordBatchFileReader.__init__(self, source, footer_offset, options, memory_pool)
107 def __init__(self, source, footer_offset=None, *, options=None,
108 memory_pool=None):
109 options = _ensure_default_ipc_read_options(options)
--> 110 self._open(source, footer_offset=footer_offset,
111 options=options, memory_pool=memory_pool)
File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/ipc.pxi:942, in pyarrow.lib._RecordBatchFileReader._open()
File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: File is too small: 8
The example from https://github.com/chdb-io/chdb/blob/main/examples/to_pandas.py works fine though (to convert to pyarrow table)
Ok it worked the second time... closing issue
You can just use ArrowTable
format
import chdb
res = chdb.query(
"SELECT town, district, count() AS c, round(avg(price)) AS price "
"FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/uk-house-prices/csv/house_prices.csv.gz') "
"GROUP BY town, district HAVING c >= 100 ORDER BY price DESC LIMIT 10",
"ArrowTable")
print(res)
A convenience method to return a dataframe or pyarrow table from a query result would be great. DuckDB has similar features: https://duckdb.org/docs/api/python/overview.html#result-conversion
I think df and arrow are the most important ones, with arrow being the top because it's easy to get a df from an arrow table with another method call (uses more mem though)