chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

`.to_arrowTable` error `ArrowInvalid: File is too small: 8` (0.13.0, amd64, macos) #94

Closed danthegoodman1 closed 1 year ago

danthegoodman1 commented 1 year ago

A convenience method to return a dataframe or pyarrow table from a query result would be great. DuckDB has similar features: https://duckdb.org/docs/api/python/overview.html#result-conversion

I think df and arrow are the most important ones, with arrow being the top because it's easy to get a df from an arrow table with another method call (uses more mem though)

danthegoodman1 commented 1 year ago

Seems like they are available, and my completion was not working?

# return pyarrow table
def to_arrowTable(res):
    """convert res to arrow table"""
    # try import pyarrow and pandas, if failed, raise ImportError with suggestion
    try:
        import pyarrow as pa
        import pandas
    except ImportError as e:
        print(f'ImportError: {e}')
        print('Please install pyarrow and pandas via "pip install pyarrow pandas"')
        raise ImportError('Failed to import pyarrow or pandas') from None
    if len(res) == 0:
        return pa.Table.from_batches([], schema=pa.schema([]))
    return pa.RecordBatchFileReader(res.bytes()).read_all()

# return pandas dataframe
def to_df(r):
    """"convert arrow table to Dataframe"""
    t = to_arrowTable(r)
    return t.to_pandas(use_threads=True)

as well as methods for getting right from query:

def query(sql, output_format="CSV"):
    lower_output_format = output_format.lower()
    if lower_output_format == "dataframe":
        return to_df(_chdb.query(sql, "Arrow"))
    elif lower_output_format == 'arrowtable':
        return to_arrowTable(_chdb.query(sql, "Arrow"))
    else:
        return _chdb.query(sql, output_format)
danthegoodman1 commented 1 year ago

Althought I get an error using the to_arrowTable method:

In [1]: import chdb

In [3]: res = chdb.query("select count() from s3('https://s3.us-west-2.amazonaws.com/ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-0
   ...: 1-01_performance_mobile_tiles.parquet')")

In [5]: chdb.to_arrowTable(res)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 chdb.to_arrowTable(res)

File ~/code/icedb/venv/lib/python3.10/site-packages/chdb/__init__.py:40, in to_arrowTable(res)
     38 if len(res) == 0:
     39     return pa.Table.from_batches([], schema=pa.schema([]))
---> 40 return pa.RecordBatchFileReader(res.bytes()).read_all()

File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/ipc.py:110, in RecordBatchFileReader.__init__(self, source, footer_offset, options, memory_pool)
    107 def __init__(self, source, footer_offset=None, *, options=None,
    108              memory_pool=None):
    109     options = _ensure_default_ipc_read_options(options)
--> 110     self._open(source, footer_offset=footer_offset,
    111                options=options, memory_pool=memory_pool)

File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/ipc.pxi:942, in pyarrow.lib._RecordBatchFileReader._open()

File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/icedb/venv/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: File is too small: 8
danthegoodman1 commented 1 year ago

The example from https://github.com/chdb-io/chdb/blob/main/examples/to_pandas.py works fine though (to convert to pyarrow table)

danthegoodman1 commented 1 year ago

Ok it worked the second time... closing issue

auxten commented 1 year ago

You can just use ArrowTable format

import chdb

res = chdb.query(
   "SELECT town, district, count() AS c, round(avg(price)) AS price "
   "FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/uk-house-prices/csv/house_prices.csv.gz') "
   "GROUP BY town, district HAVING c >= 100 ORDER BY price DESC LIMIT 10",
   "ArrowTable")
print(res)