chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

Binaries instead of Strings in dataframe and arrow exports #92

Closed blackrez closed 1 year ago

blackrez commented 1 year ago

Describe the unexpected behaviour When querying a parquet files with chdb, strings become bytes in dataframe and arrow format. I don't this issue with JSON or CSV.

How to reproduce

The query

SELECT AVG(prix) as prix_moy,
        pdvid,
        name,
        ville,
        type_carburant
FROM  s3('https://********.s3.eu-west-1.amazonaws.com/instantane.parquet', 'Parquet') AS p
LEFT JOIN s3('https://************.s3.eu-west-1.amazonaws.com/station.csv', '*****', '****', 'CSVWithNames') AS stations
ON p.pdvid = stations.id
GROUP BY all
ORDER BY prix_moy DESC;

The results with clickhouse local

Capture d’écran 2023-08-19 à 14 41 11

The result with chdb

In [25]: res
Out[25]:
       prix_moy     pdvid                                    name                        ville type_carburant
0         2.799  49480005  b"BP A11 AIRE DES PORTES D'ANGERS SUD"     b"Saint-Sylvain-D'Anjou"        b'SP98'
1         2.770  75014008                                    None                     b'Paris'        b'SP98'
2         2.740  75014008                                    None                     b'Paris'        b'SP95'
3         2.699  49160003                            b'SARL ROUX'    b'Longu\xc3\xa9-Jumelles'        b'SP98'
4         2.690  75016011                  b'Sarl STATION KLEBER'                     b'Paris'        b'SP98'
blackrez commented 1 year ago

I found the option to use strings instead of binary in arrow.

https://clickhouse.com/docs/en/operations/settings/formats#output_format_arrow_string_as_string

How I can use it in chdb ?

blackrez commented 1 year ago

Found it, just add "SET output_format_arrow_string_as_string=true;" before the request and it works.

Thanks

lmangani commented 1 year ago

Thanks @blackrez for sharing the solution. Should this be documented in any relevant example?

auxten commented 1 year ago

Found it, just add "SET output_format_arrow_string_as_string=true;" before the request and it works.

Thanks

Thanks for the solution, maybe we can turn on the option by default. @nmreadelf

blackrez commented 1 year ago

In streamlit, this options is indispensable. I don't know the consequences but I think it should be activated by default.