chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0

Include time, rows, and bytes read in query result #85

Closed · danthegoodman1 closed 1 year ago

danthegoodman1 commented 1 year ago

Like clickhouse server/local, it would be great if we could get the time spent processing the query, the number of rows processed, and the total bytes read for a given query.

lmangani commented 1 year ago

Hey @danthegoodman1, statistics are included when using the JSON* formats, e.g. JSONCompact (possibly others), but they really only contain the elapsed time, since there is no storage attached and network operations are not accounted for (yet):

:) SELECT 1 FORMAT JSON;
{
    "meta":
    [
        {
            "name": "1",
            "type": "UInt8"
        }
    ],

    "data":
    [
        [1]
    ],

    "rows": 1,

    "statistics":
    {
        "elapsed": 0.002978142,
        "rows_read": 0,
        "bytes_read": 0
    }
}
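
For reference, a minimal sketch of pulling that statistics block out in Python. It assumes chdb.query(sql, "JSON") returns the raw JSON text shown above; treating str(res) as that payload is an assumption about the result object, not a documented guarantee:

import json
import chdb

# Run the query through the JSON format so statistics ride along with the data.
res = chdb.query("SELECT 1", "JSON")
payload = json.loads(str(res))  # assumption: str(res) is the raw JSON shown above

stats = payload["statistics"]
print(stats["elapsed"], stats["rows_read"], stats["bytes_read"])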
danthegoodman1 commented 1 year ago

Thanks for the pointer. I think it'd be very valuable to have for the df (DataFrame) and Arrow formats as well, since those provide a lot of optimizations.

Or even for CSV, for example, where I could probably stream results back to an HTTP client.

lmangani commented 1 year ago

Sadly, not all formats will allow this without poisoning the dataset results, and I doubt CSV could. We need to see where else statistics end up across the various format sources. Note that statistics are returned at the driver level for most native use cases, and are only included in flexible formats such as JSON for the reason above.

danthegoodman1 commented 1 year ago

Could they be included via something like x.fetchstats(), similar to how DuckDB has ddb.fetchall() to retrieve results?
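
A purely hypothetical sketch of what that could look like, layered on top of the JSON format. QueryHandle and fetchstats() are made-up names for illustration, not chDB's actual API:

import json
import chdb

class QueryHandle:
    # Hypothetical wrapper: runs via the JSON format so the statistics
    # block comes back alongside the row data.
    def __init__(self, sql: str):
        self._payload = json.loads(str(chdb.query(sql, "JSON")))

    def fetchall(self):
        # Row data, DuckDB-style.
        return self._payload["data"]

    def fetchstats(self):
        # elapsed / rows_read / bytes_read, kept separate from the rows.
        return self._payload["statistics"]

q = QueryHandle("SELECT 1")
print(q.fetchall())    # [[1]]
print(q.fetchstats())  # {'elapsed': ..., 'rows_read': 0, 'bytes_read': 0}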

lmangani commented 1 year ago

Possibly. That's a good question for @auxten. I suppose the uniform way would be to return our own chdb statistics and measure response sizes, etc. in the middle.
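
A rough sketch of that "measure in the middle" idea: time the call and size the response in the wrapper itself, independent of output format. These are wrapper-level numbers (wall-clock time, formatted-output size), not ClickHouse's internal rows_read/bytes_read, and str(res) as the formatted output is again an assumption:

import time
import chdb

def query_with_stats(sql: str, fmt: str = "CSV"):
    start = time.perf_counter()
    res = chdb.query(sql, fmt)
    out = str(res)  # assumption: str(res) is the formatted output
    stats = {
        "elapsed": time.perf_counter() - start,      # wall clock, includes Python overhead
        "response_bytes": len(out.encode("utf-8")),  # size of the formatted response
    }
    return out, stats

out, stats = query_with_stats("SELECT number FROM system.numbers LIMIT 3")
print(stats)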

danthegoodman1 commented 1 year ago

@lmangani I would definitely need bytes read to be able to use this for bighouse, since that's the only fair usage metric: you never know when network hiccups will cause extended query times.