GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

Perf: Querying a dataframe created from postgres query is significantly slower than the parquet equivalent #1337

Open scsmithr opened 1 year ago

scsmithr commented 1 year ago

Context

Using this Python script:

import glaredb
import pandas as pd
import time

con = glaredb.connect()

def time_run(func):
    start_time = time.monotonic()
    func()
    return time.monotonic() - start_time

con.execute("""
CREATE EXTERNAL DATABASE demo_pg
    FROM postgres
    OPTIONS (
        host = 'pg.demo.glaredb.com',
        port = '5432',
        user = 'demo',
        password = 'demo',
        database = 'postgres'
    );
""")

lineitem_pg = con.sql("select * from demo_pg.public.lineitem").to_pandas()
lineitem_parquet = con.sql("select * from parquet_scan('../benchmarks/artifacts/tpch_1/lineitem/part-0.parquet')").to_pandas()

def q1_pg():
    df = con.sql("""
    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem_pg
    where
        l_shipdate <= '1998-09-02'
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus
    """).to_pandas()
    print(df)

def q1_parquet():
    df = con.sql("""
    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem_parquet
    where
        l_shipdate <= '1998-09-02'
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus
    """).to_pandas()
    print(df)

pg_time = time_run(q1_pg)
print(f"df from pg: {pg_time}")

parquet_time = time_run(q1_parquet)
print(f"df from parquet: {parquet_time}")

Output:

  l_returnflag l_linestatus             sum_qty          sum_base_price  ...           avg_qty            avg_price         avg_disc count_order
0            A            F  37734107.000000000   56586554400.730000000  ...  25.5220058532573  38273.1297346216722  0.0499852958383     1478493
1            N            F    991417.000000000    1487504710.380000000  ...  25.5164719205229  38284.4677608483039  0.0500934266742       38854
2            N            O  74476040.000000000  111701729697.740000000  ...  25.5022267695849  38249.1179889082699  0.0499965860537     2920374
3            R            F  37719753.000000000   56568041380.900000000  ...  25.5057936126907  38250.8546260996571  0.0500094058301     1478870

[4 rows x 10 columns]
df from pg: 86.64136579097249
  l_returnflag l_linestatus      sum_qty   sum_base_price     sum_disc_price           sum_charge    avg_qty     avg_price  avg_disc  count_order
0            A            F  37734107.00   56586554400.73   53758257134.8700   55909065222.827692  25.522005  38273.129734  0.049985      1478493
1            N            F    991417.00    1487504710.38    1413082168.0541    1469649223.194375  25.516471  38284.467760  0.050093        38854
2            N            O  74476040.00  111701729697.74  106118230307.6056  110367043872.497010  25.502226  38249.117988  0.049996      2920374
3            R            F  37719753.00   56568041380.90   53741292684.6040   55889619119.831932  25.505793  38250.854626  0.050009      1478870
df from parquet: 3.763112667016685

Note that the data frame created from querying postgres results in an execution time of 86s, while the data frame created from scanning a parquet file executes in 3.7s.

Both data frames should contain the same data (as the demo pg was loaded with the TPC-H SF1 data).

Observed Behavior

Slower execution times on data frames resulting from postgres queries.

Expected Behavior

Execution times should be similar with data frames created from parquet scans or postgres scans.

Possible Solutions

Probably something to do with decimals. Note in the output above that the postgres-sourced frame prints full-precision decimal values while the parquet-sourced frame prints floats, which suggests the decimal columns come back in a slower, non-native representation.
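One way to sanity-check the decimals hypothesis (a minimal sketch, not GlareDB code): if the postgres scan hands pandas Python `Decimal` objects, the columns land as `object` dtype, and aggregating them falls back to Python-level iteration instead of vectorized float64 math:

```python
# Sketch: compare aggregating a native float64 column against an
# object-dtype column of Python Decimals. If the postgres-sourced frame
# holds Decimals, this kind of slowdown would explain the timing gap.
from decimal import Decimal
import time

import pandas as pd

n = 1_000_000
floats = pd.Series([0.05] * n, dtype="float64")
decimals = pd.Series([Decimal("0.05")] * n, dtype="object")

start = time.monotonic()
float_sum = floats.sum()
float_time = time.monotonic() - start

start = time.monotonic()
decimal_sum = decimals.sum()
decimal_time = time.monotonic() - start

print(f"float64 sum: {float_time:.4f}s, object/Decimal sum: {decimal_time:.4f}s")
```

The object-dtype sum is typically at least an order of magnitude slower, and the same penalty would apply to every arithmetic expression in the query.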

universalmind303 commented 1 year ago

I think postgres will always be a lot slower since it is a row-based database and parquet is columnar. A better comparison may be something like a remote CSV file versus postgres.

We may have to do something like connectorx and parallelize the queries.

scsmithr commented 1 year ago

We may have to do something like connectorx and parallelize the queries.

Definitely want to add this in eventually. We already get some stats about the table, which would let us spin up multiple workers to read.
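The connectorx-style partitioned read could be sketched roughly like this (an illustration only, not GlareDB's implementation; the partition column, key range, and `fetch_chunk` placeholder are all assumptions):

```python
# Sketch of a partitioned scan: split the key range from table stats into
# N contiguous chunks and have each worker read one chunk concurrently.
from concurrent.futures import ThreadPoolExecutor

def partition_clauses(column, min_val, max_val, num_partitions):
    """Build non-overlapping WHERE clauses covering [min_val, max_val]."""
    step = (max_val - min_val + 1) // num_partitions
    clauses = []
    lo = min_val
    for i in range(num_partitions):
        hi = max_val if i == num_partitions - 1 else lo + step - 1
        clauses.append(f"{column} >= {lo} AND {column} <= {hi}")
        lo = hi + 1
    return clauses

def fetch_chunk(clause):
    # Placeholder: a real worker would open its own postgres connection
    # and run e.g. SELECT * FROM lineitem WHERE <clause>, returning rows.
    return clause

clauses = partition_clauses("l_orderkey", 1, 6_000_000, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(fetch_chunk, clauses))
print(clauses)
```

Each clause covers a disjoint slice of the key space, so the partial results can simply be concatenated.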

The slowness I'm observing here shouldn't be from reading a remote source. See here:

lineitem_pg = con.sql("select * from demo_pg.public.lineitem").to_pandas()
lineitem_parquet = con.sql("select * from parquet_scan('../benchmarks/artifacts/tpch_1/lineitem/part-0.parquet')").to_pandas()

I'm loading everything into a data frame, and then querying that data frame. The loading from the remote source shouldn't be counted in the timing.
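If decimals turn out to be the culprit, a user-side workaround (a sketch under that assumption; the frame below is a stand-in for a postgres-sourced result) would be casting object-dtype decimal columns to float64 before querying the frame:

```python
# Sketch: detect object-dtype columns holding Decimals and cast them to
# float64 so subsequent queries use native vectorized aggregation.
from decimal import Decimal

import pandas as pd

# Hypothetical stand-in for a frame materialized from a postgres scan.
df = pd.DataFrame({
    "l_quantity": [Decimal("17"), Decimal("36")],
    "l_returnflag": ["A", "N"],
})

for col in df.columns:
    if df[col].dtype == object and isinstance(df[col].iloc[0], Decimal):
        df[col] = df[col].astype("float64")

print(df.dtypes)
```

This trades exact decimal semantics for float64 speed, which is acceptable for a benchmark comparison like the one above.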