scsmithr opened 1 year ago
I think postgres will always be a lot slower since it's a row-based database and parquet is columnar. A better comparison may be something like a remote csv file vs. postgres.
We may have to do something like connectorx and parallelize the queries.
Definitely want to add this in eventually. We already get some stats about the table which would let us spin up multiple workers to read.
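For reference, a minimal sketch of what a connectorx-based partitioned read might look like; the connection string, partition column, and partition count here are assumptions, not taken from the issue:

```python
# Sketch: connectorx splits the query into partition_num range queries on
# partition_on and reads them in parallel, returning one pandas frame.
# Connection string and partition settings are illustrative only.
import connectorx as cx

lineitem = cx.read_sql(
    "postgresql://user:pass@localhost:5432/demo_pg",
    "SELECT * FROM public.lineitem",
    partition_on="l_orderkey",  # numeric column to split the scan on
    partition_num=8,            # run 8 range queries in parallel
    return_type="pandas",
)
```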
The slowness I'm observing here shouldn't be from reading a remote source. See here:
lineitem_pg = con.sql("select * from demo_pg.public.lineitem").to_pandas()
lineitem_parquet = con.sql("select * from parquet_scan('../benchmarks/artifacts/tpch_1/lineitem/part-0.parquet')").to_pandas()
I'm loading everything into a data frame, and then querying that data frame. The loading from the remote source shouldn't be counted in the timing.
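One quick check that the two frames really are equivalent is to diff their dtypes; a small sketch using the variables from the snippet above:

```python
# Compare column dtypes across the two frames. If the postgres-sourced frame
# holds object columns (e.g. python Decimal values) where the parquet-sourced
# frame holds float64, pandas falls back to slow per-object operations.
import pandas as pd

dtypes = pd.DataFrame({
    "pg": lineitem_pg.dtypes,
    "parquet": lineitem_parquet.dtypes,
})
print(dtypes[dtypes["pg"] != dtypes["parquet"]])
```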
Context
Using this Python script:
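The script itself isn't reproduced in the issue; below is a minimal reconstruction from the snippets above. The timed aggregation is an assumption, and it relies on GlareDB's Python bindings resolving local data frames by variable name:

```python
# Reconstruction (not the original script). Loads lineitem once from each
# source, then times the same aggregation against each local data frame.
import time
import glaredb

con = glaredb.connect()

lineitem_pg = con.sql("select * from demo_pg.public.lineitem").to_pandas()
lineitem_parquet = con.sql(
    "select * from parquet_scan('../benchmarks/artifacts/tpch_1/lineitem/part-0.parquet')"
).to_pandas()

def timed(sql: str) -> float:
    start = time.monotonic()
    con.sql(sql).to_pandas()  # force execution
    return time.monotonic() - start

print("pg:     ", timed("select sum(l_extendedprice) from lineitem_pg"))
print("parquet:", timed("select sum(l_extendedprice) from lineitem_parquet"))
```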
Output:
Note that the data frame created from querying postgres results in an execution time of 86s, while the data frame created from scanning a parquet file results in an execution time of only 3.7s.
Both data frames should contain the same data (the demo pg database was loaded with the TPC-H SF1 data).
Observed Behavior
Slower execution times on data frames resulting from postgres queries.
Expected Behavior
Execution times should be similar for data frames created from parquet scans and postgres scans.
Possible Solutions
Probably something to do with decimals. If the postgres scan surfaces NUMERIC columns as Python Decimal objects (object dtype in pandas) while the parquet scan surfaces them as native float64, aggregations over the postgres-sourced frame would fall back to slow per-object operations.
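A quick way to test that hypothesis is to coerce the Decimal-typed columns to float64 and re-run the timing; a minimal sketch, assuming the frame variables from the snippets above:

```python
# Sketch: cast any object-dtype column holding python Decimal values to
# float64, then re-run the timed query against lineitem_pg. If decimals are
# the culprit, the postgres timing should drop close to the parquet timing.
from decimal import Decimal

for col in lineitem_pg.columns:
    s = lineitem_pg[col]
    if s.dtype == object and isinstance(s.iloc[0], Decimal):
        lineitem_pg[col] = s.astype("float64")
```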