iwaniwaniwan012 opened 6 years ago
Can you please provide the output of the following commands?

beeline: `EXPLAIN EXTENDED SELECT count(*) FROM big_table;`
psql: `EXPLAIN VERBOSE SELECT count(*) FROM big_table;`
Please note that aggregate pushdown is not available in hdfs_fdw 2.0.3.
```
explain extended select count(*) from test;

== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'UnresolvedRelation test

== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#28L]
+- MetastoreRelation default, test

== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#28L]
+- Project
   +- MetastoreRelation default, test

== Physical Plan ==
HashAggregate(keys=[], functions=[count(1)], output=[count(1)#28L])
+- Exchange SinglePartition
   +- HashAggregate(keys=[], functions=[partial_count(1)], output=[count#30L])
      +- HiveTableScan MetastoreRelation default, test
```
```
Aggregate  (cost=1100002.50..1100002.51 rows=1 width=8)
  Output: count(*)
  ->  Foreign Scan on public.spark_table  (cost=100000.00..1100000.00 rows=1000 width=0)
        Output: id, txt, tm
        Remote SQL: SELECT * FROM default.test
```
In the case of beeline, a map-reduce job is initiated on the server to perform the hash aggregate. In the hdfs_fdw case, all rows are first fetched to PostgreSQL and aggregated locally, which is what triggers the OOM error. This will work once we add support for pushing aggregates down to the Spark/Hive server in the remote query.
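Until aggregate pushdown is available, one possible workaround is to pre-aggregate on the Spark/Hive side and map a foreign table to the result, so only a single row crosses the wire instead of the full table. A minimal sketch (the view, server, and table names here are illustrative, not from this thread):

```sql
-- On the Spark/Hive side: materialize the aggregate so the
-- foreign scan only needs to fetch one row.
CREATE VIEW test_count AS SELECT count(*) AS cnt FROM test;

-- On the PostgreSQL side: map a foreign table to that view
-- (hdfs_server is an assumed, previously created hdfs_fdw server).
CREATE FOREIGN TABLE spark_test_count (cnt bigint)
    SERVER hdfs_server
    OPTIONS (dbname 'default', table_name 'test_count');

-- Fetches one pre-aggregated row rather than every row of the table.
SELECT cnt FROM spark_test_count;
```

The aggregate still runs on the Spark side (as in the beeline case), so the PostgreSQL server never has to hold the 100M rows in memory.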
Thanks. Are you planning to provide this support soon?
I tested hdfs_fdw with Spark. In Spark I created a table from a local file with 100M rows. In Spark beeline the `select count(*)` works, but through hdfs_fdw the PG server gets an OOM error, and the Spark Thrift server fails too. PG 9.6, Spark 2.2.0, hdfs_fdw 2.0.3.
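For anyone trying to reproduce this, a minimal hdfs_fdw setup against the Spark Thrift server might look like the following sketch. The host, port, credentials, and server name are assumptions; the column list matches the `id, txt, tm` output shown in the plan above:

```sql
CREATE EXTENSION hdfs_fdw;

-- Point at the Spark Thrift server (host/port/credentials are assumptions).
CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host '127.0.0.1', port '10000', client_type 'spark');

CREATE USER MAPPING FOR postgres
    SERVER hdfs_server
    OPTIONS (username 'spark', password 'spark');

-- Column list must match the 100M-row Spark table.
CREATE FOREIGN TABLE spark_table (id int, txt text, tm timestamp)
    SERVER hdfs_server
    OPTIONS (dbname 'default', table_name 'test');

-- With hdfs_fdw 2.0.3 this fetches every row before aggregating locally,
-- which is what triggers the OOM described above.
SELECT count(*) FROM spark_table;
```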