iwaniwaniwan012 opened 6 years ago
Can you please provide the output of the following commands?

beeline: `EXPLAIN EXTENDED SELECT count(*) FROM big_table;`
psql: `EXPLAIN VERBOSE SELECT count(*) FROM big_table;`
Please note that aggregate pushdown is not available in hdfs_fdw 2.0.3.
```
explain extended select count(*) from test;

== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'UnresolvedRelation test

== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#28L]
+- MetastoreRelation default, test

== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#28L]
+- Project
   +- MetastoreRelation default, test

== Physical Plan ==
HashAggregate(keys=[], functions=[count(1)], output=[count(1)#28L])
+- Exchange SinglePartition
   +- HashAggregate(keys=[], functions=[partial_count(1)], output=[count#30L])
      +- HiveTableScan MetastoreRelation default, test
```
```
Aggregate  (cost=1100002.50..1100002.51 rows=1 width=8)
  Output: count(*)
  ->  Foreign Scan on public.spark_table  (cost=100000.00..1100000.00 rows=1000 width=0)
        Output: id, txt, tm
        Remote SQL: SELECT * FROM default.test
```
In the case of beeline, a map-reduce job is initiated on the server to perform the hash aggregate. In the hdfs_fdw case, all rows are first fetched to PostgreSQL and aggregated locally, which is what triggers the OOM error. This will work once we add support for pushing aggregates down to the Spark/Hive server in the remote query.
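Until aggregate pushdown is available, one possible workaround is to pre-aggregate on the Spark/Hive side and map a foreign table to the result, so only a single row crosses the wire instead of the full table. A minimal sketch (the view, server, and table names here are illustrative, not from this thread):

```sql
-- On the Spark/Hive side: materialize the aggregate so the
-- foreign scan only needs to fetch one row.
CREATE VIEW test_count AS SELECT count(*) AS cnt FROM test;

-- On the PostgreSQL side: map a foreign table to that view
-- (hdfs_server is an assumed, previously created hdfs_fdw server).
CREATE FOREIGN TABLE spark_test_count (cnt bigint)
    SERVER hdfs_server
    OPTIONS (dbname 'default', table_name 'test_count');

-- Fetches one pre-aggregated row rather than every row of the table.
SELECT cnt FROM spark_test_count;
```

The aggregate still runs on the Spark side (as in the beeline case), so the PostgreSQL server never has to hold the 100M rows in memory.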
Thanks. Are you planning to provide this support soon?
I tested hdfs_fdw with Spark. In Spark I created a table from a local file with 100M rows. In Spark beeline the `select count(*)` works, but through hdfs_fdw the PG server gets an OOM error, and the Spark Thrift server fails too. PG 9.6, Spark 2.2.0, hdfs_fdw 2.0.3.
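For anyone trying to reproduce this, a minimal hdfs_fdw setup against the Spark Thrift server might look like the following sketch. The host, port, credentials, and server name are assumptions; the column list matches the `id, txt, tm` output shown in the plan above:

```sql
CREATE EXTENSION hdfs_fdw;

-- Point at the Spark Thrift server (host/port/credentials are assumptions).
CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host '127.0.0.1', port '10000', client_type 'spark');

CREATE USER MAPPING FOR postgres
    SERVER hdfs_server
    OPTIONS (username 'spark', password 'spark');

-- Column list must match the 100M-row Spark table.
CREATE FOREIGN TABLE spark_table (id int, txt text, tm timestamp)
    SERVER hdfs_server
    OPTIONS (dbname 'default', table_name 'test');

-- With hdfs_fdw 2.0.3 this fetches every row before aggregating locally,
-- which is what triggers the OOM described above.
SELECT count(*) FROM spark_table;
```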