After these fixes, all the queries run fine except the ones
containing correlated subqueries. Queries 9, 17 and 20 are
examples of correlated subqueries; they do not return even
after hours of waiting with a couple of GBs of data.
The patch fixes the following issues:
On every query execution the FDW used to issue a DESC table_name
statement to the Hive server. Previously the FDW needed each
column's type while retrieving its value; the column type is no
longer needed, so this round trip is eliminated and all the code
associated with it is removed. This improves performance as well
as cleaning up the code.
Adds support for pushing down parameterized WHERE clauses.
To make this possible, query execution is broken down into
three steps: PREPARE, BIND and EXECUTE.
On a rescan the new parameter values are bound and the same
prepared query is executed again.
NULL values of bound parameters are not supported.
The JDBC Time type is also not supported.
The following parameter types are supported:
INT2OID
INT4OID
INT8OID
FLOAT4OID
FLOAT8OID
NUMERICOID
BOOLOID
BPCHAROID
VARCHAROID
TEXTOID
JSONOID
NAMEOID
DATEOID
TIMEOID
TIMESTAMPOID
TIMESTAMPTZOID
BITOID
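Conceptually, the pushed-down remote query uses JDBC-style '?'
placeholders; the table and column names below are hypothetical
and only illustrate the shape of the three steps:

```sql
-- PREPARE: the FDW builds the remote query once, with a '?'
-- placeholder for each parameter (hypothetical names):
SELECT l_orderkey, l_quantity
FROM lineitem
WHERE l_quantity > ?;

-- BIND + EXECUTE: on each scan (and each rescan) the current
-- parameter value is bound through the JDBC PreparedStatement
-- API and the same prepared query is executed again.
```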
The FDW's costing system was not correct, and query costs were
coming out as negative values. The patch corrects the costing
mechanism and increases the ForeignScan node costs to make sure
that the planner inserts Materialize nodes at appropriate places
in the plan tree, avoiding unnecessary rescans. This is a major
improvement and makes most of the TPC-H queries execute in a
reasonable time frame.
The option use_remote_estimate was not working.
The patch corrects the syntax error in the remote query.
The FDW was also using count(*) to get the number of rows in the
remote table. Executing count(*) on the Hive table took some
4-5 seconds, so enabling remote estimates made queries slower.
This patch changes the FDW to use EXPLAIN SELECT on the remote
table instead. While this works only if the Hive tables are
analysed, i.e. their stats are updated, it is very fast compared
to count(*) and does not slow down overall query execution.
The minimum row count that a remote estimate can return is 1000.
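For illustration (hypothetical table name), the row-count
estimate now comes from a Hive-side statement of roughly this
shape, which yields useful numbers only when the table's stats
are up to date:

```sql
-- Issued by the FDW against the Hive server (shape is
-- illustrative, not the FDW's exact statement):
EXPLAIN SELECT * FROM lineitem;

-- Hive reports accurate row counts only after the table has
-- been analysed, e.g. with standard HiveQL:
ANALYZE TABLE lineitem COMPUTE STATISTICS;
```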
Adds a new option for creating the remote server: fetch_size.
The default value is 10,000. For example:
CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw
OPTIONS (host '54.90.191.116',
use_remote_estimate 'true',
fetch_size '100000');
The user-specified value is passed as the parameter to the JDBC
API setFetchSize(fetch_size).
Adds a new option for creating the remote server: log_remote_sql.
The default value is false. If set to true, the FDW logs:
All the SQL statements sent to the remote Hive server.
The number of times a remote scan is repeated, i.e. rescanned.
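Like fetch_size, this option is set on the foreign server. A
minimal sketch, assuming the server from the earlier example
already exists:

```sql
-- Enable remote SQL logging on an existing server definition:
ALTER SERVER hdfs_server OPTIONS (ADD log_remote_sql 'true');

-- Or set it at creation time:
-- CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw
--     OPTIONS (host '54.90.191.116', log_remote_sql 'true');
```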
Adds an IS_DEBUG compile-time option to emit debug messages in
the log.
Removes unnecessary members from the HDFSFdwScanState structure.
TODO Items:
For accurate row counts we should append WHERE clauses
to the EXPLAIN SELECT statement; if the WHERE clause
is parameterized we should handle it the way postgres_fdw does.
Add support for NULL values as parameters.
Support the JDBC Time type as a parameter.
Add specific test cases for each supported parameter type.