EnterpriseDB / hdfs_fdw

PostgreSQL foreign data wrapper for HDFS

Fix issue #46, Eliminate issues in HDFS_FDW to run TPC-H queries. #47

Closed gabbasb closed 7 years ago

gabbasb commented 7 years ago

With these fixes in place, all of the TPC-H queries run fine except the ones containing correlated subqueries. Queries 9, 17 and 20 contain correlated subqueries; even with only a couple of GBs of data they do not return after hours of waiting. (Q17 is sketched below for reference.)
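
For reference, TPC-H Q17 is one of the correlated cases: its inner query references the outer part row, so it must be re-evaluated for every outer row. Paraphrased from memory of the TPC-H specification, assuming the standard schema and the usual substitution values:

    SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
    FROM lineitem, part
    WHERE p_partkey = l_partkey
      AND p_brand = 'Brand#23'
      AND p_container = 'MED BOX'
      AND l_quantity < (SELECT 0.2 * AVG(l_quantity)
                        FROM lineitem
                        WHERE l_partkey = p_partkey);

The correlated reference to p_partkey forces a rescan of the inner lineitem scan for each outer row, which is why these queries are so sensitive to rescan cost.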

The patch fixes the following issues:

  1. On every query execution the FDW used to issue a DESC table_name statement to the Hive server, because it needed each column's type while retrieving the column's value. The column type is no longer needed, so this patch removes the command and all of the code associated with it, eliminating the round trip. This improves performance and cleans up the code.

  2. Adds support for pushing down parameterized WHERE clauses (see the first sketch after this list). To make this possible, query execution is broken down into three steps: PREPARE, BIND and EXECUTE. On a rescan the new parameter values are bound and the same prepared query is executed again. NULL values of bound parameters are not supported, and neither is the JDBC Time type. The following parameter types are supported: INT2OID, INT4OID, INT8OID, FLOAT4OID, FLOAT8OID, NUMERICOID, BOOLOID, BPCHAROID, VARCHAROID, TEXTOID, JSONOID, NAMEOID, DATEOID, TIMEOID, TIMESTAMPOID, TIMESTAMPTZOID, BITOID.

  3. The FDW's costing was incorrect and query costs were coming out as negative values. The patch corrects the costing mechanism and increases the ForeignScan node costs so that the planner inserts Materialize nodes at appropriate places in the plan tree, avoiding unnecessary rescans. This is a major improvement and makes most of the TPC-H queries execute in a reasonable time frame.

  4. The use_remote_estimate option was not working. The patch corrects the syntax error in the remote query. The FDW was also using count() to get the number of rows in the remote table; running count() on a Hive table took some 4-5 seconds, so enabling remote estimates made every query slow. This patch changes the FDW to issue an EXPLAIN SELECT against the remote table instead (see the second sketch after this list). This only works if the Hive tables have been analyzed, i.e. their statistics are up to date, but it is very fast compared to count() and does not slow down overall query execution. The minimum row count that a remote estimate can return is 1000.

  5. Adds a new option to CREATE SERVER: fetch_size. The default value is 10,000. For example: CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '54.90.191.116', use_remote_estimate 'true', fetch_size '100000'); The user-specified value is passed to the JDBC call setFetchSize(fetch_size).

  6. Adds a new option to CREATE SERVER: log_remote_sql. The default value is false. If set to true, the FDW logs the following (see the last sketch after this list):

    • All SQL statements sent to the remote Hive server.
    • The number of times a remote scan is repeated, i.e. rescanned.
  7. Adds an IS_DEBUG compile-time option that adds debug messages to the log.

  8. Removes unnecessary members from the HDFSFdwScanState structure.
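
To illustrate item 2, here is a minimal sketch of the kind of query that benefits. The table and column names are made up for illustration; the foreign table options follow the usual hdfs_fdw style (dbname/table_name):

    -- hypothetical foreign table pointing at a Hive table
    CREATE FOREIGN TABLE sales_fact (item_id int, amount float8)
        SERVER hdfs_server OPTIONS (dbname 'fdw_db', table_name 'sales_fact');

    -- if the planner puts the foreign scan on the inner side of a nested
    -- loop, the join qual becomes a parameterized WHERE clause; the remote
    -- query is prepared once, and on every rescan the new outer value is
    -- bound and the same prepared query is executed again
    SELECT i.item_name, s.amount
    FROM local_items i
    JOIN sales_fact s ON s.item_id = i.item_id
    WHERE i.item_name = 'widget';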
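
For item 4, the remote estimate is only accurate if the Hive tables carry statistics, so they should be analyzed first. A rough sketch of the Hive-side statements involved (exact syntax may vary by Hive version, and the EXPLAIN shown is only illustrative of what the FDW sends):

    -- run in Hive/beeline so that plan output carries row counts
    ANALYZE TABLE lineitem COMPUTE STATISTICS;

    -- with use_remote_estimate 'true' the FDW now issues an EXPLAIN of the
    -- remote select instead of a count(), reads the row count out of the
    -- plan, and clamps any estimate below 1000 rows up to 1000
    EXPLAIN SELECT * FROM lineitem;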
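
Items 5 and 6 combined in one statement, reusing the example from item 5 (log_remote_sql is added here purely for illustration):

    CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw
        OPTIONS (host '54.90.191.116',
                 use_remote_estimate 'true',
                 fetch_size '100000',     -- passed to JDBC setFetchSize()
                 log_remote_sql 'true');  -- log remote SQL and rescan counts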

TODO Items:

  1. For accurate row counts we should append the WHERE clause to the EXPLAIN SELECT statement, but if the WHERE clause is parameterized we should handle it the way postgres_fdw does.

  2. Add support for NULL values as parameters.

  3. Support JDBC Time type as a parameter.

  4. Add specific test cases for each bound parameter type.