exasol / spark-connector

A connector for Apache Spark to access Exasol
Apache License 2.0

Pushdown dataframe count action #24

Closed morazow closed 6 years ago

morazow commented 6 years ago

If a df.count action is run on a connector dataframe, we do not have to send any data over the network to create an ExasolRDD. Instead, we can obtain the count with a single JDBC call on the main connection and, using sqlContext, create an RDD containing that many empty rows.

df.count can be detected via the requiredColumns parameter of buildScan: if requiredColumns is empty, only the row count is needed.

Otherwise, continue with the usual ExasolRDD.
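A minimal sketch of the proposed branching, simulated in plain Scala so it runs without a Spark cluster. The names `jdbcCount` and `buildScan` stand in for the real connector's JDBC helper and `PrunedFilteredScan.buildScan`; the JDBC call is mocked, and plain `Seq`s stand in for RDDs:

```scala
object CountPushdownSketch {
  // Stand-in for running "SELECT COUNT(*) FROM <table>" on the
  // main JDBC connection (mocked result; hypothetical helper).
  def jdbcCount(table: String): Long = 5L

  // Stand-in for buildScan(requiredColumns, ...): in the real
  // connector this would return an RDD[Row], here a Seq of rows.
  def buildScan(requiredColumns: Array[String], table: String): Seq[Seq[Any]] =
    if (requiredColumns.isEmpty) {
      // df.count path: one JDBC round trip, then `count` empty rows,
      // so no table data crosses the network.
      val cnt = jdbcCount(table)
      Seq.fill(cnt.toInt)(Seq.empty[Any])
    } else {
      // Usual path: would build an ExasolRDD reading the
      // projected columns (placeholder row here).
      Seq(Seq("col-data"))
    }

  def main(args: Array[String]): Unit = {
    // The count action only needs the number of rows, which the
    // empty-row Seq preserves.
    println(buildScan(Array.empty[String], "my_table").size)
  }
}
```

Spark itself computes `df.count` by counting the rows of the scan's RDD, so returning empty rows preserves the result while skipping the data transfer.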

3cham commented 6 years ago

Hi @morazow, back after a crazy week. I think we could also improve the show() method; users often call it to get a quick glance at the underlying data, and it is a job with only one task:

INFO DAGScheduler: Submitting 1 missing tasks from ResultStage

So it runs completely fine in the test environment, but in a cluster environment it will cause the known connection error.

One idea is to define some kind of metadata info object for the Exasol relation. What do you think about this, @morazow?
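One way the metadata-object idea could be sketched: a small case class the Exasol relation populates once via cheap single-connection JDBC queries, so actions like count() (and possibly show()) can be answered without a full distributed read. All names here are hypothetical, and the JDBC lookup is mocked:

```scala
// Hypothetical metadata holder for an Exasol relation; not the
// connector's actual API.
final case class ExasolRelationMetadata(
  tableName: String,
  rowCount: Long,
  columnNames: Seq[String]
)

object MetadataSketch {
  // Assumption: the real connector would fill this with single JDBC
  // queries (e.g. COUNT(*) and a schema lookup) on the main connection.
  def fetchMetadata(tableName: String): ExasolRelationMetadata =
    ExasolRelationMetadata(tableName, rowCount = 42L, columnNames = Seq("A", "B"))

  def main(args: Array[String]): Unit = {
    val meta = fetchMetadata("my_table")
    // count() could be served from meta.rowCount without a cluster scan.
    println(meta.rowCount)
  }
}
```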

morazow commented 6 years ago

Hello @3cham ,

Sure, I think it is a good idea! Could you please create a separate issue for show?

To be honest, I have not tested .show in a distributed environment yet.