Closed morazow closed 6 years ago
Hi @morazow, back after a crazy week. I think we could also improve the show() method. Users often call it to take a quick look at the underlying data. It runs as a single-task job:
INFO DAGScheduler: Submitting 1 missing tasks from ResultStage
So in the test environment it runs completely fine. However, if it runs in a cluster environment, it will cause the known connection error.
One idea is to define a kind of metadata info object for the Exasol relation. What do you think about this, @morazow?
Hello @3cham,
Sure. I think it is a good idea! Could you please create a separate issue for show?
To be honest, I have not tested .show in a distributed environment yet.
If there is a df.count operation on a connector dataframe, we do not have to send data over the network to create an ExasolRDD. We can get the count with a single JDBC call on the main connection and use sqlContext to create an RDD with that many empty rows.
A df.count can be detected using requiredColumns from buildScan. For example, if requiredColumns is empty, run:
"SELECT count(*) FROM $originalQuery WHERE $whereClause"
and return:
sqlContext.sparkContext.parallelize(1 to cnt).map(_ => Row.empty)
Otherwise continue with the usual ExasolRDD.
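The count shortcut described above could look roughly like the sketch below. It is not the connector's actual code: the object name, `countQuery`, and the `runCount` and `fullScan` parameters are hypothetical placeholders for the JDBC call on the main connection and for the usual ExasolRDD construction. It also uses `SparkContext.range` instead of `parallelize(1 to cnt)` so that the count can stay a `Long`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

object CountPushdownSketch {

  // Hypothetical helper: builds the count query from the pieces named in the issue.
  // Assumes whereClause is non-empty; an empty filter would need the WHERE dropped.
  def countQuery(originalQuery: String, whereClause: String): String =
    s"SELECT COUNT(*) FROM $originalQuery WHERE $whereClause"

  // Sketch of the buildScan branching.
  //   runCount: hypothetical, runs the count query over the main JDBC connection
  //   fullScan: hypothetical, builds the usual ExasolRDD
  def buildScanSketch(
      sqlContext: SQLContext,
      requiredColumns: Array[String],
      runCount: () => Long,
      fullScan: () => RDD[Row]
  ): RDD[Row] =
    if (requiredColumns.isEmpty) {
      // No columns requested (as with df.count): only the number of rows matters,
      // so return cnt empty rows without reading any data from Exasol.
      val cnt = runCount()
      sqlContext.sparkContext.range(0L, cnt).map(_ => Row.empty)
    } else {
      // Columns are needed: fall back to the regular distributed read.
      fullScan()
    }
}
```

Passing the count and the full scan in as functions keeps the sketch independent of how the connector manages its JDBC connection, so the branching on `requiredColumns` can be tried out with a local `SparkContext`.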