CODAIT / spark-netezza

Netezza Connector for Apache Spark
Apache License 2.0

Support Views #13

Open Highbrainer opened 7 years ago

Highbrainer commented 7 years ago

Hi, I have tested version 0.1.1 from Maven Central.

I discovered that this data source does not support views, because Netezza itself does not support the concept of a data slice (the DATASLICEID column) for views:

java.lang.RuntimeException: Error creating external table pipe:org.netezza.error.NzSQLException: ERROR: Column reference "DATASLICEID" not supported for views

    at com.ibm.spark.netezza.NetezzaDataReader.getNextLine(NetezzaDataReader.scala:149)
    at com.ibm.spark.netezza.NetezzaDataReader.hasNext(NetezzaDataReader.scala:124)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.getNext(NetezzaRDD.scala:76)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.hasNext(NetezzaRDD.scala:106)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:262)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Maybe we could handle views differently and allow only one worker for them: in my experience, it is faster to fetch a whole view at once through an external table than to retrieve it by "classical" means...

And in the future, we could even imagine letting the user provide a "dispatch strategy", i.e. define a custom query builder on a per-view basis, and thus make parallel retrieval of views possible...
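To make that concrete, a hypothetical single-worker read of a view could look like the sketch below. The view name and connection settings are placeholders, and I am assuming the connector would honor "numPartitions" -> "1" for a view instead of failing on DATASLICEID:

    // Hypothetical sketch: read a whole view through one worker.
    // MY_VIEW and the connection settings below are placeholders.
    val opts = Map(
      "url" -> "jdbc:netezza://host:5480/mydb",
      "user" -> "user",
      "password" -> "secret",
      "dbtable" -> "MY_VIEW",   // a view rather than a table
      "numPartitions" -> "1")   // single worker, as proposed above
    val viewDf = sqlContext.read.format("com.ibm.spark.netezza").options(opts).load()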

What do you think?

Highbrainer commented 7 years ago

One clarification: I see that the current code (in the master branch) supports the new "query" parameter as an alternative to "dbtable". This answers the first part of my question; what remains is: how can I define/customize a partitioning strategy?

Also: in which Maven repo can I find a version more recent than 0.1.1?

sureshthalamati commented 7 years ago

Thank you for trying it out. As you have noticed, query support is not released yet. You can find the snapshot jar in the following repository: https://oss.sonatype.org/content/repositories/snapshots/com/ibm/SparkTC/ (direct link: https://oss.sonatype.org/content/repositories/snapshots/com/ibm/SparkTC/spark-netezza_2.11/1.0.0-SNAPSHOT/spark-netezza_2.11-1.0.0-SNAPSHOT.jar)
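For example, with sbt you could pick up the snapshot build like this (a sketch; the coordinates are taken from the URL above, and %% resolves to the _2.11 suffix when scalaVersion is set to 2.11):

    // build.sbt: add the Sonatype snapshots repository and the snapshot dependency.
    resolvers += "Sonatype OSS Snapshots" at
      "https://oss.sonatype.org/content/repositories/snapshots"

    libraryDependencies += "com.ibm.SparkTC" %% "spark-netezza" % "1.0.0-SNAPSHOT"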

You can specify an integer column to define the partitioning strategy, e.g.:

    val opts = defaultOpts +
      ("query" -> s"select * from $tabName") +
      ("partitioncol" -> "ID") +
      ("numPartitions" -> Integer.toString(4)) +
      ("lowerbound" -> "1") +
      ("upperbound" -> "100")

    val testDf = sqlContext.read.format("com.ibm.spark.netezza").options(opts).load()
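Here partitioncol should be an integral column; the connector splits the [lowerbound, upperbound] range into numPartitions sub-ranges and issues one sub-query per Spark partition, similar to the partitioning options of Spark's built-in JDBC source.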

I should also probably throw a better error message for views.