CODAIT / spark-netezza

Netezza Connector for Apache Spark
Apache License 2.0

Performance and Spark 2.x Compatibility #14

Closed robbyki closed 4 years ago

robbyki commented 7 years ago

Has this been tested with Spark 2.x at all? Also, are there any benchmarks in terms of performance for extracting data from Netezza?

sureshthalamati commented 6 years ago

Hi Robby, sorry for the delayed reply; I probably missed the notifications for this issue. I am not sure if I tested on 2.x, but the Data Source V1 API still works in the 2.x versions, so this package should work on 2.x too.

There are no published benchmarks available. Because this connector uses Netezza's optimized external-table mechanism, it should perform better. That said, the built-in JDBC source converts directly from JDBC to Spark's internal UnsafeRow, whereas this library has to convert through Row so that it does not depend on Spark internals. Only in the current master is there a newly proposed API that allows external data source libraries to use UnsafeRow.

You may want to look at the built-in JDBC source as well, considering it has improved a lot. One issue you may run into is data-type mismatches; if that happens, you can override the JdbcDialect. You can find additional information on overriding the dialect in the following blog: https://developer.ibm.com/code/2017/11/29/customize-spark-jdbc-data-source-work-dedicated-database-dialect/
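For reference, here is a minimal sketch of what such a custom dialect could look like; the URL prefix check and the VARCHAR(255) mapping are just illustrative assumptions, not the connector's actual code:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical Netezza dialect: maps StringType to VARCHAR instead of
// the default TEXT, which Netezza rejects.
object NetezzaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:netezza")

  // Spark type -> database column type used when Spark creates the table.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(255)", java.sql.Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
    case _           => None // fall back to Spark's default mapping
  }
}

// Register the dialect once, before reading or writing through JDBC.
JdbcDialects.registerDialect(NetezzaDialect)
```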

If you are on 2.2, for writing there is the createTableColumnTypes option to specify the column types: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
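A rough sketch of how that option might be used (the connection URL, credentials, input path, and column names below are placeholders, not tested values):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("netezza-write").getOrCreate()
val df = spark.read.parquet("/path/to/input")  // placeholder source DataFrame

val props = new Properties()
props.setProperty("user", "admin")             // placeholder credentials
props.setProperty("password", "secret")

// createTableColumnTypes controls the DDL column types Spark uses
// when it creates the target table on write (Spark 2.2+).
df.write
  .option("createTableColumnTypes", "name VARCHAR(128), comments VARCHAR(1024)")
  .jdbc("jdbc:netezza://host:5480/mydb", "my_table", props)
```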

Hope that helps -suresh

robbyki commented 6 years ago

Thanks @sureshthalamati. I'm using Spark 2.1 and have tried and really like createTableColumnTypes, but I just don't have that version available yet. This also confirms a lot of what I was thinking about the built-in JDBC source, but the main issue I am running into, as you correctly state, is with the data types. In particular, to get around the invalid TEXT type error that the built-in source throws, I have to register a custom dialect; however, that prevents me from having a custom schema in Netezza that I can use and maintain. In other words, we need VARCHARs with different lengths, or even NVARCHARs, but my current dialect implementation just overwrites the schema with whatever VARCHAR length is defined in the dialect. Basically, I want to be able to create schemas in Netezza and just maintain those tables by using the truncate flag in Spark, without clobbering my schemas with the dialect. Am I missing something?

sureshthalamati commented 6 years ago

The problem with the dialect, as you noticed, is that the mapping will be the same for all columns. The only workaround I can think of is to create the table explicitly in Netezza and then save/write into it. There is also the JDBC truncate option if you need to empty the table before saving; that typically keeps the table as you created it.
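A minimal sketch of that workaround, assuming the Netezza table was already created by hand (URL, credentials, and input path are placeholders): with SaveMode.Overwrite plus truncate=true, Spark empties the existing table instead of dropping and recreating it, so the hand-crafted schema is preserved.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("netezza-truncate-write").getOrCreate()
val df = spark.read.parquet("/path/to/input")   // placeholder source DataFrame

val props = new Properties()
props.setProperty("user", "admin")              // placeholder credentials
props.setProperty("password", "secret")

df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")                   // TRUNCATE instead of DROP + CREATE
  .jdbc("jdbc:netezza://host:5480/mydb", "my_table", props)
```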