TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Add an option to publish to a JDBC database. #705

Closed AliTajeldin closed 7 years ago

AliTajeldin commented 7 years ago

While Sqoop is great at parallel export of HDFS files to databases, it would be convenient for users to have an SMV-level --publish-jdbc option or something similar.

Note: we would need a config option that specifies the flavor of the target DB, since DDL is not really standardized across SQL databases.
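As a sketch of what that flavor config could drive (the property name and helper below are hypothetical, not an existing SMV API), the flavor would at minimum have to select the JDBC driver class:

```scala
// Hypothetical helper: map a user-configured DB flavor (e.g. from a
// `smv.jdbc.flavor` property) to the JDBC driver class to load.
// Neither the property nor this helper exists in SMV today.
def driverForFlavor(flavor: String): String = flavor.toLowerCase match {
  case "mysql"    => "com.mysql.jdbc.Driver"
  case "postgres" => "org.postgresql.Driver"
  case "derby"    => "org.apache.derby.jdbc.ClientDriver"
  case "hive"     => "org.apache.hive.jdbc.HiveDriver"
  case other      => throw new IllegalArgumentException(s"unknown DB flavor: $other")
}
```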

AliTajeldin commented 7 years ago

Increasing priority, as this also seems to be the only way to change Hive credentials (see #704).

laneb commented 7 years ago

Branch i705_jdbc now has support for:

  1. Reading from a JDBC connection with SmvJdbcTable (see the read sketch after this list)
  2. Exporting through a JDBC connection with --publish-jdbc

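For the read side, a minimal sketch of what this looks like through Spark's JDBC data source (URL, table, and credentials are placeholders; the SparkSession setup is shown only to keep the snippet self-contained):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

// Read a table over a JDBC connection via Spark's built-in JDBC source.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .load()
```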
For both uses, I have delegated to Spark's built-in JDBC support (DataFrameReader for reads, DataFrameWriter for writes). Based on discussion with @AliTajeldin, when writing through JDBC I have set DataFrameWriter's save mode to SaveMode.Append, which means that DataFrames should be inserted into existing tables instead of overwriting them. However, this behavior doesn't seem entirely consistent: I was successful inserting into existing MySQL tables over JDBC, but when I tried the same with Derby tables, the failure implied that JDBC was trying to create the table. As the use case inspiring this feature is writing to Hive over JDBC, we should verify that insertion works as expected for Hive before I make a PR.
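The write side is a thin layer over DataFrameWriter; a minimal sketch with placeholder connection details follows. (As I understand Spark's JDBC source, append mode still issues a CREATE TABLE when the target table doesn't exist, which may be what the Derby run was hitting.)

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Sketch of the append-mode JDBC write described above.
// `df` is the DataFrame being published; credentials are placeholders.
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

df.write
  .mode(SaveMode.Append) // insert rows rather than overwrite the table
  .jdbc("jdbc:mysql://localhost:3306/mydb", "my_table", props)
```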

laneb commented 7 years ago

Another feature we should consider adding is the option of a custom, user-specified query, similar to SmvHiveTable. I will investigate how we might accomplish this on Monday.
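For reference, Spark's JDBC source already accepts a parenthesized subquery (aliased as a table) in the dbtable option, so a user-specified query could be a thin layer over that. A sketch with a placeholder query, assuming a SparkSession `spark` is in scope:

```scala
// Sketch: pass a user-specified query to Spark's JDBC source as a
// parenthesized subquery in `dbtable`. The query text is a placeholder.
val userQuery = "(SELECT id, amount FROM orders WHERE amount > 100) AS q"

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", userQuery)
  .load()
```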