azkaban dev details - GitHub Issues

jobtype

spark_to_mysql_and_oracle_full
spark_to_mysql_and_oracle_incremental
spark_to_mysql_full
spark_to_mysql_full_no_partition
spark_to_oracle_full_no_partition
spark_to_mysql_incremental
spark_to_oracle_full
spark_to_oracle_incremental
spark_to_spark
spark_to_oracle_full_all_partition
spark_to_mysql_full_all_partition
oracle_to_spark_full
oracle_to_spark_incremental
mysql_to_spark_full
mysql_to_spark_incremental

Some points to note:

1. By default, data is imported into the kettle schema in Oracle. To change the Oracle schema and password, add KETTLE_USER=\${KETTLE_DW_USER} and KETTLE_PASSWORD=\${KETTLE_DW_PASSWORD} to the job file.

2. Every job can be re-run, so the target data in Oracle is truncated before data is exported to Oracle. Add a create_timestamp column to the Oracle and MySQL tables to record when the data was imported.

3. For incremental jobs, the exported Spark table partition value is dw_audit_cre_date=${YESTERDAY}; for full jobs, it is dw_audit_cre_date=${CURRENT_DATE}. If you need something different, set PARTITION_INCREMENTAL_VALUE. For example, to have an incremental job export today's partition, set PARTITION_INCREMENTAL_VALUE=\$\{CURRENT_DATE\}; to export the data from three days ago, set PARTITION_INCREMENTAL_VALUE=\$\{LAST_THREE_DAYS\}.

4. By default, the partition column of the Spark table a job exports is dw_audit_cre_date. To change it, set PARTITION_COLUMN=XXX in the job file.

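5. In Spark, the date type carries no hours, minutes, or seconds, whereas the Oracle date type must include them. A Spark date therefore cannot be exported directly to an Oracle date; use Spark's datetimestamp instead.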
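The partition defaults in notes 3 and 4 can be sketched as a small resolver. This is only an illustration, not the actual jobtype implementation: the function names are hypothetical, the ISO date format is an assumption, and it assumes PARTITION_INCREMENTAL_VALUE affects incremental jobs only. It also ignores the no_partition and all_partition variants.

```python
from datetime import date, timedelta


def resolve_macros(today):
    # Assumed meanings and format of the date macros; the real scheduler
    # may define or format them differently.
    return {
        "CURRENT_DATE": today.isoformat(),
        "YESTERDAY": (today - timedelta(days=1)).isoformat(),
        "LAST_THREE_DAYS": (today - timedelta(days=3)).isoformat(),
    }


def partition_spec(jobtype, today, props=None):
    """Return 'column=value' for the Spark partition a job would export."""
    props = props or {}
    macros = resolve_macros(today)
    # Note 4: PARTITION_COLUMN overrides the default partition column.
    column = props.get("PARTITION_COLUMN", "dw_audit_cre_date")
    if "incremental" in jobtype:
        # Note 3: incremental jobs default to yesterday unless overridden.
        value = props.get("PARTITION_INCREMENTAL_VALUE", "${YESTERDAY}")
    else:
        # Note 3: full jobs export the current date's partition.
        value = "${CURRENT_DATE}"
    # Accept both ${NAME} and the escaped \$\{NAME\} spelling.
    name = value.replace("\\", "").strip("${}")
    # Unknown names are passed through unchanged as literal values.
    return f"{column}={macros.get(name, value)}"
```

For example, `partition_spec("spark_to_oracle_incremental", date(2024, 3, 10))` resolves to the partition for 9 March 2024 under these assumptions, while passing `{"PARTITION_INCREMENTAL_VALUE": "${LAST_THREE_DAYS}"}` moves it back to 7 March.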

oracle_to_spark_full, oracle_to_spark_incremental: these jobtypes need additional parameters in the job file. id: the primary key of the Oracle table. spark-script:

For oracle_to_spark_full, spark-script=/home/ubuntu/etl/scala/import_full.sc

For oracle_to_spark_incremental, spark-script=/home/ubuntu/etl/scala/import_incremental.sc
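A minimal sketch of how the extra lines for an Oracle-to-Spark import job could be generated. The two script paths are the ones listed above. The helper name, the use of Azkaban's usual `type` key to select the jobtype, and the sample primary key ORDER_ID are assumptions made for illustration. Any other parameters the job may need are not covered here.

```python
# Script paths as given in the notes; the mapping itself is illustrative.
SPARK_SCRIPTS = {
    "oracle_to_spark_full": "/home/ubuntu/etl/scala/import_full.sc",
    "oracle_to_spark_incremental": "/home/ubuntu/etl/scala/import_incremental.sc",
}


def import_job_lines(jobtype, primary_key):
    """Build the extra job-file lines for an Oracle-to-Spark import job."""
    if jobtype not in SPARK_SCRIPTS:
        raise ValueError(f"not an oracle_to_spark jobtype: {jobtype}")
    return [
        # Assumption: the job file selects the jobtype through the `type` key.
        f"type={jobtype}",
        # id is the primary key of the Oracle table.
        f"id={primary_key}",
        # spark-script depends on whether the import is full or incremental.
        f"spark-script={SPARK_SCRIPTS[jobtype]}",
    ]


if __name__ == "__main__":
    # ORDER_ID is a hypothetical primary key column.
    print("\n".join(import_job_lines("oracle_to_spark_incremental", "ORDER_ID")))
```

Keeping the jobtype-to-script mapping in one place avoids pairing a full jobtype with the incremental script by mistake.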
