Open haitham-eltaweel opened 1 year ago
@haitham-eltaweel What format are the MODIFID_DT values
stored in on the Oracle side, and what is the column's data type?
It is of type TIMESTAMP. The values have the format 'YYYY-MM-DD HH24:MI:SS'.
@haitham-eltaweel I don't think there is a way to set the default date format. We may want to add a config such as hoodie.deltastreamer.jdbc.incr.predicate
to handle such cases, so that a custom predicate can be supplied when the source DB doesn't support the default one. I created a JIRA for this:
https://issues.apache.org/jira/browse/HUDI-6727
Feel free to contribute if you would like to.
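To make the custom-predicate idea concrete, here is a minimal Python sketch of the kind of predicate such a config could carry for Oracle. It assumes the failure comes from the incremental filter comparing MODIFID_DT with a plain string literal, so that Oracle converts it using the session's NLS settings; that is an assumption, not something confirmed in this thread. The config hoodie.deltastreamer.jdbc.incr.predicate is only a proposal at this point, and the helper names below are hypothetical and not part of Hudi.

```python
from datetime import datetime

# Oracle format mask matching the values reported in this thread.
ORACLE_TS_FORMAT = "YYYY-MM-DD HH24:MI:SS"


def build_incremental_predicate(column: str, checkpoint: datetime) -> str:
    """Build a filter that converts the checkpoint explicitly.

    Hypothetical helper: wrapping the literal in TO_TIMESTAMP with an
    explicit format mask avoids relying on the session's NLS settings.
    """
    literal = checkpoint.strftime("%Y-%m-%d %H:%M:%S")
    return f"{column} > TO_TIMESTAMP('{literal}', '{ORACLE_TS_FORMAT}')"


def build_incremental_query(table: str, column: str, checkpoint: datetime) -> str:
    """Assemble the full incremental query around the predicate (illustrative)."""
    predicate = build_incremental_predicate(column, checkpoint)
    return f"SELECT * FROM {table} WHERE {predicate}"


if __name__ == "__main__":
    last_checkpoint = datetime(2023, 8, 1, 13, 5, 9)
    print(build_incremental_query("schema_name.table_name", "MODIFID_DT", last_checkpoint))
```

The point of the sketch is only that the format mask travels with the literal, so the query should behave the same regardless of the client's NLS_DATE_FORMAT or NLS_TIMESTAMP_FORMAT.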
When running HoodieDeltaStreamer to pull new data incrementally from an Oracle DB to AWS S3, we get the following error: ORA-01843: not a valid month
We didn't find a way to change the default NLS_DATE_FORMAT to 'YYYY-MM-DD HH24:MI:SS' on the client side, either within the JDBC connection or through the Spark configuration.
To Reproduce
Run the following spark-submit command (I replaced some configuration values with placeholders):
spark-submit --master yarn \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --num-executors 15 --executor-cores 5 --executor-memory 30g --driver-memory 15g \
  --name job_name \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --queue q_name --deploy-mode cluster \
  /home/hadoop/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar \
  --table-type MERGE_ON_READ \
  --target-base-path s3a://bucket-name/path \
  --target-table target_table_name \
  --enable-sync --enable-hive-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --source-class org.apache.hudi.utilities.sources.JdbcSource \
  --source-ordering-field MODIFID_DT \
  --op UPSERT \
  --hoodie-conf hoodie.deltastreamer.jdbc.password=pass_value \
  --hoodie-conf hoodie.deltastreamer.jdbc.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=host_name.com)(PORT=1521))(LOAD_BALANCE=YES)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=SER_NM))) \
  --hoodie-conf hoodie.deltastreamer.jdbc.user=user_name \
  --hoodie-conf hoodie.deltastreamer.jdbc.table.name=schema_name.table_name \
  --hoodie-conf hoodie.deltastreamer.jdbc.incr.pull=true \
  --hoodie-conf hoodie.datasource.write.recordkey.field=pk_id \
  --hoodie-conf hoodie.datasource.write.precombine.field=MODIFID_DT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=MODIFID_DT \
  --hoodie-conf hoodie.deltastreamer.jdbc.driver.class=oracle.jdbc.driver.OracleDriver \
  --hoodie-conf hoodie.deltastreamer.jdbc.table.incr.column.name=MODIFID_DT \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
Note: MODIFID_DT is of type TIMESTAMP in the DB table.
Expected behavior
Data should be pulled from the source table to the destination without any date format error.
Environment Description
Amazon EMR version : emr-6.7.0, installed apps : Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1 , Hudi : 0.11
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : yes
Stacktrace
Caused by: java.sql.SQLDataException: ORA-01843: not a valid month