
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: <hivedb.tableName> table not found #954

Closed (gfn9cho closed this issue 4 years ago)

gfn9cho commented 4 years ago

I am using hudi-spark-bundle-0.5.1-SNAPSHOT.jar on EMR and getting the exception below during hiveSync. We are using the AWS Glue catalog as the Hive metastore. The Hive table is getting created; I can see the table in Hive, but it has no data in it.

> org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table <tableName>
>   at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:172)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:107)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:67)
>   at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   ... 69 elided
> Caused by: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: <hiveDB>.<tableName> table not found
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result.read(ThriftHiveMetastore.java)
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions(ThriftHiveMetastore.java:2377)
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$Client.get_partitions(ThriftHiveMetastore.java:2362)
>   at org.apache.hudi.org.apache.hadoop_hive.metastore.HiveMetaStoreClient.listPartitions(HiveMetaStoreClient.java:1162)
>   at org.apache.hudi.hive.HoodieHiveClient.scanTablePartitions(HoodieHiveClient.java:240)
>   at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:162)
>   ... 95 more
> 

Below is the code:

spark-shell --master yarn --deploy-mode client  --conf spark.shuffle.spill=true \
 --conf spark.scheduler.mode=FIFO \
 --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=1024m \
 --conf spark.sql.planner.externalSort=true --conf spark.shuffle.manager=sort \
 --conf spark.ui.port=8088 --conf spark.executor.memoryOverhead=2g  \
 --conf spark.rpc.message.maxSize=1024 --conf spark.file.transferTo=false \
 --conf spark.driver.maxResultSize=3g --conf spark.rdd.compress=true \
 --conf spark.executor.extraJavaOptions="-Dconfig.resource=spark-defaults.conf" \
 --conf spark.driver.JavaOptions="-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop" \
 --conf spark.driver.extraJavaOptions="-Dconfig.file=spark-defaults.conf" \
 --conf spark.sql.parquet.writeLegacyFormat=true \
 --conf spark.enable.dynamicAllocation=true \
 --conf spark.dynamicAllocation.maxExecutors=10 \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.executor.cores=5 \
 --conf spark.executor.memory=3g --conf spark.driver.memory=2g  \
 --conf spark.executor.instances=4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer  \
 --name gwpl_staging_load_hudi \
 --files /etc/spark/conf/hive-site.xml \
 --properties-file /usr/lib/spark/conf/spark-defaults.conf \
 --jars /home/hadoop/hudi/hudi-spark-bundle-0.5.1-SNAPSHOT.jar 

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.joda.time.format.DateTimeFormat

val stagePrefix="stg_gwpl"
val harmonizedStageDB="<hiveDB>"
val harmonizedstagePath="s3://****/**"
val table="pc_policy"

val incrementalData=spark.sql("select * from <hivetable> limit 100").cache

incrementalData.write.
format("org.apache.hudi").
option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"ID").
option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "ingestiondt").
option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "UpdateTime").
option(HoodieWriteConfig.TABLE_NAME, stagePrefix + "_hudi_" + table).
option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000").
option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "hive").
option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "hive").
option("hoodie.datasource.hive_sync.enable", true).
option("hoodie.datasource.hive_sync.database",harmonizedStageDB).
option("hoodie.datasource.hive_sync.table",stagePrefix + "_hudi_" + table).
option("hoodie.datasource.hive_sync.partition_fields","ingestiondt").
mode(SaveMode.Overwrite).
save(s"${harmonizedstagePath}/hudi/$table")

Please let me know if I can provide more details.
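
For reference, a quick sanity check from the same spark-shell session; this is only a sketch, and hiveTable below is just the placeholder table name built the same way as in the write options above:

// Sketch: confirm whether the Hive-synced table and its partitions are
// visible to this Spark session's catalog after the write.
val hiveTable = stagePrefix + "_hudi_" + table
println(spark.catalog.tableExists(harmonizedStageDB, hiveTable))
spark.sql(s"SHOW PARTITIONS ${harmonizedStageDB}.${hiveTable}").show(false)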

vinothchandar commented 4 years ago

@gfn9cho Is this the first write? It looks like it cannot find the registered table. Could you share the entire log, with the other statements as well?

gfn9cho commented 4 years ago

Yes, this is the first write. It's creating the Hoodie table and I can see the data in S3. When it comes to Hive, it creates the table but fails to sync the data with the above error. That was actually the entire log in the spark shell. Below is the one item I missed:

19/10/14 01:30:23 WARN HiveConf: HiveConf of name hive.metastore.client.factory.class does not exist

I could run spark.sql("select * from <tableName>") successfully, but with no data in it.
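
To narrow down where the lookup fails, here is a sketch that makes the same metastore calls as the failing path in the stack trace (HoodieHiveClient.scanTablePartitions -> listPartitions), but through the stock Hive client on the cluster classpath rather than Hudi's shaded copy; the database and table names are the same placeholders used above:

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

// Sketch: ask whichever metastore the cluster's hive-site.xml points at
// whether it can see the synced table and list its partitions.
val metastore = new HiveMetaStoreClient(new HiveConf())
println(metastore.tableExists("<hiveDB>", "<tableName>"))
println(metastore.listPartitions("<hiveDB>", "<tableName>", 10.toShort))
metastore.close()

If tableExists returns false here while the table shows up in the Glue console, the table is only registered in Glue and is not visible through the metastore API this client resolves to.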

vinothchandar commented 4 years ago

There are definitely other lines that are not getting printed. Have you tried doing sc.setLogLevel("INFO") to also get the INFO statements? That usually prints the exact SQL being run, and we can spot something.
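
A minimal sketch of doing that in the same spark-shell before re-running the write (sc is the SparkContext the shell already provides):

// Sketch: raise log verbosity so the HiveSyncTool / HoodieHiveClient
// statements (including the DDL it issues) show up in the console.
sc.setLogLevel("INFO")
// ...then re-run the incrementalData.write block above and look for the
// HiveSyncTool lines in the output.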

gfn9cho commented 4 years ago

Thanks Vinoth. It looks like the HiveSync tool executes the create DDL every time, since it is not able to find the table in the metastore. Here is the trimmed log from the point where it finished writing to the Hoodie table.

19/10/15 03:05:28 INFO TaskSetManager: Finished task 1485.0 in stage 25.0 (TID 12144) in 16 ms on ip-10-63-115-75.corp.stateauto.com (executor 1) (1498/1500)
19/10/15 03:05:28 INFO TaskSetManager: Finished task 1499.0 in stage 25.0 (TID 12147) in 9 ms on ip-10-63-115-75.corp.stateauto.com (executor 1) (1499/1500)
19/10/15 03:05:28 INFO TaskSetManager: Finished task 1487.0 in stage 25.0 (TID 12146) in 10 ms on ip-10-63-115-75.corp.stateauto.com (executor 1) (1500/1500)
19/10/15 03:05:28 INFO YarnScheduler: Removed TaskSet 25.0, whose tasks have all completed, from pool 
19/10/15 03:05:28 INFO DAGScheduler: ShuffleMapStage 25 (mapToPair at HoodieWriteClient.java:461) finished in 1.448 s
19/10/15 03:05:28 INFO DAGScheduler: looking for newly runnable stages
19/10/15 03:05:28 INFO DAGScheduler: running: Set()
19/10/15 03:05:28 INFO DAGScheduler: waiting: Set(ResultStage 26)
19/10/15 03:05:28 INFO DAGScheduler: failed: Set()
19/10/15 03:05:28 INFO DAGScheduler: Submitting ResultStage 26 (MapPartitionsRDD[55] at filter at HoodieSparkSqlWriter.scala:145), which has no missing parents
19/10/15 03:05:28 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 156.3 KB, free 911.2 MB)
19/10/15 03:05:28 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 58.4 KB, free 911.1 MB)
19/10/15 03:05:28 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on ip-10-63-114-58.corp.stateauto.com:43403 (size: 58.4 KB, free: 912.1 MB)
19/10/15 03:05:28 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1201
19/10/15 03:05:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 26 (MapPartitionsRDD[55] at filter at HoodieSparkSqlWriter.scala:145) (first 15 tasks are for partitions Vector(0))
19/10/15 03:05:28 INFO YarnScheduler: Adding task set 26.0 with 1 tasks
19/10/15 03:05:28 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 12148, ip-10-63-114-115.corp.stateauto.com, executor 2, partition 0, PROCESS_LOCAL, 7674 bytes)
19/10/15 03:05:28 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on ip-10-63-114-115.corp.stateauto.com:36209 (size: 58.4 KB, free: 1458.3 MB)
19/10/15 03:05:28 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 8 to 10.63.114.115:42194
19/10/15 03:05:31 INFO BlockManagerInfo: Added rdd_54_0 in memory on ip-10-63-114-115.corp.stateauto.com:36209 (size: 300.0 B, free: 1458.3 MB)
19/10/15 03:05:31 INFO TaskSetManager: Finished task 0.0 in stage 26.0 (TID 12148) in 2940 ms on ip-10-63-114-115.corp.stateauto.com (executor 2) (1/1)
19/10/15 03:05:31 INFO YarnScheduler: Removed TaskSet 26.0, whose tasks have all completed, from pool 
19/10/15 03:05:31 INFO DAGScheduler: ResultStage 26 (count at HoodieSparkSqlWriter.scala:145) finished in 2.957 s
19/10/15 03:05:31 INFO DAGScheduler: Job 7 finished: count at HoodieSparkSqlWriter.scala:145, took 4.414884 s
19/10/15 03:05:31 INFO HoodieSparkSqlWriter$: No errors. Proceeding to commit the write.
19/10/15 03:05:31 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:31 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:31 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO HoodieWriteClient: Commiting 20191015030518
19/10/15 03:05:31 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:31 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:31 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@4145ffec
19/10/15 03:05:31 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
19/10/15 03:05:31 INFO FileSystemViewManager: Creating in-memory based Table View
19/10/15 03:05:31 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:31 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:31 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@9d5efdd
19/10/15 03:05:32 INFO SparkContext: Starting job: collect at HoodieWriteClient.java:492
19/10/15 03:05:32 INFO DAGScheduler: Got job 8 (collect at HoodieWriteClient.java:492) with 1 output partitions
19/10/15 03:05:32 INFO DAGScheduler: Final stage: ResultStage 33 (collect at HoodieWriteClient.java:492)
19/10/15 03:05:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 32)
19/10/15 03:05:32 INFO DAGScheduler: Missing parents: List()
19/10/15 03:05:32 INFO DAGScheduler: Submitting ResultStage 33 (MapPartitionsRDD[56] at map at HoodieWriteClient.java:492), which has no missing parents
19/10/15 03:05:32 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 156.5 KB, free 911.4 MB)
19/10/15 03:05:32 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 58.5 KB, free 911.4 MB)
19/10/15 03:05:32 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on ip-10-63-114-58.corp.stateauto.com:43403 (size: 58.5 KB, free: 912.1 MB)
19/10/15 03:05:32 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:1201
19/10/15 03:05:32 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 33 (MapPartitionsRDD[56] at map at HoodieWriteClient.java:492) (first 15 tasks are for partitions Vector(0))
19/10/15 03:05:32 INFO YarnScheduler: Adding task set 33.0 with 1 tasks
19/10/15 03:05:32 INFO TaskSetManager: Starting task 0.0 in stage 33.0 (TID 12149, ip-10-63-114-115.corp.stateauto.com, executor 2, partition 0, PROCESS_LOCAL, 7674 bytes)
19/10/15 03:05:32 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on ip-10-63-114-115.corp.stateauto.com:36209 (size: 58.5 KB, free: 1458.4 MB)
19/10/15 03:05:32 INFO TaskSetManager: Finished task 0.0 in stage 33.0 (TID 12149) in 67 ms on ip-10-63-114-115.corp.stateauto.com (executor 2) (1/1)
19/10/15 03:05:32 INFO YarnScheduler: Removed TaskSet 33.0, whose tasks have all completed, from pool 
19/10/15 03:05:32 INFO DAGScheduler: ResultStage 33 (collect at HoodieWriteClient.java:492) finished in 0.086 s
19/10/15 03:05:32 INFO DAGScheduler: Job 8 finished: collect at HoodieWriteClient.java:492, took 0.089019 s
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:32 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:32 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:32 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@28576562
19/10/15 03:05:32 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
19/10/15 03:05:32 INFO FileSystemViewManager: Creating in-memory based Table View
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:32 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:32 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:32 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@1c7740da
19/10/15 03:05:32 INFO HoodieTable: Removing marker directory=s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/.temp/20191015030518
19/10/15 03:05:32 INFO HoodieActiveTimeline: Marking instant complete [==>20191015030518__commit__INFLIGHT]
19/10/15 03:05:32 INFO MultipartUploadOutputStream: close closed:false s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.inflight
19/10/15 03:05:32 INFO S3NativeFileSystem: rename s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.inflight s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.commit
19/10/15 03:05:32 INFO HoodieActiveTimeline: Completed [==>20191015030518__commit__INFLIGHT]
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:32 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:32 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:32 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@24f2a2b6
19/10/15 03:05:32 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
19/10/15 03:05:32 INFO FileSystemViewManager: Creating in-memory based Table View
19/10/15 03:05:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:32 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:33 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:33 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:33 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@2005e23d
19/10/15 03:05:33 INFO HoodieCommitArchiveLog: No Instants to archive
19/10/15 03:05:33 INFO HoodieWriteClient: Auto cleaning is enabled. Running cleaner now
19/10/15 03:05:33 INFO HoodieWriteClient: Cleaner started
19/10/15 03:05:33 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:33 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:33 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:33 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@2d6fce1c
19/10/15 03:05:33 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
19/10/15 03:05:33 INFO FileSystemViewManager: Creating in-memory based Table View
19/10/15 03:05:33 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:33 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:33 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:33 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:33 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@35b68b7a
19/10/15 03:05:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2018-07-01], with policy KEEP_LATEST_COMMITS
19/10/15 03:05:33 INFO HoodieCopyOnWriteTable: Using cleanerParallelism: 1
19/10/15 03:05:33 INFO SparkContext: Starting job: collect at HoodieCopyOnWriteTable.java:396
19/10/15 03:05:33 INFO DAGScheduler: Registering RDD 59 (repartition at HoodieCopyOnWriteTable.java:392)
19/10/15 03:05:33 INFO DAGScheduler: Registering RDD 63 (mapPartitionsToPair at HoodieCopyOnWriteTable.java:393)
19/10/15 03:05:33 INFO DAGScheduler: Got job 9 (collect at HoodieCopyOnWriteTable.java:396) with 1 output partitions
19/10/15 03:05:33 INFO DAGScheduler: Final stage: ResultStage 36 (collect at HoodieCopyOnWriteTable.java:396)
19/10/15 03:05:33 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 35)
19/10/15 03:05:33 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 35)
19/10/15 03:05:33 INFO DAGScheduler: Submitting ShuffleMapStage 34 (MapPartitionsRDD[59] at repartition at HoodieCopyOnWriteTable.java:392), which has no missing parents
19/10/15 03:05:33 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 154.0 KB, free 911.2 MB)
19/10/15 03:05:33 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 57.3 KB, free 911.1 MB)
19/10/15 03:05:33 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on ip-10-63-114-58.corp.stateauto.com:43403 (size: 57.3 KB, free: 912.1 MB)
19/10/15 03:05:33 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1201
19/10/15 03:05:33 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 34 (MapPartitionsRDD[59] at repartition at HoodieCopyOnWriteTable.java:392) (first 15 tasks are for partitions Vector(0))
19/10/15 03:05:33 INFO YarnScheduler: Adding task set 34.0 with 1 tasks
19/10/15 03:05:33 INFO TaskSetManager: Starting task 0.0 in stage 34.0 (TID 12150, ip-10-63-114-114.corp.stateauto.com, executor 4, partition 0, PROCESS_LOCAL, 7734 bytes)
19/10/15 03:05:33 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on ip-10-63-114-114.corp.stateauto.com:42747 (size: 57.3 KB, free: 1458.5 MB)
19/10/15 03:05:35 INFO TaskSetManager: Finished task 0.0 in stage 34.0 (TID 12150) in 2233 ms on ip-10-63-114-114.corp.stateauto.com (executor 4) (1/1)
19/10/15 03:05:35 INFO YarnScheduler: Removed TaskSet 34.0, whose tasks have all completed, from pool 
19/10/15 03:05:35 INFO DAGScheduler: ShuffleMapStage 34 (repartition at HoodieCopyOnWriteTable.java:392) finished in 2.252 s
19/10/15 03:05:35 INFO DAGScheduler: looking for newly runnable stages
19/10/15 03:05:35 INFO DAGScheduler: running: Set()
19/10/15 03:05:35 INFO DAGScheduler: waiting: Set(ShuffleMapStage 35, ResultStage 36)
19/10/15 03:05:35 INFO DAGScheduler: failed: Set()
19/10/15 03:05:35 INFO DAGScheduler: Submitting ShuffleMapStage 35 (MapPartitionsRDD[63] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393), which has no missing parents
19/10/15 03:05:35 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 154.6 KB, free 911.0 MB)
19/10/15 03:05:35 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 57.4 KB, free 910.9 MB)
19/10/15 03:05:35 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on ip-10-63-114-58.corp.stateauto.com:43403 (size: 57.4 KB, free: 912.0 MB)
19/10/15 03:05:35 INFO SparkContext: Created broadcast 18 from broadcast at DAGScheduler.scala:1201
19/10/15 03:05:35 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 35 (MapPartitionsRDD[63] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393) (first 15 tasks are for partitions Vector(0))
19/10/15 03:05:35 INFO YarnScheduler: Adding task set 35.0 with 1 tasks
19/10/15 03:05:35 INFO TaskSetManager: Starting task 0.0 in stage 35.0 (TID 12151, ip-10-63-114-114.corp.stateauto.com, executor 4, partition 0, PROCESS_LOCAL, 7939 bytes)
19/10/15 03:05:35 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on ip-10-63-114-114.corp.stateauto.com:42747 (size: 57.4 KB, free: 1458.5 MB)
19/10/15 03:05:35 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 9 to 10.63.114.114:46592
19/10/15 03:05:35 INFO TaskSetManager: Finished task 0.0 in stage 35.0 (TID 12151) in 95 ms on ip-10-63-114-114.corp.stateauto.com (executor 4) (1/1)
19/10/15 03:05:35 INFO YarnScheduler: Removed TaskSet 35.0, whose tasks have all completed, from pool 
19/10/15 03:05:35 INFO DAGScheduler: ShuffleMapStage 35 (mapPartitionsToPair at HoodieCopyOnWriteTable.java:393) finished in 0.114 s
19/10/15 03:05:35 INFO DAGScheduler: looking for newly runnable stages
19/10/15 03:05:35 INFO DAGScheduler: running: Set()
19/10/15 03:05:35 INFO DAGScheduler: waiting: Set(ResultStage 36)
19/10/15 03:05:35 INFO DAGScheduler: failed: Set()
19/10/15 03:05:35 INFO DAGScheduler: Submitting ResultStage 36 (ShuffledRDD[64] at reduceByKey at HoodieCopyOnWriteTable.java:393), which has no missing parents
19/10/15 03:05:35 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 4.6 KB, free 910.9 MB)
19/10/15 03:05:35 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 2.6 KB, free 910.9 MB)
19/10/15 03:05:35 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on ip-10-63-114-58.corp.stateauto.com:43403 (size: 2.6 KB, free: 912.0 MB)
19/10/15 03:05:35 INFO SparkContext: Created broadcast 19 from broadcast at DAGScheduler.scala:1201
19/10/15 03:05:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 36 (ShuffledRDD[64] at reduceByKey at HoodieCopyOnWriteTable.java:393) (first 15 tasks are for partitions Vector(0))
19/10/15 03:05:35 INFO YarnScheduler: Adding task set 36.0 with 1 tasks
19/10/15 03:05:35 INFO TaskSetManager: Starting task 0.0 in stage 36.0 (TID 12152, ip-10-63-114-115.corp.stateauto.com, executor 2, partition 0, PROCESS_LOCAL, 7674 bytes)
19/10/15 03:05:35 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on ip-10-63-114-115.corp.stateauto.com:36209 (size: 2.6 KB, free: 1458.4 MB)
19/10/15 03:05:35 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.63.114.115:42194
19/10/15 03:05:35 INFO TaskSetManager: Finished task 0.0 in stage 36.0 (TID 12152) in 12 ms on ip-10-63-114-115.corp.stateauto.com (executor 2) (1/1)
19/10/15 03:05:35 INFO YarnScheduler: Removed TaskSet 36.0, whose tasks have all completed, from pool 
19/10/15 03:05:35 INFO DAGScheduler: ResultStage 36 (collect at HoodieCopyOnWriteTable.java:396) finished in 0.018 s
19/10/15 03:05:35 INFO DAGScheduler: Job 9 finished: collect at HoodieCopyOnWriteTable.java:396, took 2.390622 s
19/10/15 03:05:35 INFO FileSystemViewManager: Creating InMemory based view for basePath s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:35 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:35 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:35 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:35 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:35 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@4d705b45
19/10/15 03:05:35 INFO HoodieWriteClient: Cleaned 0 files
19/10/15 03:05:35 INFO HoodieActiveTimeline: Marking instant complete [==>20191015030518__clean__INFLIGHT]
19/10/15 03:05:36 INFO MultipartUploadOutputStream: close closed:false s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.clean.inflight
19/10/15 03:05:36 INFO HoodieActiveTimeline: Created a new file in meta path: s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.clean.inflight
19/10/15 03:05:36 INFO MultipartUploadOutputStream: close closed:false s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.clean.inflight
19/10/15 03:05:36 INFO S3NativeFileSystem: rename s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.clean.inflight s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.clean
19/10/15 03:05:36 INFO HoodieActiveTimeline: Completed [==>20191015030518__clean__INFLIGHT]
19/10/15 03:05:36 INFO HoodieWriteClient: Marked clean started on 20191015030518 as complete
19/10/15 03:05:36 INFO HoodieWriteClient: Committed 20191015030518
19/10/15 03:05:36 INFO HoodieSparkSqlWriter$: Commit 20191015030518 successful!
19/10/15 03:05:36 INFO HoodieSparkSqlWriter$: Syncing to Hive Metastore (URL: jdbc:hive2://ip-10-63-114-58.corp.stateauto.com:10000)
19/10/15 03:05:36 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:36 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
19/10/15 03:05:36 WARN HiveConf: HiveConf of name hive.metastore.client.factory.class does not exist
19/10/15 03:05:36 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:36 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-63-114-58.corp.stateauto.com:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, {full Hadoop configuration property dump trimmed; among the entries: hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory, fs.defaultFS=hdfs://ip-10-63-114-58.corp.stateauto.com:8020, ...}]
yarn.scheduler.maximum-allocation-mb=122880, yarn.resourcemanager.leveldb-state-store.path=${hadoop.tmp.dir}/yarn/system/rmstore, mapreduce.task.files.preserve.failedtasks=false, yarn.nodemanager.delete.thread-count=4, mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec, map.sort.class=org.apache.hadoop.util.QuickSort, yarn.nodemanager.resource.count-logical-processors-as-cores=false, mapreduce.jobhistory.jobname.limit=50, mapreduce.job.classloader=false, hadoop.registry.zk.retry.ceiling.ms=60000, io.seqfile.compress.blocksize=1000000, mapreduce.task.profile.maps=0-2, mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging, yarn.nodemanager.localizer.cache.cleanup.interval-ms=600000, hadoop.proxyuser.hive.hosts=*, hadoop.http.cross-origin.allowed-origins=*, yarn.timeline-service.client.fd-flush-interval-secs=10, hadoop.security.java.secure.random.algorithm=SHA1PRNG, fs.client.resolve.remote.symlinks=true, yarn.resourcemanager.delegation-token-renewer.thread-count=50, mapreduce.shuffle.listen.queue.size=128, yarn.nodemanager.disk-health-checker.min-healthy-disks=0.25, yarn.resourcemanager.nodes.exclude-path=/emr/instance-controller/lib/yarn.nodes.exclude.xml, mapreduce.job.end-notification.retry.interval=1000, mapreduce.jobhistory.loadedjobs.cache.size=5, fs.s3a.fast.upload.active.blocks=4, yarn.nodemanager.local-dirs=/mnt/yarn, mapreduce.task.exit.timeout.check-interval-ms=20000, yarn.timeline-service.webapp.address=${yarn.timeline-service.hostname}:8188, hadoop.registry.jaas.context=Client, mapreduce.jobhistory.address=ip-10-63-114-58.corp.stateauto.com:10020, ipc.server.log.slow.rpc=false, file.blocksize=67108864, yarn.sharedcache.cleaner.period-mins=1440, yarn.timeline-service.entity-group-fs-store.leveldb-cache-read-cache-size=10485760, fs.s3a.block.size=32M, hadoop.security.kms.client.failover.sleep.max.millis=2000, yarn.resourcemanager.metrics.runtime.buckets=60,300,1440, dfs.namenode.http-address=ip-10-63-114-58.corp.stateauto.com:50070, ipc.client.ping=true, yarn.resourcemanager.leveldb-state-store.compaction-interval-secs=3600, yarn.timeline-service.http-cross-origin.enabled=true, yarn.node-labels.am.default-node-label-expression=CORE, yarn.resourcemanager.configuration.provider-class=org.apache.hadoop.yarn.LocalConfigurationProvider, yarn.nodemanager.recovery.enabled=true, yarn.resourcemanager.hostname=10.63.114.58, fs.s3n.multipart.uploads.enabled=true, yarn.nodemanager.disk-health-checker.enable=true, mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem=2, yarn.nodemanager.amrmproxy.interceptor-class.pipeline=org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor, ha.failover-controller.cli-check.rpc-timeout.ms=20000, hadoop.proxyuser.presto.hosts=*, ftp.client-write-packet-size=65536, mapreduce.reduce.shuffle.parallelcopies=20, hadoop.caller.context.signature.max.size=40, mapreduce.jobhistory.principal=jhs/_HOST@REALM.TLD, hadoop.http.authentication.simple.anonymous.allowed=true, yarn.log-aggregation.retain-seconds=172800, yarn.resourcemanager.rm.container-allocation.expiry-interval-ms=600000, yarn.nodemanager.windows-container.cpu-limit.enabled=false, yarn.timeline-service.http-authentication.simple.anonymous.allowed=true, hadoop.security.kms.client.failover.sleep.base.millis=100, mapreduce.jobhistory.jhist.format=json, yarn.resourcemanager.reservation-system.planfollower.time-step=1000, mapreduce.job.ubertask.maxreduces=1, fs.s3a.connection.establish.timeout=5000, 
yarn.nodemanager.health-checker.interval-ms=600000, fs.s3a.multipart.purge=false, hadoop.security.kms.client.encrypted.key.cache.num.refill.threads=2, fs.AbstractFileSystem.adl.impl=org.apache.hadoop.fs.adl.Adl, yarn.timeline-service.store-class=org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore, mapreduce.shuffle.transfer.buffer.size=131072, yarn.resourcemanager.zk-num-retries=1000, yarn.sharedcache.store.in-memory.staleness-period-mins=10080, yarn.nodemanager.webapp.address=${yarn.nodemanager.hostname}:8042, yarn.app.mapreduce.client-am.ipc.max-retries=3, ipc.ping.interval=60000, ha.failover-controller.new-active.rpc-timeout.ms=60000, mapreduce.jobhistory.client.thread-count=10, fs.trash.interval=0, mapreduce.fileoutputcommitter.algorithm.version=1, mapreduce.reduce.skip.maxgroups=0, mapreduce.reduce.memory.mb=15360, yarn.nodemanager.health-checker.script.timeout-ms=1200000, dfs.datanode.du.reserved=536870912, mapreduce.client.progressmonitor.pollinterval=1000, yarn.resourcemanager.delegation.token.renew-interval=86400000, yarn.nodemanager.hostname=0.0.0.0, yarn.resourcemanager.ha.enabled=false, yarn.scheduler.minimum-allocation-vcores=1, yarn.app.mapreduce.am.container.log.limit.kb=0, hadoop.http.authentication.signature.secret.file=${user.home}/hadoop-http-auth-signature-secret, mapreduce.jobhistory.move.interval-ms=180000, yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs=86400, yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor, hadoop.security.authorization=false, yarn.nodemanager.node-labels.provider=config, yarn.nodemanager.localizer.address=${yarn.nodemanager.hostname}:8040, mapreduce.jobhistory.recovery.store.fs.uri=${hadoop.tmp.dir}/mapred/history/recoverystore, hive.metastore.warehouse.dir=hdfs:///user/spark/warehouse, mapreduce.shuffle.connection-keep-alive.enable=false, hadoop.common.configuration.version=0.23.0, yarn.app.mapreduce.task.container.log.backups=0, hadoop.security.groups.negative-cache.secs=30, mapreduce.ifile.readahead=true, hadoop.security.kms.client.timeout=60, yarn.nodemanager.resource.percentage-physical-cpu-limit=100, mapreduce.job.max.split.locations=10, hadoop.registry.zk.quorum=localhost:2181, fs.s3a.threads.keepalivetime=60, fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem, mapreduce.jobhistory.joblist.cache.size=20000, mapreduce.job.end-notification.max.attempts=5, hadoop.security.groups.cache.background.reload=false, mapreduce.reduce.shuffle.connect.timeout=180000, mapreduce.jobhistory.webapp.address=ip-10-63-114-58.corp.stateauto.com:19888, fs.s3a.connection.timeout=200000, yarn.sharedcache.nm.uploader.replication.factor=10, hadoop.http.authentication.token.validity=36000, ipc.client.connect.max.retries.on.timeouts=5, yarn.timeline-service.client.internal-timers-ttl-secs=420, yarn.nodemanager.docker-container-executor.exec-name=/usr/bin/docker, yarn.app.mapreduce.am.job.committer.cancel-timeout=60000, dfs.ha.fencing.ssh.connect-timeout=30000, mapreduce.reduce.log.level=INFO, mapreduce.reduce.shuffle.merge.percent=0.66, ipc.client.fallback-to-simple-auth-allowed=false, io.serializations=org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization, fs.s3.block.size=67108864, yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user=nobody, hadoop.kerberos.kinit.command=kinit, hadoop.security.kms.client.encrypted.key.cache.expiry=43200000, 
yarn.resourcemanager.fs.state-store.uri=${hadoop.tmp.dir}/yarn/system/rmstore, yarn.dispatcher.drain-events.timeout=300000, yarn.admin.acl=*, mapreduce.reduce.merge.inmem.threshold=1000, yarn.cluster.max-application-priority=0, net.topology.impl=org.apache.hadoop.net.NetworkTopology, yarn.resourcemanager.ha.automatic-failover.enabled=true, yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, io.map.index.skip=0, dfs.namenode.handler.count=64, yarn.resourcemanager.webapp.https.address=${yarn.resourcemanager.hostname}:8090, yarn.nodemanager.admin-env=MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX, hadoop.security.crypto.cipher.suite=AES/CTR/NoPadding, mapreduce.task.profile.map.params=${mapreduce.task.profile.params}, hadoop.security.crypto.buffer.size=8192, yarn.nodemanager.aux-services.mapreduce_shuffle.class=org.apache.hadoop.mapred.ShuffleHandler, yarn.nodemanager.container-metrics.enable=false, fs.s3a.path.style.access=false, mapreduce.cluster.acls.enabled=false, yarn.sharedcache.uploader.server.address=0.0.0.0:8046, yarn.log-aggregation-status.time-out.ms=600000, fs.s3a.threads.max=10, fs.har.impl.disable.cache=true, mapreduce.tasktracker.map.tasks.maximum=3, ipc.client.connect.timeout=20000, yarn.nodemanager.remote-app-log-dir-suffix=logs, fs.df.interval=60000, hadoop.util.hash.type=murmur, mapreduce.jobhistory.minicluster.fixed.ports=false, yarn.app.mapreduce.shuffle.log.limit.kb=0, yarn.timeline-service.entity-group-fs-store.done-dir=/tmp/entity-file-history/done/, ha.zookeeper.acl=world:anyone:rwcda, yarn.resourcemanager.delegation.token.max-lifetime=604800000, mapreduce.job.speculative.speculative-cap-running-tasks=0.1, mapreduce.map.sort.spill.percent=0.80, yarn.nodemanager.recovery.supervised=true, file.stream-buffer-size=4096, yarn.resourcemanager.ha.automatic-failover.embedded=true, hive.metastore.uris=thrift://ip-10-63-114-58.corp.stateauto.com:9083, yarn.resourcemanager.nodemanager.minimum.version=NONE, yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size=10, yarn.sharedcache.webapp.address=0.0.0.0:8788, yarn.app.mapreduce.am.resource.mb=15360, mapreduce.framework.name=yarn, mapreduce.job.reduce.slowstart.completedmaps=0.05, yarn.resourcemanager.client.thread-count=64, hadoop.proxyuser.presto.groups=*, mapreduce.cluster.temp.dir=${hadoop.tmp.dir}/mapred/temp, mapreduce.jobhistory.intermediate-done-dir=${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate, fs.s3a.attempts.maximum=20}], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c]
19/10/15 03:05:36 INFO HoodieTableConfig: Loading dataset properties from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties
19/10/15 03:05:36 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/hoodie.properties' for reading
19/10/15 03:05:36 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:36 INFO HoodieTableMetaClient: Loading Active commit timeline for s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy
19/10/15 03:05:36 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@6b874125
19/10/15 03:05:36 INFO HoodieHiveClient: Creating hive connection jdbc:hive2://ip-10-63-114-58.corp.stateauto.com:10000
19/10/15 03:05:36 INFO Utils: Supplied authorities: ip-10-63-114-58.corp.stateauto.com:10000
19/10/15 03:05:36 INFO Utils: Resolved authority: ip-10-63-114-58.corp.stateauto.com:10000
19/10/15 03:05:36 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://ip-10-63-114-58.corp.stateauto.com:10000
19/10/15 03:05:37 INFO HoodieHiveClient: Successfully established Hive connection to  jdbc:hive2://ip-10-63-114-58.corp.stateauto.com:10000
19/10/15 03:05:37 INFO metastore: Trying to connect to metastore with URI thrift://ip-10-63-114-58.corp.stateauto.com:9083
19/10/15 03:05:37 INFO metastore: Opened a connection to metastore, current connections: 1
19/10/15 03:05:37 INFO metastore: Connected to metastore.
19/10/15 03:05:37 INFO HiveSyncTool: Trying to sync hoodie table hudi_gwpl_pc_policy with base path s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy of type COPY_ON_WRITE
19/10/15 03:05:37 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/.hoodie/20191015030518.commit' for reading
19/10/15 03:05:37 INFO HoodieHiveClient: Reading schema from s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/2018-07-01/d069dd86-cd5b-44d7-b59a-cffb6afc3b1c-0_0-26-12148_20191015030518.parquet
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 431
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 420
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 401
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 384
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 423
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 462
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_16_piece0 on ip-10-63-114-58.corp.stateauto.com:43403 in memory (size: 58.5 KB, free: 912.1 MB)
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_16_piece0 on ip-10-63-114-115.corp.stateauto.com:36209 in memory (size: 58.5 KB, free: 1458.4 MB)
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_19_piece0 on ip-10-63-114-58.corp.stateauto.com:43403 in memory (size: 2.6 KB, free: 912.1 MB)
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_19_piece0 on ip-10-63-114-115.corp.stateauto.com:36209 in memory (size: 2.6 KB, free: 1458.4 MB)
19/10/15 03:05:37 INFO ContextCleaner: Cleaned accumulator 400
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_18_piece0 on ip-10-63-114-58.corp.stateauto.com:43403 in memory (size: 57.4 KB, free: 912.1 MB)
19/10/15 03:05:37 INFO BlockManagerInfo: Removed broadcast_18_piece0 on ip-10-63-114-114.corp.stateauto.com:42747 in memory (size: 57.4 KB, free: 1458.5 MB)
19/10/15 03:05:37 INFO S3NativeFileSystem: Opening 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy/2018-07-01/d069dd86-cd5b-44d7-b59a-cffb6afc3b1c-0_0-26-12148_20191015030518.parquet' for reading
19/10/15 03:05:38 INFO HiveSyncTool: Table hudi_gwpl_pc_policy is not found. Creating it
19/10/15 03:05:38 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS uat_hoodie_staging.hudi_gwpl_pc_policy( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `deleteTime` bigint, `NewProducerCode_Ext` bigint, `DoNotPurge` boolean, `PublicID` string, `PriorPremiums` string, `IssueDate` bigint, `PriorPremiums_cur` int, `MovedPolicySourceAccountID` bigint, `AccountID` bigint, `CreateTime` bigint, `LossHistoryType` int, `ExcludedFromArchive` boolean, `ArchiveState` int, `ArchiveSchemaInfo` bigint, `ArchiveFailureDetailsID` bigint, `PackageRisk` int, `NumPriorLosses` int, `UpdateTime` bigint, `PrimaryLanguage` int, `DoNotArchive` boolean, `ID` bigint, `PrimaryLocale` int, `ProductCode` string, `ExcludeReason` string, `CreateUserID` bigint, `ArchiveFailureID` bigint, `OriginalEffectiveDate` bigint, `BeanVersion` int, `ArchivePartition` bigint, `Retired` bigint, `LossHistoryType_Ext` int, `UpdateUserID` bigint, `PriorTotalIncurred` string, `ArchiveDate` bigint, `PriorTotalIncurred_cur` int, `ProducerCodeOfServiceID` bigint, `UL_BOPEligibility_Ext` boolean, `isDmvReported` boolean, `ClueStatusExt` boolean, `LossHistoryTypeComm_Ext` int, `ClueStatusDetail` bigint, `uniqueId` string, `pctl_archivestate_typecode` string, `pctl_archivestate_name` string, `pctl_archivestate_description` string, `pctl_losshistorytype_typecode2` string, `pctl_losshistorytype_name2` string, `pctl_losshistorytype_description2` string, `pctl_losshistorytype_typecode1` string, `pctl_losshistorytype_name1` string, `pctl_losshistorytype_description1` string, `pctl_losshistorytype_ext_typecode` string, `pctl_losshistorytype_ext_name` string, `pctl_losshistorytype_ext_description` string, `pctl_packagerisk_typecode` string, `pctl_packagerisk_name` string, `pctl_packagerisk_description` string, `pctl_languagetype_typecode` string, `pctl_languagetype_name` string, `pctl_languagetype_description` string, `pctl_localetype_typecode` string, `pctl_localetype_name` string, `pctl_localetype_description` string, `pctl_currency_typecode1` string, `pctl_currency_name1` string, `pctl_currency_description1` string, `pctl_currency_typecode2` string, `pctl_currency_name2` string, `pctl_currency_description2` string, `ingestiondt` string) PARTITIONED BY (`batch` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy'
19/10/15 03:05:38 INFO HoodieHiveClient: Executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS uat_hoodie_staging.hudi_gwpl_pc_policy( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `deleteTime` bigint, `NewProducerCode_Ext` bigint, `DoNotPurge` boolean, `PublicID` string, `PriorPremiums` string, `IssueDate` bigint, `PriorPremiums_cur` int, `MovedPolicySourceAccountID` bigint, `AccountID` bigint, `CreateTime` bigint, `LossHistoryType` int, `ExcludedFromArchive` boolean, `ArchiveState` int, `ArchiveSchemaInfo` bigint, `ArchiveFailureDetailsID` bigint, `PackageRisk` int, `NumPriorLosses` int, `UpdateTime` bigint, `PrimaryLanguage` int, `DoNotArchive` boolean, `ID` bigint, `PrimaryLocale` int, `ProductCode` string, `ExcludeReason` string, `CreateUserID` bigint, `ArchiveFailureID` bigint, `OriginalEffectiveDate` bigint, `BeanVersion` int, `ArchivePartition` bigint, `Retired` bigint, `LossHistoryType_Ext` int, `UpdateUserID` bigint, `PriorTotalIncurred` string, `ArchiveDate` bigint, `PriorTotalIncurred_cur` int, `ProducerCodeOfServiceID` bigint, `UL_BOPEligibility_Ext` boolean, `isDmvReported` boolean, `ClueStatusExt` boolean, `LossHistoryTypeComm_Ext` int, `ClueStatusDetail` bigint, `uniqueId` string, `pctl_archivestate_typecode` string, `pctl_archivestate_name` string, `pctl_archivestate_description` string, `pctl_losshistorytype_typecode2` string, `pctl_losshistorytype_name2` string, `pctl_losshistorytype_description2` string, `pctl_losshistorytype_typecode1` string, `pctl_losshistorytype_name1` string, `pctl_losshistorytype_description1` string, `pctl_losshistorytype_ext_typecode` string, `pctl_losshistorytype_ext_name` string, `pctl_losshistorytype_ext_description` string, `pctl_packagerisk_typecode` string, `pctl_packagerisk_name` string, `pctl_packagerisk_description` string, `pctl_languagetype_typecode` string, `pctl_languagetype_name` string, `pctl_languagetype_description` string, `pctl_localetype_typecode` string, `pctl_localetype_name` string, `pctl_localetype_description` string, `pctl_currency_typecode1` string, `pctl_currency_name1` string, `pctl_currency_description1` string, `pctl_currency_typecode2` string, `pctl_currency_name2` string, `pctl_currency_description2` string, `ingestiondt` string) PARTITIONED BY (`batch` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy'
19/10/15 03:05:41 INFO HiveSyncTool: Schema sync complete. Syncing partitions for hudi_gwpl_pc_policy
19/10/15 03:05:41 INFO HiveSyncTool: Last commit time synced was found to be null
19/10/15 03:05:41 INFO HoodieHiveClient: Last commit time synced is not known, listing all partitions in s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy,FS :com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c
19/10/15 03:05:41 INFO HiveSyncTool: Storage partitions scan complete. Found 1
org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table hudi_gwpl_pc_policy
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:172)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:107)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:67)
  at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
  ... 70 elided
Caused by: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: uat_hoodie_staging.hudi_gwpl_pc_policy table not found
  at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
  at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
  at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$get_partitions_result.read(ThriftHiveMetastore.java)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
  at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions(ThriftHiveMetastore.java:2377)
  at org.apache.hudi.org.apache.hadoop_hive.metastore.api.ThriftHiveMetastore$Client.get_partitions(ThriftHiveMetastore.java:2362)
  at org.apache.hudi.org.apache.hadoop_hive.metastore.HiveMetaStoreClient.listPartitions(HiveMetaStoreClient.java:1162)
  at org.apache.hudi.hive.HoodieHiveClient.scanTablePartitions(HoodieHiveClient.java:240)
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:162)
vinothchandar commented 4 years ago

What I see is that it created the table:

19/10/15 03:05:38 INFO HiveSyncTool: Table hudi_gwpl_pc_policy is not found. Creating it
19/10/15 03:05:38 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS uat_hoodie_staging.hudi_gwpl_pc_policy( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `deleteTime` bigint, `NewProducerCode_Ext` bigint, `DoNotPurge` boolean, `PublicID` string, `PriorPremiums` string, `IssueDate` bigint, `PriorPremiums_cur` int, `MovedPolicySourceAccountID` bigint, `AccountID` bigint, `CreateTime` bigint, `LossHistoryType` int, `ExcludedFromArchive` boolean, `ArchiveState` int, `ArchiveSchemaInfo` bigint, `ArchiveFailureDetailsID` bigint, `PackageRisk` int, `NumPriorLosses` int, `UpdateTime` bigint, `PrimaryLanguage` int, `DoNotArchive` boolean, `ID` bigint, `PrimaryLocale` int, `ProductCode` string, `ExcludeReason` string, `CreateUserID` bigint, `ArchiveFailureID` bigint, `OriginalEffectiveDate` bigint, `BeanVersion` int, `ArchivePartition` bigint, `Retired` bigint, `LossHistoryType_Ext` int, `UpdateUserID` bigint, `PriorTotalIncurred` string, `ArchiveDate` bigint, `PriorTotalIncurred_cur` int, `ProducerCodeOfServiceID` bigint, `UL_BOPEligibility_Ext` boolean, `isDmvReported` boolean, `ClueStatusExt` boolean, `LossHistoryTypeComm_Ext` int, `ClueStatusDetail` bigint, `uniqueId` string, `pctl_archivestate_typecode` string, `pctl_archivestate_name` string, `pctl_archivestate_description` string, `pctl_losshistorytype_typecode2` string, `pctl_losshistorytype_name2` string, `pctl_losshistorytype_description2` string, `pctl_losshistorytype_typecode1` string, `pctl_losshistorytype_name1` string, `pctl_losshistorytype_description1` string, `pctl_losshistorytype_ext_typecode` string, `pctl_losshistorytype_ext_name` string, `pctl_losshistorytype_ext_description` string, `pctl_packagerisk_typecode` string, `pctl_packagerisk_name` string, `pctl_packagerisk_description` string, `pctl_languagetype_typecode` string, `pctl_languagetype_name` string, `pctl_languagetype_description` string, `pctl_localetype_typecode` string, `pctl_localetype_name` string, `pctl_localetype_description` string, `pctl_currency_typecode1` string, `pctl_currency_name1` string, `pctl_currency_description1` string, `pctl_currency_typecode2` string, `pctl_currency_name2` string, `pctl_currency_description2` string, `ingestiondt` string) PARTITIONED BY (`batch` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy'

Found the 1 storage partition (i.e. from your data on S3) to sync:

19/10/15 03:05:41 INFO HiveSyncTool: Schema sync complete. Syncing partitions for hudi_gwpl_pc_policy
19/10/15 03:05:41 INFO HiveSyncTool: Last commit time synced was found to be null
19/10/15 03:05:41 INFO HoodieHiveClient: Last commit time synced is not known, listing all partitions in s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy,FS :com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@36c6ad3c
19/10/15 03:05:41 INFO HiveSyncTool: Storage partitions scan complete. Found 1

But it cannot find the table when trying to sync them:

org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table hudi_gwpl_pc_policy
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:172)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:107)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:67)
...
Caused by: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: uat_hoodie_staging.hudi_gwpl_pc_policy table not found
...
 at org.apache.hudi.org.apache.hadoop_hive.metastore.HiveMetaStoreClient.listPartitions(HiveMetaStoreClient.java:1162)
  at org.apache.hudi.hive.HoodieHiveClient.scanTablePartitions(HoodieHiveClient.java:240)
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:162)

It almost seems like your metastore is not providing read-after-write consistency. What is the Hive metastore backed by, S3? I am guessing the Glue catalog is different from the Hive metastore? Could you give it a shot on EMR with the Hive metastore?
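
For anyone who wants to answer that question on their own cluster, a quick check is to dump the two relevant keys from the Spark session's Hadoop configuration. The sketch below only uses the standard Hive key names that also appear in the configuration dump earlier in this log; either lookup may return null if the key is not set on the driver.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: print which metastore implementation and thrift URI the running job sees.
// On the cluster in this log, the first line shows the Glue factory and the second a local thrift URI.
val spark = SparkSession.builder().getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("hive.metastore.client.factory.class")) // e.g. com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
println(hadoopConf.get("hive.metastore.uris"))                 // e.g. thrift://<metastore-host>:9083
```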

vinothchandar commented 4 years ago

@umehrot2 any ideas, top of your head?

gfn9cho commented 4 years ago

We are using the AWS Glue catalog as the external Hive metastore. The regular Spark job is able to create, write and read the tables. I guess HiveSyncTool is not able to read from the Glue catalog, since every time we run the job it creates the table again because it is not found. Is there a chance that HiveSyncTool is overriding the config and thus not able to see the Hive metastore as an external one? 19/10/15 03:05:38 INFO HiveSyncTool: Table hudi_gwpl_pc_policy is not found. Creating it.
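
For context, the sync in this thread is triggered from the Spark datasource write path. Below is a minimal sketch of such a write with the hive-sync options spelled out; the option keys are the standard Hudi 0.5.x datasource keys, the database, table, JDBC URL and base path are taken from the logs above, and the record key, precombine and partition-path columns are placeholders rather than this job's real settings.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: `df` is the incoming upsert DataFrame.
def writeWithHiveSync(df: DataFrame): Unit = {
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "hudi_gwpl_pc_policy")
    .option("hoodie.datasource.write.recordkey.field", "uniqueId")      // placeholder record key column
    .option("hoodie.datasource.write.precombine.field", "UpdateTime")   // placeholder precombine column
    .option("hoodie.datasource.write.partitionpath.field", "batch")     // assumed to match PARTITIONED BY (`batch`) in the DDL above
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.database", "uat_hoodie_staging")
    .option("hoodie.datasource.hive_sync.table", "hudi_gwpl_pc_policy")
    .option("hoodie.datasource.hive_sync.partition_fields", "batch")
    .option("hoodie.datasource.hive_sync.jdbcurl",
      "jdbc:hive2://ip-10-63-114-58.corp.stateauto.com:10000")          // from the log above
    .mode(SaveMode.Append)
    .save("s3://sa-l3-uat-emr-edl-processed/staging/hoodie/pc_policy")
}
```

Note that these options only feed HiveSyncTool's own configuration; the sync step builds its Hive connection separately from the Spark session, which appears to be why the Glue settings the session uses are not picked up here.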

umehrot2 commented 4 years ago

@gfn9cho @vinothchandar we are aware of this issue in Hudi. It does not currently work with the Glue catalog. We have solved it on our side and will be pushing a PR soon.

For your information, here is the explanation:

That is why your table gets created in the Glue metastore, but while reading or scanning partitions it is talking to the local Hive metastore, where it does not find the table it just created.

gfn9cho commented 4 years ago

@vinothchandar @umehrot2, thanks for the support and a great explanation. I will wait for the updated code to take it forward.

vinothchandar commented 4 years ago

Thanks for the excellent explanation! Much appreciated @umehrot2

gfn9cho commented 4 years ago

@umehrot2 @vinothchandar I did incorporate the changes on my end and could see the Hive table created with data synced. But then, while doing an update (Append/Overwrite) to the same table, I get the error below. I am not sure whether I am missing something on my end, but I wanted to bring it to your notice in case it is really an issue.

org.apache.hudi.hive.HoodieHiveSyncException: Failed to get table schema for hudi_new_gwpl_pc_policy_2
  at org.apache.hudi.hive.HoodieHiveClient.getTableSchema(HoodieHiveClient.java:289)
  at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:144)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:95)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:67)
  at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
  ... 67 elided
Caused by: org.apache.hive.service.cli.HiveSQLException: java.lang.NullPointerException
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:247)
  at org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:220)
  at org.apache.hudi.hive.HoodieHiveClient.getTableSchema(HoodieHiveClient.java:276)
  ... 94 more
Caused by: org.apache.hive.service.cli.HiveSQLException: java.lang.NullPointerException
  at org.apache.hive.service.cli.operation.GetColumnsOperation.runInternal(GetColumnsOperation.java:213)
  at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
  at org.apache.hive.service.cli.session.HiveSessionImpl.getColumns(HiveSessionImpl.java:678)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
  at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
  at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
  at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
  at com.sun.proxy.$Proxy41.getColumns(Unknown Source)
  at org.apache.hive.service.cli.CLIService.getColumns(CLIService.java:385)
  at org.apache.hive.service.cli.thrift.ThriftCLIService.GetColumns(ThriftCLIService.java:622)
  at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetColumns.getResult(TCLIService.java:1557)
  at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetColumns.getResult(TCLIService.java:1542)
  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
  at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
  at org.apache.hive.service.cli.operation.GetColumnsOperation.runInternal(GetColumnsOperation.java:173)
  ... 25 more
firecast commented 4 years ago

I also faced the same issue, although I was using a remote Hive instance instead of the AWS Glue Data Catalog. The quick fixes I made to work around the issue were the following:

  1. Change https://github.com/apache/incubator-hudi/blob/ed745dfdbf254bfc2ec6d9c7baed8ccbf571abab/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L169 to
    syncHive(basePath, fs, parameters, sqlContext)
  2. Change https://github.com/apache/incubator-hudi/blob/ed745dfdbf254bfc2ec6d9c7baed8ccbf571abab/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L231 to
    private def syncHive(basePath: Path, fs: FileSystem, parameters: Map[String, String], sqlContext: SQLContext): Boolean = {
  3. Add the following lines before this line: https://github.com/apache/incubator-hudi/blob/ed745dfdbf254bfc2ec6d9c7baed8ccbf571abab/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L235
    val hiveMetastoreURIs = sqlContext.sparkSession.conf.get(ConfVars.METASTOREURIS.varname)
    hiveConf.setVar(ConfVars.METASTOREURIS, hiveMetastoreURIs)

What this basically does is add the thrift URI you set while creating the Spark session to the Hive configuration. It is a temporary solution for anyone with a Spark configuration similar to mine.
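
Pulling those three edits together, here is a rough sketch of what the patched method could look like. It is based on the general shape of syncHive in the 0.5.x HoodieSparkSqlWriter (buildSyncConfig is the existing private helper in that file), so treat it as illustrative rather than the exact upstream code.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.conf.HiveConf.ConfVars
import org.apache.hudi.hive.HiveSyncTool
import org.apache.spark.sql.SQLContext

// Illustrative sketch of the patched method inside HoodieSparkSqlWriter.
private def syncHive(basePath: Path, fs: FileSystem, parameters: Map[String, String],
                     sqlContext: SQLContext): Boolean = {
  val hiveSyncConfig = buildSyncConfig(basePath, parameters)
  val hiveConf = new HiveConf()
  hiveConf.addResource(fs.getConf)
  // The workaround: copy the thrift URI configured on the Spark session into the
  // HiveConf that HiveSyncTool will use, so the sync talks to the same metastore.
  val hiveMetastoreURIs = sqlContext.sparkSession.conf.get(ConfVars.METASTOREURIS.varname)
  hiveConf.setVar(ConfVars.METASTOREURIS, hiveMetastoreURIs)
  new HiveSyncTool(hiveSyncConfig, hiveConf, fs).syncHoodieTable()
  true
}
```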

vinothchandar commented 4 years ago

@firecast can we open a JIRA and also open a PR around this? What do you think the root problem here was?

firecast commented 4 years ago

@vinothchandar Sure. Currently busy with a deployment at my company. I'll create a PR tomorrow and do some more research on it as well.

umehrot2 commented 4 years ago

@gfn9cho @vinothchandar We are aware of this issue as well.

This issue happens inside Hive code at this line: https://github.com/apache/hive/blob/rel/release-2.3.6/service/src/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java#L173

This is because primaryKeys turns up as null, which happens because of a bug in our internal Glue catalog client: if the primary key is not defined, it returns null instead of an empty list. That breaks this piece of code.

EMR will be adding Hudi support as an app in its next release, where we will fix this bug in the glue client as well. Then your updates will succeed too. Until then, maybe you can try defining a primary key to get around this bug?

gfn9cho commented 4 years ago

@umehrot2 @vinothchandar, are you able to add a primary key? I am getting the error below: createTableWithConstraints is not supported or addPrimaryKey is not supported. It looks like this is not yet implemented in the AWS Glue catalog, as the code there just throws this exception. Do we know an approximate ETA for when this will be available in EMR? It seems like a deadlock: on one end, the DDL to add a constraint is not supported, and on the other, the client checks for a primary key and returns NULL. Please let me know if any other workaround is available. Could it be handled here, by catching the exception and returning an empty list: https://github.com/apache/hive/blob/rel/release-2.3.6/service/src/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java#L171

umehrot2 commented 4 years ago

@gfn9cho you are right, the Glue Catalog does not support primary keys. It is not actually a problem with the Glue service, but with EMR's glue client implementation, which returns null because primary keys are not supported; Hive is not able to deal with that correctly.

At this point, we cannot just give you something that would make it work. Please note that, at this point, Hudi is not an officially supported application on EMR. It should be supported by mid/end of November, which is when this issue will be fixed on EMR's side as well.

If you cannot wait until then, here is one way to unblock yourself:

gfn9cho commented 4 years ago

@umehrot2 Thanks much for your inputs. It looks a bit tricky to manage the versions of all the dependencies involved with the glue client. I am working on getting approvals at work to try this in a mock environment and will keep this post informed.

vinothchandar commented 4 years ago

Closing due to inactivity. Please reopen if needed

jinshuangxian commented 4 years ago

> @gfn9cho you are right, the Glue Catalog does not support primary keys. It is not actually a problem with the Glue service, but with EMR's glue client implementation, which returns null because primary keys are not supported; Hive is not able to deal with that correctly.
>
> At this point, we cannot just give you something that would make it work. Please note that, at this point, Hudi is not an officially supported application on EMR. It should be supported by mid/end of November, which is when this issue will be fixed on EMR's side as well.
>
> If you cannot wait until then, here is one way to unblock yourself:

Hi, is there a new version of aws-glue-data-catalog-client-for-apache-hive-metastore?

vinothchandar commented 4 years ago

@umehrot2 Looks like glue integration keeps coming up :) Do you want to chime in here? Maybe it would be good to also track some follow-up from these issues (if any) towards the next release? Let me know if you are interested in driving this.

umehrot2 commented 4 years ago

@jinshuangxian This particular issue has been fixed since our first release of Hudi on emr-5.28.0. So, you can use either emr-5.28.0 or emr-5.29.0 without this issue. Let me know if you are running into an actual issue.

umehrot2 commented 4 years ago

@vinothchandar I am happy to take up this integration piece. I have been keeping close track of all the discussions around Hive/Glue catalog sync issues, and most of them have been around misconfigurations. One issue that is relevant is that schema evolution does not work against the Glue catalog, and I will create a JIRA for that.

I also think we can add some questions related to Glue misconfigurations to the FAQ. I can add them there, but would like to know the process for doing that.

vinothchandar commented 4 years ago

@umehrot2 For some of the misconfigs, we could add them to the troubleshooting guide that @pratyakshsharma is putting together. This will reduce our support cost significantly.

> One issue that is relevant is that schema evolution does not work against the Glue catalog, and I will create a JIRA for that.

+1. Thanks @umehrot2 for being so awesome.