Open rshanmugam1 opened 1 year ago
Hi @rshanmugam1, how do you read your table? It looks like the base files from both commits are being read. Hudi should only show you records from the base file for the latest commit.
@kazdy I was using Trino + open-source Hudi to read this.
To simplify things, I tried using spark-shell inside EMR, which gives the same result.
spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
spark.read.format("hudi").load("....").show(false)
or
spark.sql("select * from db.table").show(false)
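As a quick sanity check, duplicates can be detected by grouping on the record key and counting. A minimal pure-Python sketch of that check (no Spark; the sample rows are hypothetical, mimicking each id showing up once per commit as reported above):

```python
from collections import Counter

# Hypothetical rows as (record_key, commit_time) pairs, mimicking what the
# reader returned: each id appears once per commit, i.e. duplicated.
rows = [
    ("1", "20221202121542"), ("2", "20221202121542"), ("3", "20221202121542"),
    ("1", "20221202121601"), ("2", "20221202121601"), ("3", "20221202121601"),
]

key_counts = Counter(key for key, _ in rows)
duplicates = {key: n for key, n in key_counts.items() if n > 1}
print(duplicates)  # every key appears twice, so all three ids are flagged
```

The equivalent check in the spark-shell session above would be a `group by id having count(*) > 1` query against the table.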
Also, looking at the commit history: the second commit shows only 1 upsert, but it should have been 3, right?
Another thing I see is that no precombine field is set on the Hudi table. Spark requires this field to do updates; otherwise it may fall back to insert and create duplicates. Maybe it's worth defining a precombine field and giving it a try?
Tried with a precombine key; same behavior.
{{ config(
materialized = 'incremental',
incremental_strategy = 'merge',
file_format = 'hudi',
options={
'type': 'cow',
'primaryKey': 'id',
'preCombineKey': 'ts',
},
unique_key = 'id',
) }}
{% if not is_incremental() %}
select cast(1 as bigint) as id, 'yo' as msg, current_timestamp() as ts
union all
select cast(2 as bigint) as id, 'anyway' as msg, current_timestamp() as ts
union all
select cast(3 as bigint) as id, 'bye' as msg, current_timestamp() as ts
{% else %}
select cast(1 as bigint) as id, 'yo_updated' as msg, current_timestamp() as ts
union all
select cast(2 as bigint) as id, 'anyway_updated' as msg, current_timestamp() as ts
union all
select cast(3 as bigint) as id, 'bye_updated' as msg, current_timestamp() as ts
{% endif %}
dbt queries
create table analytics.test_merge_3
using hudi
options (type "cow" , primaryKey "id" , preCombineKey "ts")
as
select cast(1 as bigint) as id, 'yo' as msg, current_timestamp() as ts
union all
select cast(2 as bigint) as id, 'anyway' as msg, current_timestamp() as ts
union all
select cast(3 as bigint) as id, 'bye' as msg, current_timestamp() as ts
merge into analytics.test_merge_3 as DBT_INTERNAL_DEST
using test_merge_3__dbt_tmp as DBT_INTERNAL_SOURCE
on
DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
when matched then update set
*
when not matched then insert *
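For reference, the intended semantics of the generated MERGE above (matched rows updated, unmatched rows inserted, keyed on `id`) can be sketched in plain Python; the values mirror the model's first and second runs:

```python
# Target table state after the first run, keyed by id.
target = {
    1: {"id": 1, "msg": "yo"},
    2: {"id": 2, "msg": "anyway"},
    3: {"id": 3, "msg": "bye"},
}

# Source rows produced by the incremental (second) run.
source = [
    {"id": 1, "msg": "yo_updated"},
    {"id": 2, "msg": "anyway_updated"},
    {"id": 3, "msg": "bye_updated"},
]

# MERGE: when matched then update set *, when not matched then insert *.
for row in source:
    target[row["id"]] = row  # update-or-insert; never two rows per key

print(len(target))  # a correct merge leaves exactly 3 rows, no duplicates
```

The bug reported here is that the table ends up with six rows instead, i.e. the second run behaves like an insert rather than this update-or-insert.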
spark stages
hoodie.properties
#Properties saved on Fri Dec 02 12:15:42 UTC 2022
#Fri Dec 02 12:15:42 UTC 2022
hoodie.table.partition.fields=
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.version=3
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=test_merge_3
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:"long"},{"name"\:"msg","type"\:"string"},{"name"\:"ts","type"\:{"type"\:"long","logicalType"\:"timestamp-micros"}}]}
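The `\:` sequences in `hoodie.table.create.schema` are Java-properties escapes; unescaping them yields the plain JSON Avro schema. A small sketch using a shortened sample of the stored value (not the full schema above):

```python
import json

# Shortened sample of the escaped value stored in hoodie.properties.
raw = '{"type"\\:"record","name"\\:"topLevelRecord","fields"\\:[{"name"\\:"id","type"\\:"long"}]}'

# Undo the Java-properties escaping, then parse as JSON.
schema = json.loads(raw.replace("\\:", ":"))
print([f["name"] for f in schema["fields"]])  # ['id']
```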
@jonvex: Can you follow up here?
You also need to set a partitionpath
@jonvex Can you give more information about this answer? That doesn't seem to solve the issue
@jonvex @nsivabalan I'm running into the same issue as well. Is anyone currently investigating this?
Running into the same issue too. Hope this gets looked at
@faizhasan @rshanmugam1 Working on reproducing this with precombineField. Will update soon.
Without a precombine key, it falls back to insert anyway, and you will see duplicates.
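The fallback described above can be illustrated in plain Python: an upsert keyed on the record key (with `ts` standing in for the precombine field) overwrites existing rows, while a plain insert only appends, so replaying the same keys produces duplicates. The data here is hypothetical:

```python
# First-run rows, then a replay of the same keys with newer ts values.
batch1 = [{"id": 1, "msg": "yo", "ts": 1}, {"id": 2, "msg": "anyway", "ts": 1}]
batch2 = [{"id": 1, "msg": "yo_updated", "ts": 2}, {"id": 2, "msg": "anyway_updated", "ts": 2}]

# Upsert path (precombine key set): keep the row with the larger ts per id.
table = {}
for row in batch1 + batch2:
    cur = table.get(row["id"])
    if cur is None or row["ts"] >= cur["ts"]:
        table[row["id"]] = row
print(len(table))  # 2 rows, no duplicates

# Insert path (no precombine key): rows are only appended.
appended = batch1 + batch2
print(len(appended))  # 4 rows: every id appears twice
```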
@faizhasan @rshanmugam1 Apologies for the delay here. I tried to reproduce this and found that it works fine with Hudi 0.12.1. The model I used is exactly like the one in this ticket:
{{ config(
materialized = 'incremental',
incremental_strategy = 'merge',
file_format = 'hudi',
options={
'type': 'cow',
'primaryKey': 'id',
'preCombineKey': 'ts',
},
unique_key = 'id',
location_root='file:///tmp/dbt/issue_7244_1/'
) }}
{% if not is_incremental() %}
select cast(1 as bigint) as id, 'yo' as msg, current_timestamp() as ts
union all
select cast(2 as bigint) as id, 'anyway' as msg, current_timestamp() as ts
union all
select cast(3 as bigint) as id, 'bye' as msg, current_timestamp() as ts
{% else %}
select cast(1 as bigint) as id, 'yo_updated' as msg, current_timestamp() as ts
union all
select cast(2 as bigint) as id, 'anyway_updated' as msg, current_timestamp() as ts
union all
select cast(3 as bigint) as id, 'bye_updated' as msg, current_timestamp() as ts
{% endif %}
Here are the results after the first and second run:
Verified using the latest master with the same model as @ad1happy2go used above, and successfully ran the model.
DBT run
18:45:29 Running with dbt=1.5.3
18:45:29 [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
18:45:29 [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
18:45:29 Registered adapter: spark=1.5.0
18:45:29 Found 1 model, 2 tests, 0 snapshots, 0 analyses, 357 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
18:45:29
18:45:31 Concurrency: 1 threads (target='dev')
18:45:31
18:45:31 1 of 1 START sql incremental model default.issue_7244_model .................... [RUN]
18:45:38 1 of 1 OK created sql incremental model default.issue_7244_model ............... [OK in 7.93s]
18:45:39
18:45:39 Finished running 1 incremental model in 0 hours 0 minutes and 9.22 seconds (9.22s).
18:45:39
18:45:39 Completed successfully
18:45:39
18:45:39 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
amrish@Amrishs-MBP github-issue-7244 %
spark-sql verification
spark-sql> show databases;
default
test_database1
Time taken: 2.562 seconds, Fetched 2 row(s)
spark-sql> use default
> ;
23/07/25 11:47:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Time taken: 0.125 seconds
spark-sql> show tables;
issue_7244_model
my_first_dbt_model
my_first_dbt_model1
my_second_dbt_model
Time taken: 0.263 seconds, Fetched 4 row(s)
spark-sql> select * from issue_7244_model
> ;
23/07/25 11:47:43 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/07/25 11:47:43 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
20230725114531327 20230725114531327_1_4 2 fbb84dbc-e72f-4ac6-990a-d0205e2aaab3-0_1-33-0_20230725114531327.parquet 2 anyway 2023-07-25 11:45:31.367
20230725114531327 20230725114531327_2_5 3 c1d85730-7a1a-4845-bb4a-1b7128f6de3d-0_2-34-0_20230725114531327.parquet 3 bye 2023-07-25 11:45:31.367
20230725114531327 20230725114531327_0_6 1 1da126fe-eb3a-4982-ab77-f294458eefea-0_0-32-0_20230725114531327.parquet 1 yo 2023-07-25 11:45:31.367
Time taken: 4.461 seconds, Fetched 3 row(s)
Also verified against Hudi versions 0.12.3 and 0.13.1. `dbt run` was successful in both cases, with Hudi tables getting created without duplicate rows when dbt is run twice in a row. @faizhasan @rshanmugam1 wondering if you are able to move to a more recent version of Hudi (0.12.3 or 0.13.1, for example)?
Note: Hudi 0.10.0 is not supported with Spark 3.2. Please see version support matrix here.
Hi @amrishlal apologies for the delay.
I was able to test this and saw the following behavior with dbt 1.6.2, using the dbt-spark adapter to execute models on a Thrift server:
@faizhasan Can you please also try with emr-6.12.0, as it includes Hudi 0.13.1, which has some fixes. Thanks.
I did try EMR 6.13.0 and found duplicates. Unfortunately I don't work for that org anymore, so I don't have the stack to test with any further.
@faizhasan You should only find duplicates if you don't have a precombine key configured, since without precombineField the operation type is insert. @amrishlal and I were never able to reproduce this issue, having run multiple models as discussed above.
This is my dbt model. If I run it twice, it creates duplicates. Am I missing any obvious configuration?
A clear and concise description of the problem: the dbt merge model produces duplicates.
To Reproduce
Steps to reproduce the behavior:
Expected behavior: duplicates should not be present.
Environment Description
Hudi version : 0.10.1
Spark version : 3.2.0
Hive version : 2.3.3
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context: this is EMR emr-6.6.0, with a Thrift server running on it.
table properties
Create Query
Merge second run