apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi CDC upserts stopped working after migrating from Hudi 0.13.x to 0.14.0 #10884

Closed ROOBALJINDAL closed 6 months ago

ROOBALJINDAL commented 7 months ago

Issue:

We have migrated from Hudi 0.13.0 to Hudi 0.14.0, and in this version CDC upserts from Kafka are not working. The table is created the first time, but afterwards any record added or updated in the SQL table (which pushes a CDC event to Kafka) does not get applied to the Hudi table. Is there any new configuration required for Hudi 0.14.0?

We are running AWS EMR Serverless 6.15. We tried to enable debug-level logging by providing the following classification to the serverless application, which modifies the Log4j properties to print logs from the org.apache.hudi packages, but the Hudi logs still do not appear.

[
  {
    "classification": "spark-driver-log4j2",
    "properties": {
      "rootLogger.level": "debug",
      "logger.hudi.level": "debug",
      "logger.hudi.name": "org.apache.hudi"
    }
  },
  {
    "classification": "spark-executor-log4j2",
    "properties": {
      "rootLogger.level": "debug",
      "logger.hudi.level": "debug",
      "logger.hudi.name": "org.apache.hudi"
    }
  }
]

Since this is serverless, we cannot SSH into a node to inspect the Log4j properties file, so we could not get the Hudi logs.

Configurations:

### Spark job parameters:

--class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer
--conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED
--conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED
--conf spark.executor.instances=1
--conf spark.executor.memory=4g
--conf spark.driver.memory=4g
--conf spark.driver.cores=4
--conf spark.dynamicAllocation.initialExecutors=1
--props kafka-source.properties
--config-folder table-config
--payload-class com.myorg.MssqlDebeziumAvroPayload
--source-class com.myorg.MssqlDebeziumSource
--source-ordering-field _event_lsn
--enable-sync
--table-type COPY_ON_WRITE
--source-limit 1000000000
--op UPSERT

### kafka-source.properties:

hoodie.streamer.ingestion.tablesToBeIngested=database1.student
auto.offset.reset=earliest
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.streamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer
hoodie.streamer.schemaprovider.registry.url=
schema.registry.url=http://schema-registry-xxxxx:8080/apis/ccompat/v6
bootstrap.servers=b-1.xxxx.ikwdtc.c13.us-west-2.amazonaws.com:9096
hoodie.streamer.schemaprovider.registry.baseUrl=http://schema-registry-xxxxx:8080/apis/ccompat/v6/subjects/
hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="XXXX" password="xxxxx";
ssl.truststore.location=/usr/lib/jvm/java/jre/lib/security/cacerts
ssl.truststore.password=changeit

### Table config properties:

hoodie.datasource.hive_sync.database=database1
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
hoodie.datasource.write.recordkey.field=studentsid
hoodie.datasource.write.partitionpath.field=studentcreationdate
hoodie.datasource.hive_sync.table=student
hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
hoodie.datasource.hive_sync.partition_fields=studentcreationdate
hoodie.keygen.timebased.timestamp.type=SCALAR
hoodie.keygen.timebased.timestamp.scalar.time.unit=DAYS
hoodie.keygen.timebased.input.dateformat=yyyy-MM-dd
hoodie.keygen.timebased.output.dateformat=yyyy-MM-01
hoodie.keygen.timebased.timezone=GMT+8:00
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.hive_sync.mode=hms
hoodie.streamer.source.kafka.topic=dev.student
hoodie.streamer.schemaprovider.registry.urlSuffix=-value/versions/latest

Environment Description

ROOBALJINDAL commented 7 months ago

@nsivabalan can you please check?

ad1happy2go commented 7 months ago

@ROOBALJINDAL Is it possible to try the same on EMR so that you get all the logs to look into this further? There are no known changes in the 0.14.0 upgrade that could cause this.

ROOBALJINDAL commented 7 months ago

@ad1happy2go I need time to set up a new cluster. Our AWS MSK Kafka cluster uses Kafka version 2.6.2; can you confirm whether this is fine or whether it could be an issue? Is there a specific supported version of Kafka?

ad1happy2go commented 7 months ago

I don't think it is a Kafka version related issue, since the job is not failing. We need more logs to debug this.

ROOBALJINDAL commented 6 months ago

I have found the issue. We were using a custom MssqlDebeziumSource class as the Debezium source, and in its constructor we were using HoodieStreamerMetrics instead of HoodieIngestionMetrics (which was introduced in Hudi 0.14.0).
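For reference, below is a minimal sketch of the corrected wiring, assuming the custom source extends Hudi's DebeziumSource. The base-class constructor arguments, the processDataset override, and the MSSQL-specific logic are illustrative assumptions, not taken verbatim from this thread; verify the exact signatures against the 0.14.0 source.

import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.ingestion.HoodieIngestionMetrics;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.hudi.utilities.sources.debezium.DebeziumSource;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MssqlDebeziumSource extends DebeziumSource {

  // In 0.14.0 the metrics argument is typed as HoodieIngestionMetrics;
  // keeping the pre-0.14.0 HoodieStreamerMetrics type here is what broke the upserts.
  public MssqlDebeziumSource(TypedProperties props,
                             JavaSparkContext sparkContext,
                             SparkSession sparkSession,
                             SchemaProvider schemaProvider,
                             HoodieIngestionMetrics metrics) {
    super(props, sparkContext, sparkSession, schemaProvider, metrics);
  }

  @Override
  protected Dataset<Row> processDataset(Dataset<Row> rowDataset) {
    // MSSQL-specific handling of the Debezium change events would go here (assumed placeholder).
    return rowDataset;
  }
}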

Once the class was corrected, it started working. We can close this issue.