Armelabdelkbir opened this issue 2 months ago
1. For the first case (adding a column), I reordered my dataset columns by moving _hoodie_is_deleted away from the end, and it works well:
cdc_hudi=> ALTER TABLE schema_test ADD COLUMN department VARCHAR(100);
ALTER TABLE
cdc_hudi=> INSERT INTO schema_test (role, salary, department) VALUES ('Developer', 85000.00, 'Engineering');
INSERT 0 1
In the Hudi table we can see the new column:
CREATE TABLE `cdc_hudi`.`schema_test_ro` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`_hoodie_is_deleted` BOOLEAN,
`ts_ms` BIGINT,
`op` STRING,
`id` INT,
`role` STRING,
`salary` DOUBLE,
`department` STRING)
USING hudi
OPTIONS (
`hoodie.query.as.ro.table` 'true')
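For reference, the DDL above can be pulled from a spark-shell with something like the following (a sketch, assuming the Hive-synced table name used in this thread):
// Inspect the synced table definition and query the newly added column
spark.sql("SHOW CREATE TABLE cdc_hudi.schema_test_ro").show(false)
spark.sql("SELECT id, role, salary, department FROM cdc_hudi.schema_test_ro").show(false)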
I also tried to rename a column, but it does not work as expected. Rename the column in PG:
cdc_hudi=> ALTER TABLE schema_test
RENAME COLUMN department TO team;
ALTER TABLE
cdc_hudi=> INSERT INTO schema_test (role, salary, team)
VALUES ('DataOps', 85000.00, 'Engineering');
In the Hudi table I see both columns, the old one and the new one:
CREATE TABLE `cdc_hudi`.`schema_test_ro` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`_hoodie_is_deleted` BOOLEAN,
`ts_ms` BIGINT,
`op` STRING,
`id` INT,
`role` STRING,
`salary` DOUBLE,
`department` STRING,
`team` STRING)
USING hudi
OPTIONS (
`hoodie.query.as.ro.table` 'true')
spark.sql(s"select op,id,role,salary,department,team from cdc_hudi.schema_test_ro").show(100)
+---+---+---------+--------+-----------+-----------+
| op| id| role| salary| department| team|
+---+---+---------+--------+-----------+-----------+
| c| 2| Manager| 90000.0| null| null|
| c| 3| Analyst|65000.25| null| null|
| c| 4| Engineer| 75000.5| null| null|
| c| 5|Developer| 85000.0|Engineering| null|
| c| 6| DataOps| 85000.0| null|Engineering|
+---+---+---------+--------+-----------+-----------+
@Armelabdelkbir This is expected, as Hudi doesn't support renames. It treats the new batch as a schema evolution and adds the column at the end.
Also worth realising: with incremental ingestion alone, how would Hudi identify from the Debezium payload that this was a rename? It could just as well be a column drop plus the addition of a new column.
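To illustrate the ambiguity (a toy sketch, not Hudi code; the field lists are taken from the schemas above):
// From the before/after field names alone, a rename cannot be told apart from a drop + add,
// because the Debezium/Avro schema carries field names, not rename events.
val before = Seq("id", "role", "salary", "department")
val after  = Seq("id", "role", "salary", "team")
println(before.diff(after))  // List(department) -> looks like a dropped column
println(after.diff(before))  // List(team)       -> looks like a brand-new column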
@ad1happy2go Thanks, I understand for RENAME. What about DROP COLUMN? I tried to drop a column from the table, but meta-sync doesn't sync the new schema: when I insert some values I still see the old column (with NULL values), even though it is gone in PG:
cdc_hudi=> alter table schema_test DROP team ;
ALTER TABLE
cdc_hudi=> select * from schema_test ;
id | role | salary
----+-----------+----------
2 | Manager | 90000
3 | Analyst | 65000.25
4 | Engineer | 75000.5
5 | Developer | 85000
6 | DataOps | 85000
cdc_hudi=> INSERT INTO schema_test (role, salary) VALUES
('Tea Engineer','200M')
CREATE TABLE `cdc_hudi`.`schema_test_ro` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`_hoodie_is_deleted` BOOLEAN,
`ts_ms` BIGINT,
`op` STRING,
`id` INT,
`role` STRING,
`salary` STRING,
`team` STRING)
USING hudi
OPTIONS (
`hoodie.query.as.ro.table` 'true')
spark.sql(s"select id, role, salary, team from cdc_hudi.schema_test_ro").show(20)
+---+-----------------+----------+-----------+
| id|             role|    salary|       team|
+---+-----------------+----------+-----------+
|  2|          Manager|   90000.0|Engineering|
|  3|          Analyst|  65000.25|Engineering|
|  4|         Engineer|   75000.5|Engineering|
|  5|        Developer|   85000.0|Engineering|
|  6|          DataOps|   85000.0|Engineering|
|  7|     Tea Engineer|      200M|       null|
+---+-----------------+----------+-----------+
Is this behavior expected, or am I missing something?
@Armelabdelkbir You may need to use https://hudi.apache.org/docs/configurations/#hoodiedatasourcewriteschemaallowautoevolutioncolumndrop for that.
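A minimal sketch of where that flag goes, assuming a Spark datasource write (with DeltaStreamer the same key is passed as --hoodie-conf hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true, as in the spark-submit example later in this thread); the table name, keys and path below are illustrative:
import spark.implicits._

// Illustrative batch with a reduced set of columns, as after a column drop upstream
val batch = Seq((7, "Tea Engineer", 85000.0, 1700000000000L)).toDF("id", "role", "salary", "ts_ms")

batch.write.format("hudi").
  option("hoodie.table.name", "schema_test").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts_ms").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.write.schema.allow.auto.evolution.column.drop", "true").  // allow the incoming schema to lose columns
  mode("append").
  save("file:///tmp/hudi/schema_test")  // illustrative base path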
@ad1happy2go I added this property and tested again, but the result is still the same: I get null values and the schema still contains the old column. This time I dropped the salary column in PostgreSQL:
+---+-----------------+------+
| id| role|salary|
+---+-----------------+------+
| 2| Manager| null|
| 3| Analyst| null|
| 4| Engineer| null|
| 5| Developer| null|
| 6| DataOps| null|
| 7| DataOps| null|
| 8|Rockstar engineer| null|
| 9| Tea Engineer| null|
| 11| Beer Engineer| null|
| 10| Tea Engineer| null|
| 12| Coffee engineer| null|
| 13| Coffee engineer| null|
| 14| Coffee engineer| null|
+---+-----------------+------+
@Armelabdelkbir Do you have the Debezium payload that you are getting after the column drop? I can test it out.
Hi @Armelabdelkbir
I have tested your scenario with the following code. Dropping a column also worked without any issues.
Step1: Create a database and table in postgres
CREATE DATABASE cdc_hudi;
\connect cdc_hudi;
CREATE TABLE employees (
id INT PRIMARY KEY NOT NULL,
name VARCHAR(100) NOT NULL,
age INT,
salary DECIMAL(10, 2),
department VARCHAR(50)
);
ALTER TABLE employees REPLICA IDENTITY FULL;
Step2: Configure the debezium source connector for postgres db.
vi debezium-source-postgres.json
{
"name": "postgres-debezium-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"plugin.name": "pgoutput",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname": "cdc_hudi",
"topic.prefix": "cdc_topic",
"database.server.name": "postgres",
"schema.include.list": "public",
"table.include.list": "public.employees",
"publication.autocreate.mode": "filtered",
"tombstones.on.delete": "false",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081/",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081/",
"slot.name": "pgslot1"
}
}
Step3: Create the postgres-debezium-connector using the above configuration and verify its status.
curl -s -X POST -H 'Accept: application/json' -H "Content-Type:application/json" \
-d @debezium-source-postgres.json http://localhost:8083/connectors/ | jq
curl -s -X GET -H "Content-Type:application/json" http://localhost:8083/connectors/postgres-debezium-connector/status | jq
Step4: Insert one record into the Postgres DB and verify that the Kafka topic is created.
INSERT INTO employees (id, name, age, salary, department) VALUES (1, 'Ranga', 30, 50000.00, 'Sales');
SELECT * FROM employees;
$CONFLUENT_HOME/bin/kafka-topics --list --bootstrap-server localhost:9092 | grep employees
cdc_topic.public.employees
Step5: Run the Kafka consumer to verify that the data can be consumed.
$CONFLUENT_HOME/bin/kafka-avro-console-consumer \
--bootstrap-server localhost:9092 \
--from-beginning \
--topic cdc_topic.public.employees
Step6: Run the Spark application
HUDI_HOME=/Install/apache/hudi
HUDI_UTILITIES_JAR=$(ls $HUDI_HOME/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle*.jar | grep -v sources | grep -v tests)
$SPARK_HOME/bin/spark-submit \
--master "local[2]" \
--deploy-mode client \
--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false' \
--conf 'spark.hadoop.spark.sql.parquet.binaryAsString=false' \
--conf 'spark.hadoop.spark.sql.parquet.int96AsTimestamp=true' \
--conf 'spark.hadoop.spark.sql.caseSensitive=false' \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_JAR \
--hoodie-conf hoodie.datasource.write.recordkey.field=id \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
--hoodie-conf bootstrap.servers=localhost:9092 \
--hoodie-conf schema.registry.url=http://localhost:8081 \
--hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/cdc_topic.public.employees-value/versions/latest \
--hoodie-conf hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer \
--hoodie-conf hoodie.deltastreamer.source.kafka.topic=cdc_topic.public.employees \
--hoodie-conf auto.offset.reset=earliest \
--hoodie-conf hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true \
--table-type MERGE_ON_READ \
--op UPSERT \
--target-base-path file:///tmp/debezium/postgres/employees \
--target-table employees_cdc \
--source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
--source-ordering-field _event_lsn \
--payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
--continuous \
--min-sync-interval-seconds 60
Step7: Perform the schema evolution operations and verify the data.
-- Adding Columns
ALTER TABLE employees ADD COLUMN address VARCHAR;
INSERT INTO employees (id, name, age, salary, department, address) VALUES (2, 'Nishanth', 30, 50000.00, 'Sales', 'Bangalore');
SELECT * FROM employees;
Step8: Launch another spark job and verify the data
export SPARK_VERSION=3.5
$SPARK_HOME/bin/spark-shell \
--master "local[2]" \
--deploy-mode client \
--packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false' \
--conf 'spark.hadoop.spark.sql.parquet.binaryAsString=false' \
--conf 'spark.hadoop.spark.sql.parquet.int96AsTimestamp=true' \
--conf 'spark.hadoop.spark.sql.caseSensitive=false' \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
val basePath="file:///tmp/debezium/postgres/employees"
spark.read.format("hudi").load(basePath).show(false)
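Optionally, the evolved schema can be checked explicitly (a small sketch using the same basePath as above; the address column comes from Step7):
// Confirm the Hudi table picked up the new `address` column after the upstream ALTER
val afterAdd = spark.read.format("hudi").load(basePath)
afterAdd.printSchema()
afterAdd.select("id", "name", "address").show(false)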
Step9: You can repeat the further steps below and verify the data.
-- Renaming a column
ALTER TABLE employees RENAME COLUMN department TO dept;
INSERT INTO employees (id, name, age, salary, dept, address) VALUES (3, 'Vinod', 35, 40000.00, 'HR', 'Andhrapradesh');
SELECT * FROM employees;
ALTER TABLE employees DROP column dept;
INSERT INTO employees (id, name, age, salary, address) VALUES (4, 'Manjo', 35, 40000.00, 'Chennai');
SELECT * FROM employees;
-- Type change column
ALTER TABLE employees ALTER COLUMN age TYPE bigint;
INSERT INTO employees (id, name, age, salary, address) VALUES (5, 'Raja', 59, 70000.00, 'Bangalore');
SELECT * FROM employees;
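Back in the spark-shell from Step8, the table can be re-read after each of the changes above (a sketch; dept is not selected since it has been dropped by this point):
// Re-read the Hudi table and spot-check the surviving columns after rename / drop / type change
val latest = spark.read.format("hudi").load(basePath)
latest.printSchema()
latest.select("id", "name", "age", "salary", "address").orderBy("id").show(false)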
Spark-Submit Output:
Postgres Output:
Kafka topic output:
@Armelabdelkbir As demonstrated by @rangareddy, dropping a column works as expected with that config. Let us know in case you have any more issues on this. Thanks.
Describe the problem you faced
Hello, I'm trying to test several schema evolution use cases using Hudi 0.15 and Spark 3.5 with HMS 4.
First test: adding a column in PG --> Debezium / Schema Registry OK --> Hudi (MOR) HiveSyncTool KO
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert field Type from DOUBLE to string for field salary
When I drop the tables (ro / rt) and restart my job, it creates the new tables correctly, but that is not a production-grade way to handle schema evolution.
To Reproduce
Steps to reproduce the behavior:
1. Alter the table and insert a row in PG:
cdc_hudi=> ALTER TABLE employees ADD COLUMN test_str VARCHAR;
ALTER TABLE
cdc_hudi=> INSERT INTO employees (name, department, username, test_str) VALUES ('armel011', 'Engineering', 'arm23220', 'teststr');
INSERT 0 1
2. Debezium / Schema Registry OK (latest schema version contains the added column)
3. Restart the Spark job
Expected behavior
Case 1: add column
Case 2: rename column
Case 3: change datatype double to string
Case 4: drop column
Environment Description
Hudi version : 0.15.0
Spark version : 3.5.1
Hive version : 4.0.0
Hadoop version : 3.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no