apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Why does Hudi not support field deletions? #2331

Closed: brandon-stanley closed 3 years ago

brandon-stanley commented 3 years ago

Hi Hudi Team! I have a question about field deletions/schema evolution. The FAQ Documentation states the following:

Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility & evolution properties. This is a key aspect of having reliability in your ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly in DeltaStreamer schema provider configs or implicitly by Spark Datasource's Dataset schemas) is backwards compatible (e.g no field deletes, only appending new fields to schema), Hudi will seamlessly handle read/write of old and new data and also keep the Hive schema up-to date.

While reading the Confluent Documentation that is linked above, I noticed that "Delete fields" is an "allowed change" for BACKWARDS compatible schemas. I assume that the Avro schema that is tracked within Hudi is BACKWARDS compatible and therefore should allow field deletions, but the FAQ Documentation states otherwise. Can you please clarify the following:

  1. Why are field deletions not supported within Hudi?
  2. Is there a way to determine (and possibly update) the Avro schema compatibility type for a Hudi table?

bvaradar commented 3 years ago

@prashantwason @nbalajee @satishkotha : Can you please look into this ?

prashantwason commented 3 years ago

I think the distinction is in UPDATE use-cases. Consider this scenario:

t1: Insert with Schema 1: file1.parquet is created and records have schema1

t2: Update with Schema 2: Suppose schema 2 has one field deleted and a single record is being updated. This will lead to file1.parquet being read and re-written (after updating that single record) into file2.parquet. But none of the records in file2.parquet would have the deleted field, even though only one record was actually updated.

Another scenario is possible where the deleted field is later added back with a different, incompatible "type" (e.g. an int field was deleted and another field with the same name but a "string" type was added). Such a schema will have issues reading historical data within the dataset that was written with the older schema.

If you want to delete a field within a HUDI dataset, it may be simpler to copy the dataset using a new schema.
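For reference, a minimal sketch of that copy approach via the Spark datasource might look like the following (the paths, table name, record key/precombine fields, and the dropped column are hypothetical placeholders, not from this thread):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CopyTableWithoutField {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("copy-hudi-table-without-field")
        .getOrCreate();

    // Read the full contents of the existing Hudi table.
    Dataset<Row> df = spark.read().format("hudi").load("/data/old_table");

    // Drop the unwanted field, then bulk-insert into a brand-new table
    // whose schema simply never contained that field.
    df.drop("obsolete_field")
      .write()
      .format("hudi")
      .option("hoodie.table.name", "new_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode(SaveMode.Overwrite)
      .save("/data/new_table");
  }
}
```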

brandon-stanley commented 3 years ago

Thanks for your response @prashantwason.

Does this mean that Hudi's schema handling is essentially a wrapper around Avro, with an additional check that rejects field deletes because of the scenarios you listed above? I just want to confirm, because the documentation I linked previously states that field deletes are supported for BACKWARD compatible schemas within Avro:

[screenshot: Confluent compatibility table showing "Delete fields" as an allowed change for BACKWARD compatible schemas]

Cheers,

Brandon

prashantwason commented 3 years ago

That's correct.

HUDI does not have a full schema management system. The schema to be used is provided at the time of each write, where we validate that the schema being used for the current write is compatible with the existing schema (from the previous writes). Hence, HUDI schema management is very simplistic compared to the documentation you have referred to.

In producer-consumer systems, schema compatibility is a simpler job: the schema can be changed by upgrading the producer and consumer code to the newer schema, since all new data will be generated using a schema both sides understand, and there is no historical data with an older schema version left to process. Within HUDI, however, there are always versions of data saved with older schemas, and to continue to provide features like incremental reads (which read data over a time range) and updates (old data can be changed), we have to restrict schema modifications.
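To make the distinction concrete, here is a minimal sketch using plain Avro (the record schemas are invented for illustration). Deleting a field is BACKWARD compatible in the Confluent sense, i.e. a new-schema reader can consume old-schema data; but the reverse direction, which Hudi hits whenever files written with the old schema must be read back or rewritten under the new schema, is not:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class FieldDeletionCheck {
  public static void main(String[] args) {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"email\",\"type\":\"string\"}]}");
    // v2 deletes the "email" field.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"}]}");

    // New reader, old data: the extra writer field is skipped -> COMPATIBLE.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType());

    // Old reader, new data: "email" is missing and has no default -> INCOMPATIBLE.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v1, v2).getType());
  }
}
```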

tooptoop4 commented 3 years ago

@prashantwason does hudi support adding new columns or changing existing column types (i.e. long to string)?

prashantwason commented 3 years ago

Yes, adding new columns (fields in the schema) is supported as long as they have default values specified. This is because the new fields will not be present in older records and hence cannot be populated directly when reading records from existing data; the default value fills that gap.
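As an illustration of why the default matters, here is a small self-contained Avro sketch (schemas and field names invented): a record written with the old schema is read back with the new schema, and the added field is populated from its default:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class DefaultValueDemo {
  public static void main(String[] args) throws Exception {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"}]}");
    // The new field carries a default, so older records stay readable.
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"city\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Serialize a record using the old schema.
    GenericRecord rec = new GenericData.Record(oldSchema);
    rec.put("id", 1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(rec, enc);
    enc.flush();

    // Deserialize with the new schema: the missing field takes its default.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord evolved =
        new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, dec);
    System.out.println(evolved);  // {"id": 1, "city": "unknown"}
  }
}
```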

The following field type changes are allowed (old_type -> new_type):

  * int -> long
  * int -> float
  * int -> double
  * long -> double
  * float -> double
  * bytes -> string
  * string -> bytes

Code references:

  * https://github.com/apache/hudi/pull/2350/files
  * https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/SchemaCompatibility.java#L359
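For a quick check of those promotion rules (again with invented schemas, using the same SchemaCompatibility class referenced above): widening int to long is accepted, while the narrowing direction is rejected:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class PromotionCheck {
  public static void main(String[] args) {
    Schema asInt = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"count\",\"type\":\"int\"}]}");
    Schema asLong = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"count\",\"type\":\"long\"}]}");

    // Widening int -> long: a long reader can take int data -> COMPATIBLE.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(asLong, asInt).getType());

    // Narrowing long -> int: an int reader cannot take long data -> INCOMPATIBLE.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(asInt, asLong).getType());
  }
}
```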

nsivabalan commented 3 years ago

@prashantwason: In light of this ticket, do you think we can update our documentation with respect to schema evolution? If you don't mind, can you take it up and fix our documentation? https://issues.apache.org/jira/browse/HUDI-1548

nsivabalan commented 3 years ago

CC @n3nash. This is a schema evolution related ask.

n3nash commented 3 years ago

Closing this ticket; docs will be added as part of the JIRA. @brandon-stanley, feel free to re-open if needed.

pratyakshsharma commented 2 years ago

The reason why Hudi does not support field deletions is explained in the troubleshooting guide: https://hudi.apache.org/docs/troubleshooting#caused-by-orgapacheparquetioinvalidrecordexception-parquetavro-schema-mismatch-avro-field-col1-not-found