@prashantwason @nbalajee @satishkotha: Can you please look into this?
I think the distinction is in UPDATE use-cases. Consider this scenario:
t1: Insert with Schema 1: file1.parquet is created and records have schema1
t2: Update with Schema 2: Suppose schema 2 has one field deleted and a single record is being updated. This will lead to file1.parquet being read and re-written (after updating the single record) into file2.parquet, but all records in file2.parquet would no longer have the deleted field.
Another scenario is possible where the deleted field is later added back with a different, incompatible "type" (e.g. an int field was deleted and another field with the same name but "string" type was added). This schema will have issues reading historical data within the dataset which was written with the older schema.
If you want to delete a field within a Hudi dataset, it may be simpler to copy the dataset using the new schema.
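To make the second scenario concrete, here is a minimal sketch (the record and field names are invented) that uses Avro's SchemaCompatibility check to show why re-adding a deleted int field as a string makes historical files unreadable with the new schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

public class DeletedFieldExample {
  public static void main(String[] args) {
    // Schema used by the earlier writes: "count" is an int.
    Schema oldWriterSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"int\"}]}");

    // Schema after "count" was deleted and later re-added as a string.
    Schema newReaderSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"string\"}]}");

    // Historical files were written with the old schema; reading them with
    // the new schema is rejected because int cannot be promoted to string.
    SchemaPairCompatibility result = SchemaCompatibility
        .checkReaderWriterCompatibility(newReaderSchema, oldWriterSchema);
    System.out.println(result.getType()); // INCOMPATIBLE
  }
}
```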
Thanks for your response @prashantwason.
Does this mean that the implementation of maintaining schemas within Hudi is more of a wrapper around Avro which has an additional check to ensure that there are no deletes, because of the scenarios you listed above? I just want to confirm because the documentation that I've previously linked states that deletes are supported for BACKWARDS compatible schemas within Avro:
Cheers,
Brandon
That's correct.
HUDI does not have a full schema management system. The schema to be used is provided at the time of the write, where we validate that the schema being used for the current write is compatible with the existing schema (from the previous writes). Hence, HUDI schema management is very simplistic compared to the documentation you have referred to.
In producer-consumer systems, schema compatibility is a simpler job: by upgrading the producer and consumer code with newer schemas, the schema can be changed, since all new data will be generated using a schema that both understand and there is no historical data with an older schema version to be processed any longer. But within HUDI there are always versions of data saved with older schemas, and to continue to provide features like incremental reads (which read data over a time range) and updates (old data can be changed), we have to restrict schema modification.
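Roughly, that write-time gate amounts to something like the sketch below. This is illustrative only, not Hudi's actual code path; the helper name is hypothetical and the use of Avro's SchemaValidatorBuilder is my assumption:

```java
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

public class WriteSchemaCheck {
  /**
   * Rejects a write if its schema cannot read data written with any of the
   * schemas used by previous commits (hypothetical helper, not a Hudi API).
   */
  public static void validateWriteSchema(Schema newWriteSchema, List<Schema> previousSchemas)
      throws SchemaValidationException {
    SchemaValidator validator = new SchemaValidatorBuilder()
        .canReadStrategy()   // the new schema must be able to read old data
        .validateAll();      // ...against every previous schema, not just the latest
    validator.validate(newWriteSchema, previousSchemas); // throws if incompatible
  }
}
```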
@prashantwason does Hudi support adding new columns or changing existing column types (i.e. long to string)?
Yes, adding new columns (fields in the schema) is supported as long as they have default values specified. This is because the new fields will not be present in older records and hence cannot be populated directly when reading records from existing data.
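For illustration, here is a minimal sketch (the record and field names are made up) showing that an added field is only read-compatible with older files when it carries a default:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class AddColumnExample {
  public static void main(String[] args) {
    // Original table schema.
    Schema v1 = SchemaBuilder.record("Trip").fields()
        .requiredString("id")
        .requiredDouble("fare")
        .endRecord();

    // Evolved schema: a new "tip" column with a default, so records written
    // with v1 can still be read (the default fills the missing field).
    Schema v2 = SchemaBuilder.record("Trip").fields()
        .requiredString("id")
        .requiredDouble("fare")
        .name("tip").type().doubleType().doubleDefault(0.0)
        .endRecord();

    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE
  }
}
```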
The following field type changes are allowed (old_type -> new_type):
int -> long
int -> float
int -> double
long -> double
float -> double
bytes -> string
string -> bytes
Code references:
https://github.com/apache/hudi/pull/2350/files
https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/SchemaCompatibility.java#L359
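A quick way to see those promotion rules in action with Avro's own compatibility check (a sketch; the single-field record shape is invented):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class TypePromotionExample {
  // Builds a single-field record schema with the given primitive field type
  // (helper for this demo only).
  static Schema recordWith(String fieldType) {
    return new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
      + "{\"name\":\"v\",\"type\":\"" + fieldType + "\"}]}");
  }

  public static void main(String[] args) {
    // Widening the field (int -> long) keeps older files readable.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(recordWith("long"), recordWith("int"))
        .getType()); // COMPATIBLE

    // Narrowing (long -> int) is rejected.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(recordWith("int"), recordWith("long"))
        .getType()); // INCOMPATIBLE
  }
}
```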
@prashantwason: In light of this ticket, do you think we can update our documentation w.r.t. schema evolution? If you don't mind, can you take it up and fix our documentation? https://issues.apache.org/jira/browse/HUDI-1548
CC @n3nash. Schema evolution related ask.
Closing this ticket, docs will be added as part of the JIRA. @brandon-stanley Feel free to re-open if needed
Reason why Hudi does not support field deletions - https://hudi.apache.org/docs/troubleshooting#caused-by-orgapacheparquetioinvalidrecordexception-parquetavro-schema-mismatch-avro-field-col1-not-found
Hi Hudi Team! I have a question about field deletions/schema evolution. The FAQ Documentation states the following:
While reading the Confluent Documentation that is linked above, I noticed that "Delete fields" is an "allowed change" for BACKWARDS compatible schemas. I assume that the Avro schema that is tracked within Hudi is BACKWARDS compatible and therefore should allow field deletions, but the FAQ Documentation states otherwise. Can you please clarify the following: