apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Write audit publish #6259

Closed · melin closed this issue 1 year ago

melin commented 1 year ago

dremio.com/subsurface/write-audit-publish-pattern-via-apache-iceberg/


fengjian428 commented 1 year ago

I feel this is more like a snapshot switch and publishing procedure. Do you mean Hudi cannot support this?

melin commented 1 year ago

> I feel this is more like a snapshot switch and publishing procedure. Do you mean Hudi cannot support this?

After the data snapshot is generated, it is not visible to users, and the generated data can be queried only after the data quality test is passed.

Not sure if hudi supports this feature?

fengjian428 commented 1 year ago

try set 'as.of.instant' to time travel?

melin commented 1 year ago

> try set 'as.of.instant' to time travel?

After a Hudi commit completes, it is immediately queryable as the latest snapshot. In the data quality inspection scenario, the latest snapshot should not be queryable before the data quality SQL checks have run; a snapshot should be published and become visible only after the quality checks pass.

fengjian428 commented 1 year ago

`alter table <table_name> SET SERDEPROPERTIES ('as.of.instant'='2022xxxxxxxx');` If you set this on the Hive table, Spark will do time travel automatically. WDYT, would this work for you?
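
For reference, the same option can also be passed on a plain Spark read rather than through Hive table properties. A minimal sketch, assuming Hudi 0.9.0+ where the `as.of.instant` read option exists; the table path and instant value are placeholders:

```scala
// Query the table state as of a given commit instant (time travel read).
val asOfDf = spark.read.
  format("hudi").
  option("as.of.instant", "20220805000000").   // placeholder commit instant
  load("/path/to/hudi_table")                  // placeholder base path

asOfDf.createOrReplaceTempView("hudi_table_asof")
spark.sql("select count(*) from hudi_table_asof").show()
```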

codope commented 1 year ago

Interesting! If I understand correctly, @fengjian428 's suggestion is to go back to a previous snapshot if the latest one is corrupt. Hudi already supports time travel. However, @melin 's suggestion is to not even publish the snapshot if it is found to be corrupt.

Shouldn't savepoint/restore along with a staging area be sufficient to support this feature? It's about how to control the visibility of a snapshot. Hudi metadata can be first written to a staging area before being published. Audit tool runs ETL and validations with the staging metadata. If all is well, then the staging metadata changes are applied to production.
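
For context, a hedged sketch of the savepoint/restore building blocks mentioned above, using Hudi's Spark SQL call procedures. Procedure names and parameters are as in recent Hudi releases and may differ by version; the staging/publish orchestration itself is the missing piece under discussion, not an existing feature:

```scala
// Savepoint the table state at a known-good commit before the audited write
// (table name and instant are placeholders).
spark.sql("call create_savepoint(table => 'my_table', commit_time => '20220805000000')")

// ... run the audit/validation queries against the table ...

// If the audit fails, restore the table to the savepointed state instead of publishing.
spark.sql("call rollback_to_savepoint(table => 'my_table', instant_time => '20220805000000')")
```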

@melin Hudi has all the necessary abstractions to support this feature. Can you please explain your use case and functional requirements in more detail? Are you also proposing abstractions for an audit tool? I think this could be useful to other users as well and worthy of an RFC. cc @prasannarajaperumal @vinothchandar @nsivabalan @xushiyan

melin commented 1 year ago

For data written by Spark SQL, we want to avoid quality problems (for example, a field with an invalid value, or a data volume that fluctuates greatly compared to the previous cycle) by configuring data quality rules. Spark SQL writes the data and then executes the quality rules. If the rules pass, a 'call procedure' command publishes the written data so it can be queried. If there is a data quality problem, a 'call procedure' command deletes the generated data instead.
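
Roughly, the requested flow might look like the sketch below. The publish/discard procedures shown are hypothetical and do not exist in Hudi today; they only illustrate the feature request, and the table and rule names are made up:

```scala
// 1. Spark SQL writes the data; under this proposal the new snapshot is NOT yet visible to readers.
spark.sql("insert into sales_daily select * from sales_staging")

// 2. Run the configured data quality rules. How the rules see the pending (unpublished) snapshot
//    is exactly what this feature would have to define.
val invalidRows = spark.sql(
  "select count(*) from sales_daily where amount < 0").first().getLong(0)

// 3. Publish or discard via a call procedure -- both procedure names are HYPOTHETICAL.
if (invalidRows == 0) {
  spark.sql("call publish_pending_commit(table => 'sales_daily')")   // make the snapshot queryable
} else {
  spark.sql("call discard_pending_commit(table => 'sales_daily')")   // delete the generated data
}
```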

fengjian428 commented 1 year ago

> Interesting! If I understand correctly, @fengjian428 's suggestion is to go back to a previous snapshot if the latest one is corrupt. Hudi already supports time travel. However, @melin 's suggestion is to not even publish the snapshot if it is found to be corrupt.

yeah, it is like implementing the same function in different ways

> Shouldn't savepoint/restore along with a staging area be sufficient to support this feature? It's about how to control the visibility of a snapshot. Hudi metadata can be first written to a staging area before being published. Audit tool runs ETL and validations with the staging metadata. If all is well, then the staging metadata changes are applied to production.

I think so. Like I said in the last monthly sync call, I've implemented a snapshot view at our company based on the savepoint feature, although it is for a different scenario. @melin FYI: https://docs.google.com/presentation/d/1xypcr9onk0ogpj1lrPFQ3ERiXQpTw9NZGW4nok5i80I/edit#slide=id.g13ec3137431_2_195

> @melin Hudi has all the necessary abstractions to support this feature. Can you please explain your use case and functional requirements in more detail? Are you also proposing abstractions for an audit tool? I think this could be useful to other users as well and worthy of an RFC. cc @prasannarajaperumal @vinothchandar @nsivabalan @xushiyan

I can be a co-author for this RFC if you want to create one. Basically, we need to point the Hudi table to a specific savepoint, but I think we need to make some enhancements to MOR's savepoint, since the savepoint currently seems to tag only the base files.

nsivabalan commented 1 year ago

If I am not wrong, Hudi already has a data quality validator that you can run before completing a commit. If the validation fails, the commit will abort. Would that work for your case? See https://hudi.apache.org/releases/release-0.9.0#writer-side-improvements and check for "pre commit validator framework" in that link.
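
For illustration, a hedged sketch of wiring that framework into a Spark datasource write. The validator class and config keys are taken from the 0.9.0 release notes and docs, but exact option names may vary by Hudi version, and the table name, key fields, and paths are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// Placeholder source DataFrame; assumes an existing SparkSession `spark`.
val df = spark.table("source_table")

// Write that aborts its commit if the validation query does not return the expected value
// (here: zero rows with a null price).
df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  // validators to run before the commit is finalized
  option("hoodie.precommit.validators",
    "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
  // validation query and expected single result, separated by '#'
  option("hoodie.precommit.validators.single.value.sql.queries",
    "select count(*) from <TABLE_NAME> where price is null#0").
  mode(SaveMode.Append).
  save("/path/to/hudi_table")
```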

vingov commented 1 year ago

@nsivabalan is correct. We needed a write-audit-publish pattern for building our lakehouse for derived datasets at Uber; as part of that effort, the Uber Hudi team developed pre-commit validation support for Hudi.

xushiyan commented 1 year ago

> If I am not wrong, Hudi already has a data quality validator that you can run before completing a commit. If the validation fails, the commit will abort. Would that work for your case? See https://hudi.apache.org/releases/release-0.9.0#writer-side-improvements and check for "pre commit validator framework" in that link.

This is correct. If there is a need for a full-fledged data quality tool, feel free to raise an RFC discussion on the dev email list. Closing this, as the write-audit-publish pattern can be fulfilled with the existing pre-commit validator feature.