WeBankFinTech / DataSphereStudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
https://github.com/WeBankFinTech/DataSphereStudio-Doc
Apache License 2.0

[Feature] Datachecker transformation #1094

Closed: wxyn closed this issue 1 year ago

wxyn commented 1 year ago

Problem Description

DataChecker currently verifies data by polling the Hive metadata database. However, because the data itself is written to Hadoop, a metadata-only check can be inconsistent with the actual state of the data.

Description

The original datachecker partition verification is retained, and users can choose whether to additionally enable Hadoop-layer data verification. Two new parameters are involved.

Application-layer switch parameter:

# JobType common.properties
job.eventchecker.qualitis.switch=true

User-layer switch parameter:

qualitis.check=true

Hadoop-layer data verification is performed only when both parameters are set to true and the corresponding table partition has been deleted by DOPS.
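As a minimal sketch of the two-switch gate described above, the check could look like the following. The function names are hypothetical and not taken from the DataChecker code; the sketch only covers the two switches, not the DOPS deletion condition.

```python
def parse_bool(value):
    """Interpret a properties-style string such as 'true' or ' True '."""
    return str(value).strip().lower() == "true"


def hadoop_check_switches_on(app_switch, user_switch):
    """Return True only when both switches are set to true.

    app_switch  -- raw value of job.eventchecker.qualitis.switch (application layer)
    user_switch -- raw value of qualitis.check (user layer)
    Both function names are illustrative assumptions.
    """
    return parse_bool(app_switch) and parse_bool(user_switch)
```

With this sketch, turning off either switch is enough to skip Hadoop-layer verification.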

Logic details: the user sets data.object at the table level. After the Hive meta or Mask DB check, DataChecker queries the dops_clean_task_list table in the DOPS DB by db_name and tb_name. For a non-partitioned table, if the query with part_name is null returns a non-zero number of results, the whole table has been deleted and further quality verification is required. For a partitioned table, if the query with part_name is not null returns a non-zero number of results, some partitions have been deleted; in that case quality verification is skipped and the data.object verification is deemed to have failed.
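The branching above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: sqlite3 stands in for the DOPS DB, the function names and return labels are hypothetical, the caller is assumed to already know whether the table is partitioned, and the fall-through case (no matching DOPS record) is assumed to keep the result of the existing metadata check. Only the table name dops_clean_task_list and the columns db_name, tb_name, and part_name come from the description.

```python
import sqlite3


def count_clean_tasks(conn, db_name, tb_name, partition_level):
    """Count DOPS clean-task records for a table.

    partition_level=True  -> rows where part_name IS NOT NULL (deleted partitions)
    partition_level=False -> rows where part_name IS NULL (whole table deleted)
    """
    null_test = "IS NOT NULL" if partition_level else "IS NULL"
    sql = (
        "SELECT COUNT(*) FROM dops_clean_task_list "
        "WHERE db_name = ? AND tb_name = ? AND part_name " + null_test
    )
    return conn.execute(sql, (db_name, tb_name)).fetchone()[0]


def decide_after_metadata_check(conn, db_name, tb_name, is_partitioned):
    """Return a hypothetical label for the next step after the metadata check."""
    if is_partitioned:
        # Partitioned table with deleted partitions: skip quality verification, fail.
        if count_clean_tasks(conn, db_name, tb_name, partition_level=True) > 0:
            return "FAIL"
    elif count_clean_tasks(conn, db_name, tb_name, partition_level=False) > 0:
        # Non-partitioned table deleted as a whole: further quality verification.
        return "QUALITY_CHECK"
    # Assumption: with no DOPS record, the existing metadata check result stands.
    return "METADATA_RESULT"
```

Passing the table and database names as bound parameters keeps the query safe from injection; only the fixed NULL test is concatenated into the SQL string.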

A new parameter, 'Enable table or partition data validation', is added to the datachecker node attributes so that users can configure whether Hadoop-layer data validation is enabled. The parameter is selected from a dropdown and defaults to true, meaning Hadoop-layer data validation is enabled; false means it is disabled. For existing tasks, the parameter is treated as true at scheduling time. If a user clicks an existing datachecker node to view or edit its attributes, the parameter value is displayed as true.
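One way to obtain the default-to-true behaviour for existing nodes is to treat a missing value as true when reading the node properties. The sketch below assumes the dropdown value is stored under the qualitis.check key and that existing nodes simply lack that key; neither detail is confirmed by the description, and the function name is hypothetical.

```python
def resolve_data_validation_flag(node_props):
    """Return the effective 'Enable table or partition data validation' value.

    Assumption: the value is stored under 'qualitis.check', and nodes created
    before the parameter existed have no such key, so they resolve to True.
    """
    raw = node_props.get("qualitis.check")
    if raw is None or str(raw).strip() == "":
        return True  # default for existing nodes and unset values
    return str(raw).strip().lower() == "true"
```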

Use case

No response

Solutions

No response

Anything else

No response
