dlpzx commented 6 months ago

Is your feature request related to a problem? Please describe.

Data.all currently offers some profiling capabilities for tables based on the open source AWS project deequ. Users can trigger a predefined deequ profiling Glue job that will generate several profiling metrics on the table. But there are some features that are recurrently asked byusers and are not included into data.all:

The profiling job cannot be edited by users.
Users cannot define data quality rules for datasets
Data quality metrics are not stored or used to classify and rank “the best data”

Describe the solution you'd like I want to easily define data quality rules in a nice UI even if I am not familiar with the Glue Data Quality Definition Language.

I would like data.all to include an end-to-end workflow that allows users to define and supervise data quality rules using AWS native services such as Glue Data Quality.

Describe alternatives you've considered We considered other DQ tools, but AWS Glue Data Quality integrates nicely with our current Glue Databases

Additional context We will use an internal AWS artefact that translates human-readable data into Data Quality Definition Lamguage

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

noah-paige commented 6 months ago

Adding to backlog and evaluating high level design considerations for the proposed new feature

dlpzx commented 5 months ago

DESIGN

High level design

We will create the user experince and in the backend adapt the internal AWS artefact to data.all. At the end customers can use a nice UI to create DQ rules that are translated into DQDL using data.all backend.

User experience

Dataset owners create Datasets and tables in data.all UI
To create a new DQ rule, dataset owners go to the Table/Quality tab:
1. click “Add new rule” (as in the screenshot below)
2. A modal view opens with list of available Glue DQ rules, select a rule and
3. different parameters appear, user fills the input parameters = e.g. columns, dq_expression and press “create” pop-up closes.
4. This allows user to create a rule for a table and select multiple columns for that rule.
Dataset owners can see a list of all the DQ Rules applied on the Table in the Table/Quality tab. (as in the screenshot below)
Dataset owners click “Evaluate Rules” to evaluate rules. In a second phase they will be able to deactivate rules. It triggers a compute resource that evaluates the DQ rules (this boto3 call).
Dataset owners can see the status, last evaluation time and Quality score in the UI (in the list of rules). Dataset owners can use a data.all provided redirect link to open the DQ Glue console in AWS for the specific table and see all the results.
Dataset owners can edit a DQ Rule from the list, clicking on a “pencil icon” next to the rule item in the list table A modal view opens with the fields that can be edited.
Dataset owner can delete a DQ Rule from the list, clicking on a “trash icon” next to the rule item in the list table.

UI views

Data Quality tab in Table view→ list (SEE MOCK-UP BELOW) + Results in the top (Status, score, last execution time)+ evaluateRules button
Modal window to create a new DQ Rule
Modal window to edit DQ Rule (re-use the same one as create)

API calls

createDataQualityRule - create metadata for rule
updateDataQualityRule - update some metadata (edit)
deleteDataQualityRule- delete metadata for rule
listTableDataQualityRules - lists all rules metadata for a single table
evaluateDataQualityRules - triggers asynchronous evaluation of all rules
getTableDataQualityResuls- triggers asynchronous polling of results (Status, score, last execution time)

Asynchronous handlers

The API evaluateDataQualityRules triggers the data.all Worker Lambda. We will create a worker method (handler) that uses the APG framework and creates the data quality ruleset in the Environment account and triggers it using the datasetIAMRole and sets as result location: environment-bucket/dataquality/results/team/ruleUri

The API getTableDataQualityResuls riggers the data.all Worker Lambda. It triggers a worker method (handler) that calls several Glue APIs to return the DataQuality score, last evaluation and status of the ruleset created by data.all for that table.

RDS tables

glue_data_quality_rule table - contains the list of rules available in AWS Glue. At the moment there is not an API to get the list of Glue rules.
table_data_quality_rule table - contains the data.all defined list of rules. They reference a single glue-rule, and apply to a single table. Each table can have multiple rules.
table_data_quality_rule_column table - contains the columns that are affected by a rule. For a single rule there might be multiple columns where the rule is applied. See the examples, it will be clearer.

class GlueDataQualityRule(Base): # Rules from Glue
    __tablename__ = 'glue_data_quality_rule'
    glueRuleUri = Column(String, primary_key=True, default=utils.uuid('dqgluerule')) #Instead of rule_id
    name = Column(String, nullable=False) # rule_name
    description = Column(String, nullable=False) # rule_desc
    ruleSyntax = Column(String, nullable=False) # rule_syntax

class TableDataQualityRule(Base): # Rules in data.all
    __tablename__ = 'table_data_quality_rule'
    ruleUri = Column(String, primary_key=True, default=utils.uuid('dqrule')) #Instead of ruleset_id
    glueRuleUri = Column(String, ForeignKey('glue_data_quality_rule.glueRuleUri', ondelete='CASCADE'), nullable=False)
    datasetUri = Column(String, nullable=False)  # data.all DATASET identifier - might be needed for application permissions
    tableUri = Column(String, nullable=False) # data.all table identifier
    GlueDatabaseName = Column(String, nullable=False)
    GlueTableName = Column(String, nullable=False)

class TableDataQualityRuleColumn(Base): # Columns where a rule is applied
    __tablename__ = 'table_data_quality_rule_column'
    uri = Column(String, primary_key=True, default=utils.uuid('dqrulecol'))
    ruleUri = Column(String, ForeignKey('table_data_quality_rule.ruleUri', ondelete='CASCADE'), nullable=False)
    targetColumnNameOne = Column(String, nullable=True) # data.all column identifier (from DatasetTableColumn)
    targetColumnNameTwo = Column(String, nullable=True) # data.all column identifier (from DatasetTableColumn)
    dqExpression = Column(String, nullable=True) # dq_expression

Effort estimation

Estimations for a new developer in data.all. Ramp-up is not considered in the estimations.

RDS tables, Graphql models and basic CRUD operations → 5 days
Frontend views → 3 days
Data quality creator handler → 1 day
Resuls handler → 1day
Tests → 5 days
Buffer for validation, improvements → 5 days

Total: 20days

Key Design Decisions

Should we create DQ rulesets using API calls (AWS SDK for Python) or using CDK?

The only resource created here are data quality rulesets. The decision of whether to use API or IaC depends on the lifecycle management of the resource. In other words, how do we update and clean-up rulesets? In this particular case where we are dynamically modifying the rules of a ruleset it is easier to do it directly with an API (in the async Worker Lambda). When a database or a table is deleted in AWS the DQ rulesets will be deleted, which means that the clean-up is handled in the clean-up of those resources. In short, there is no downside of using SDK and it makes it easier to handle.

Where should we store the glue_data_quality_rules table?

We should make the DataQualityHandler already built in the internal artefact (currently called DataQuality class) code database agnostic. We could achieve that by passing the necessary glue_data_quality_rules data as dicts instead of fetching the data inside the handler functions. That would allow other projects to easily adopt the “DQDL translator” class which is the main value added of the internal artefact.

Having said that, the data in glue_data_quality_rules is READ-ONLY static information common to all users. It makes sense to have this table stored in a single table in the data.all central account. Because data.all already uses RDS in the central account, the simplest approach would be to create a new table in RDS.

Where should we store the data_quality_rules table?

It would be great if, as explained above, DataQualityHandler (current called DataQuality class) is database agnostic.

In contrast todata_quality_rules , data_quality_ruleset table is modified with every new or modified data quality job. Each team will create multiple rows on this table.

With those points in mind, we can consider 2 alternatives: a) RDS table in central data.all account

(+) simplicity, already deployed
(+) EASY TO READ INFO in API calls
(-) end-user cannot access the ruleset info table
(-) Dynamic number of columns in a ruleset --->We could have a STAR tables with ruleset_columns matching the rulesetUr

b) DynamoDB table in each data.all environment account

(+) more suitable for unstructured data
(-) NEW infra to operate :(

Because the information can be stored in RDS, we decided to go for the simplest solution.

Where should we store the results data?

We will store the results in the Environment S3 bucket deployed in each of the accounts. We will segregate teams by prefix and restrict permissions. The S3 results location is specified at the moment a ruleset is evaluated.

Next Steps

Notifications on success/failure of rules
schedule evaluation of rules
sort datasets based on data quality

anmolsgandhi commented 5 months ago

190 We should consider implementing this functionality as we think through DQ - this will be very helpful sort datasets with good data quality which would be helpful for end users

dlpzx commented 5 months ago

190 We should consider implementing this functionality as we think through DQ - this will be very helpful sort datasets with good data quality which would be helpful for end users

Absolutely! We want to store Status, score, last execution time. Score is what we would use for sorting. I will add the sorting capabilities as next step in the design above

zsaltys commented 5 months ago

@dlpzx I think we could expand on:

Dataset owners can see the status, last evaluation time and Quality score in the UI (in the list of rules). Dataset owners can use a data.all provided redirect link to open the DQ Glue console in AWS for the specific table and see all the results.

Could we see mockups for this? I think this is by far the most valuable part of all of this is to see in data.all what is the actual quality of the dataset.

@anmolsgandhi @dlpzx

I would personally even go further and say that I'm not sure we should be managing the rules of glue data quality in data.all? What value are we providing over just logging into AWS console? Are we not just making a poorer UI version of the AWS console? I think data.all should stay away from replicating UIs that exist directly in AWS console. I think we may be rushing a bit too much here and need to think more what valuable and meaningful change we can make in data.all to help users understand the quality of published datasets. I can see value in data.all setting up everything needed so that users only have to log in to AWS console and create rules there ... And then data.all can show a combined score or a report of failing rules etc as that is useful to anyone who is interested in the data product and they wouldn't have access to the owners AWS account.

data-dot-all / dataall

Data Quality rules using AWS Glue Data Quality #1232

DESIGN

High level design

User experience

UI views

API calls

Asynchronous handlers

RDS tables

Effort estimation

Key Design Decisions

Should we create DQ rulesets using API calls (AWS SDK for Python) or using CDK?

Where should we store the glue_data_quality_rules table?

Where should we store the data_quality_rules table?

Where should we store the results data?

Next Steps

190 We should consider implementing this functionality as we think through DQ - this will be very helpful sort datasets with good data quality which would be helpful for end users

190 We should consider implementing this functionality as we think through DQ - this will be very helpful sort datasets with good data quality which would be helpful for end users