Closed christianfelicite closed 3 years ago
https://towardsdatascience.com/7-steps-to-ensure-and-sustain-data-quality-3c0040591366
Below are the 5 main criteria used to measure data quality:
Accuracy: whatever the data describes, it needs to be accurate.
Relevancy: the data should meet the requirements for the intended use.
Completeness: the data should not have missing values or missing records.
Timeliness: the data should be up to date.
Consistency: the data should have the expected format and be cross-referenceable with the same results.
There are 7 essential steps to make that happen:
In most cases, bad data comes from data receiving. In an organization, the data usually comes from other sources outside the control of the company or department. It could be the data sent from another organization, or, in many cases, collected by third-party software. Therefore, its data quality cannot be guaranteed, and a rigorous data quality control of incoming data is perhaps the most important aspect among all data quality control tasks. A good data profiling tool then comes in handy; such a tool should be capable of examining the following aspects of the data:
Data format and data patterns
Data consistency on each record
Data value distributions and anomalies
Completeness of the data
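The profiling checks listed above can be sketched in a few lines of code. This is a minimal illustration, not a real profiling tool: the function name, rule format, and the z-score threshold for anomalies are all hypothetical.

```python
import re
from statistics import mean, stdev

def profile(records, pattern_rules, required_fields, z_threshold=3.0):
    """Return data-quality findings for a list of record dicts (illustrative sketch)."""
    findings = {"pattern_violations": [], "missing_fields": [], "outliers": []}

    for i, rec in enumerate(records):
        # Completeness: flag records missing required fields or holding empty values.
        for f in required_fields:
            if rec.get(f) in (None, ""):
                findings["missing_fields"].append((i, f))
        # Format and patterns: flag fields that do not match their expected regex.
        for f, rx in pattern_rules.items():
            v = rec.get(f)
            if v is not None and not re.fullmatch(rx, str(v)):
                findings["pattern_violations"].append((i, f, v))

    # Value distributions and anomalies: flag numeric values far from the mean.
    numeric = {}
    for rec in records:
        for f, v in rec.items():
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                numeric.setdefault(f, []).append(v)
    for f, vals in numeric.items():
        if len(vals) > 2 and stdev(vals) > 0:
            m, s = mean(vals), stdev(vals)
            for i, rec in enumerate(records):
                v = rec.get(f)
                if isinstance(v, (int, float)) and abs(v - m) > z_threshold * s:
                    findings["outliers"].append((i, f, v))
    return findings

records = [
    {"id": "A001", "amount": 10.0},
    {"id": "A002", "amount": 12.0},
    {"id": "bad",  "amount": None},  # violates the pattern and is incomplete
]
report = profile(records, {"id": r"A\d{3}"}, ["id", "amount"])
```

In practice such checks would run automatically on every incoming batch and feed the alerts and KPI dashboard described below.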
It is also essential to automate the data profiling and data quality alerts so that the quality of incoming data is consistently controlled and managed whenever it is received — never assume incoming data is as good as expected without profiling and checks. Lastly, each piece of incoming data should be managed using the same standards and best practices, and a centralized catalog and KPI dashboard should be established to accurately record and monitor the quality of the data.
Duplicate data refers to situations where the whole or part of a dataset is created from the same data source, using the same logic, but by different people or teams, likely for different downstream purposes. When duplicate data is created, it very likely falls out of sync and leads to different results, with cascading effects throughout multiple systems or databases. In the end, when a data issue arises, it becomes difficult and time-consuming to trace the root cause, not to mention fix it.
In order for an organization to prevent this from happening, a data pipeline needs to be clearly defined and carefully designed in areas including data assets, data modeling, business rules, and architecture. Effective communication is also needed to promote and enforce data sharing within the organization, which will improve overall efficiency and reduce any potential data quality issues caused by data duplications. This gets into the core of data management, the details of which are beyond the scope of this article. On a high level, there are 3 areas that need to be established to prevent duplicate data from being created:
A data governance program, which clearly defines the ownership of a dataset and effectively communicates and promotes dataset sharing to avoid any department silos.
Centralized data assets management and data modeling, which are reviewed and audited regularly.
Clear logical design of data pipelines at the enterprise level, which is shared across the organization.
With today’s rapid changes in technology platforms, solid data management and enterprise-level data governance are essential for future successful platform migrations.
An important aspect of having good data quality is satisfying the requirements and delivering the data to clients and users for its intended purpose. This is not as simple as it first sounds, because:
It is not easy to properly present the data. Truly understanding what a client is looking for requires thorough data discoveries, data analysis, and clear communications, often via data examples and visualizations.
The requirement should capture all data conditions and scenarios — it is considered incomplete if any dependency or condition is not reviewed and documented.
Clear documentation of the requirements, with easy access and sharing, is another important aspect, which should be enforced by the Data Governance Committee.
The role of the Business Analyst is essential in requirements gathering. Their understanding of the clients, as well as of current systems, allows them to speak both sides' languages. After gathering the requirements, business analysts also perform impact analysis and help come up with test plans to make sure the data produced meets the requirements.
An important feature of relational databases is the ability to enforce data integrity using techniques such as foreign keys, check constraints, and triggers. When the data volume grows, along with more and more data sources and deliverables, not all datasets can live in a single database system. The referential integrity of the data therefore needs to be enforced by applications and processes, which need to be defined by data governance best practices and included in the design for implementation. In today's big data world, referential enforcement has become more and more difficult. Without the mindset of enforcing integrity in the first place, the referenced data could become out of date, incomplete, or delayed, which then leads to serious data quality issues.
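When datasets span multiple systems, the foreign-key constraint a database would normally enforce has to move into application code. A minimal sketch of such a check, with illustrative names (`customers`, `orders`, `customer_id` are assumptions, not from the source):

```python
def check_referential_integrity(child_rows, fk_field, parent_keys):
    """Return child rows whose foreign key has no match in the parent key set.

    With data spread across systems, this kind of in-pipeline check stands in
    for a database-level foreign-key constraint.
    """
    parent_keys = set(parent_keys)
    return [row for row in child_rows if row.get(fk_field) not in parent_keys]

customers = ["C1", "C2"]
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C9"},  # dangling reference: no such customer
]
orphans = check_referential_integrity(orders, "customer_id", customers)
```

A real pipeline would run this kind of validation before loading, and either quarantine the orphan rows or raise an alert, per the governance rules.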
For a well-designed data pipeline, the time to troubleshoot a data issue should not increase with the complexity of the system or the volume of the data. Without the data lineage traceability built into the pipeline, when a data issue happens, it could take hours or days to track down the cause. Sometimes it could go through multiple teams and require data engineers to look into the code to investigate.
Data Lineage traceability has 2 aspects:
Meta-data: the ability to trace through the relationships between datasets, data fields and the transformation logic in between.
Data itself: the ability to trace a data issue quickly to the individual record(s) in an upstream data source.
Meta-data traceability is an essential part of effective data governance. It is enabled by clear documentation and modeling of each dataset from the beginning, including its fields and structure. When a data pipeline is designed and enforced by data governance, the meta-data traceability should be established at the same time. Today, meta-data lineage tracking is a must-have capability for any data governance tool on the market, which makes it possible to store and trace through datasets and fields with a few clicks, instead of having data experts go through documents, databases, and even programs.
Data traceability is more difficult than meta-data traceability. Below are some common techniques to enable it:
Trace by unique keys of each dataset: This first requires that each dataset have one or a group of unique keys, which are then carried down to the downstream datasets through the pipeline. However, not every dataset can be traced by unique keys. For example, when a dataset is aggregated, the keys from the source are lost in the aggregated data.
Build a unique sequence number, such as a transaction identifier or record identifier, when there are no obvious unique keys in the data itself.
Build link tables when there are many-to-many relationships, rather than 1-to-1 or 1-to-many ones.
Add a timestamp (or version) to each data record to indicate when it was added or changed.
Log data changes in a log table, recording the value before the change and the timestamp when the change happened.
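The last two techniques — timestamping records and logging the before-value of every change — can be sketched together. Field and function names here are illustrative assumptions:

```python
from datetime import datetime, timezone

def log_change(log, record_id, field, old_value, new_value):
    """Append an audit entry capturing the value before the change
    and the timestamp when the change happened (a minimal sketch)."""
    log.append({
        "record_id": record_id,
        "field": field,
        "old_value": old_value,   # value before the change, for traceability
        "new_value": new_value,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })

audit_log = []
log_change(audit_log, "A001", "amount", 10.0, 12.5)
```

In a production pipeline the log table would live in a database and be written in the same transaction as the change itself, so the audit trail can never drift from the data.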
Data traceability takes time to design and implement. It is, however, strategically critical for data architects and engineers to build it into the pipeline from the beginning; it is definitely worth the effort considering it will save a tremendous amount of time when a data quality issue does happen. Furthermore, data traceability lays the foundation for improved data quality reports and dashboards that enable one to find data issues earlier, before the data is delivered to clients or internal users.
Obviously, data quality issues often occur when a new dataset is introduced or an existing dataset is modified. For effective change management, test plans should be built with 2 themes: 1) confirming the change meets the requirement; 2) ensuring the change does not have an unintentional impact on data in the pipelines that should not be changed. For mission-critical datasets, when a change happens, regular regression testing should be implemented for every deliverable, and comparisons should be done for every field and every row of a dataset. With the rapid progress of big data technologies, system migrations happen every few years. Automated regression tests with thorough data comparisons are a must to make sure good data quality is maintained consistently.
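The field-by-field, row-by-row comparison described above can be sketched as a diff between a baseline and a candidate version of a keyed dataset (function and field names are illustrative):

```python
def diff_datasets(baseline, candidate, key):
    """Compare two keyed datasets row by row and field by field.

    Returns differences as (key, field, baseline_value, candidate_value)
    tuples; a "<row>" field marks a row that was added or dropped entirely.
    """
    base = {row[key]: row for row in baseline}
    cand = {row[key]: row for row in candidate}
    diffs = []
    for k in sorted(set(base) | set(cand)):
        b, c = base.get(k), cand.get(k)
        if b is None or c is None:
            diffs.append((k, "<row>", b, c))  # row only in one version
            continue
        for f in sorted(set(b) | set(c)):
            if b.get(f) != c.get(f):
                diffs.append((k, f, b.get(f), c.get(f)))
    return diffs

before = [{"id": 1, "total": 100}, {"id": 2, "total": 200}]
after  = [{"id": 1, "total": 100}, {"id": 2, "total": 205}]
diffs = diff_datasets(before, after, "id")
```

In an automated regression suite, an empty diff (or a diff containing only the intended changes) would gate the release of the modified pipeline.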
Lastly, 2 types of teams play critical roles in ensuring high data quality for an organization:
Quality Assurance: This team checks the quality of software and programs whenever changes happen. Rigorous change management performed by this team is essential to ensure data quality in an organization that undergoes fast transformations and changes with data-intensive applications.
Production Quality Control: Depending on the organization, this does not have to be a separate team; sometimes it can be a function of the Quality Assurance or Business Analyst team. The team needs to have a good understanding of the business rules and business requirements, and be equipped with the tools and dashboards to detect abnormalities, outliers, broken trends, and any other unusual scenarios that happen in Production. The objective of this team is to identify any data quality issue and have it fixed before users and clients notice. This team also needs to partner with customer service teams so it can get direct feedback from customers and address their concerns quickly. With the advances of modern AI technologies, efficiency can potentially be improved drastically. However, as stated at the beginning of this article, quality control at the end is necessary but not sufficient to ensure a company creates and sustains good data quality. The 6 steps stated above are also required.
https://fr.wikipedia.org/wiki/Data_quality_management
Data quality management (DQM) is an information management method whose objective is to manage and compare data across a company's different information systems or databases.
As a general rule, it is about turning quality data into useful information that is essential to the business.
DQM serves the same objectives as master data management #476.