VForWaTer / metacatalog

Modular metadata management platform for environmental data.
https://vforwater.github.io/metacatalog
GNU General Public License v3.0

Add DQ_DataQuality #79

Open mmaelicke opened 4 years ago

mmaelicke commented 4 years ago

For data quality, we need some kind of workaround. We will not receive most of the information needed to fully describe the data lineage. Both the data lineage and a record of data quality measures are required to satisfy ISO 19115. In metacatalog, we would have to implement a set of new tables. Both lineage and data quality reports have a 1:m relationship to Entry.
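To make the 1:m relationships concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table and column names (`entries`, `lineage`, `quality_reports`, `statement`, `description`) are assumptions for illustration only and are not metacatalog's actual schema.

```python
import sqlite3

# Hypothetical schema: one entry can own many lineage records
# and many quality reports (1:m in both cases).
SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE lineage (
    id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    statement TEXT NOT NULL
);
CREATE TABLE quality_reports (
    id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    description TEXT NOT NULL
);
"""

CHILD_TABLES = ("lineage", "quality_reports")


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with one entry and its child records."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    con.executescript(SCHEMA)
    con.execute("INSERT INTO entries (id, title) VALUES (1, 'demo entry')")
    con.execute(
        "INSERT INTO lineage (entry_id, statement) VALUES (?, ?)",
        (1, "no lineage information provided"),
    )
    con.executemany(
        "INSERT INTO quality_reports (entry_id, description) VALUES (?, ?)",
        [(1, "first check"), (1, "second check")],
    )
    con.commit()
    return con


def count_children(con: sqlite3.Connection, table: str, entry_id: int) -> int:
    """Count the records in a child table that belong to one entry."""
    if table not in CHILD_TABLES:
        raise ValueError(f"unknown child table: {table}")
    row = con.execute(
        f"SELECT COUNT(*) FROM {table} WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    return row[0]
```

With `con = build_demo()`, `count_children(con, "quality_reports", 1)` counts the reports attached to the demo entry. A child row pointing at a non-existent entry is rejected by the foreign key.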

Lineage

I can't see how we can implement source and processing steps into lineage here.

Report

Reports are basically chronological combinations of a registered, certified process identifier with a free-text description and a list of possible outcomes. As we design metacatalog to make use of quality-checked data, we would be implementing a granular, highly specialized scheme to store information that, in most cases, we either don't want or won't get from data holders. At first glance, more than 20 tables are necessary to describe the possible outcomes.

The only possibility I see here is to define a few (say, 3) quality measure outcomes that are available in metacatalog and map them to ISO 19115 on export. We still have the issue that each implemented quality measure result needs a citable authority that standardized this particular outcome in the first place. ISO requires a citation of this authority to identify data quality measure outcomes. So if we come up with our own stuff here, we need to publish a controlled CodeList, I guess.
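As a rough sketch of what "a few outcomes mapped on export" could look like: the three outcome names, the `CODELIST_CITATION` placeholder and the shape of the exported dictionary are all assumptions. The keys loosely follow the `specification`, `explanation` and `pass` properties of `DQ_ConformanceResult`, and this is not a complete or validated ISO 19115 export.

```python
from enum import Enum
from typing import Any, Dict


class QualityOutcome(Enum):
    """Hypothetical small set of outcomes stored in metacatalog."""
    PASSED = "passed"
    FAILED = "failed"
    NOT_EVALUATED = "not_evaluated"


# Placeholder for the citable authority; such a CodeList would still
# have to be published before it could be cited.
CODELIST_CITATION: Dict[str, str] = {
    "title": "metacatalog quality outcome CodeList (hypothetical, unpublished)",
}


def to_iso_result(outcome: QualityOutcome, explanation: str = "") -> Dict[str, Any]:
    """Map an internal outcome to a DQ_ConformanceResult-like dictionary."""
    if outcome is QualityOutcome.NOT_EVALUATED:
        passed = None  # assumption: how to export this case is still open
    else:
        passed = outcome is QualityOutcome.PASSED
    return {
        "type": "DQ_ConformanceResult",
        "specification": CODELIST_CITATION,
        "explanation": explanation or outcome.value,
        "pass": passed,
    }
```

For example, `to_iso_result(QualityOutcome.FAILED, "range check")` yields a result with `"pass"` set to `False` and the CodeList placeholder as its `specification`.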

I am not sure how to handle this in metacatalog, and ideas are highly appreciated. @sihassl @MarcusStrobl At the end of the day, at least 1 record has to be in both report and lineage. The more complicated question will be how to handle that information on import if we can't map it.
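One possible way to think about the two constraints above, purely as a sketch: check that both lineage and report have at least one record before export, and on import separate the outcomes that map to a known value from the ones that do not. The outcome names and both function names are hypothetical, and what to do with the unmapped values is exactly the open question.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical set of outcome names known to metacatalog.
KNOWN_OUTCOMES: Set[str] = {"passed", "failed", "not_evaluated"}


def missing_for_export(lineage: List[str], reports: List[str]) -> List[str]:
    """Return the names of the required parts that have no record yet."""
    missing = []
    if not lineage:
        missing.append("lineage")
    if not reports:
        missing.append("report")
    return missing


def split_imported_outcomes(values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split imported outcome strings into mapped and unmapped ones.

    Mapped values are normalized to the known names; unmapped values
    are returned verbatim so nothing is silently dropped.
    """
    mapped: List[str] = []
    unmapped: List[str] = []
    for value in values:
        key = value.strip().lower()
        if key in KNOWN_OUTCOMES:
            mapped.append(key)
        else:
            unmapped.append(value)
    return mapped, unmapped
```

Returning the unmapped values verbatim at least makes the gap visible on import; whether they are stored as free text, rejected, or dropped is left undecided here.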

AlexDo1 commented 2 years ago

This issue is very relevant as it is about data quality implementation and already contains a lot of information and potential problems that may occur. I would definitely keep this topic.