OHDSI / WhiteRabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by WhiteRabbit.
http://ohdsi.github.io/WhiteRabbit
Apache License 2.0

Add ability to track ETL unit test coverage #180

Open schuemie opened 4 years ago

schuemie commented 4 years ago

As discussed with Andrew Williams, it would be very helpful (also as a data quality metric) to be able to compute coverage of the unit tests.

One way to compute this would be to store the field-to-field mappings defined in RiaH in the testing framework. We could then compute which of these mappings are covered in at least one expect_ statement, and probably more importantly which ones are not covered.
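As a rough illustration of that idea, the following Python sketch computes coverage from a set of field-to-field mappings against the target fields touched by `expect_` statements. All mappings, field names, and data here are invented for illustration; this is not RiaH's or the testing framework's actual representation.

```python
# Hypothetical sketch: which RiaH field-to-field mappings are exercised by
# at least one expect_ statement? All names below are illustrative only.

# Field-to-field mappings as they might be exported from RiaH
# (source field, target field).
mappings = {
    ("person.gender", "person.gender_concept_id"),
    ("person.dob", "person.birth_datetime"),
    ("visit.start", "visit_occurrence.visit_start_date"),
}

# Target fields referenced by at least one expect_ statement in the tests.
tested_targets = {
    "person.gender_concept_id",
    "visit_occurrence.visit_start_date",
}

covered = {m for m in mappings if m[1] in tested_targets}
uncovered = mappings - covered

coverage = len(covered) / len(mappings)
print(f"Coverage: {coverage:.0%}")  # 2 of 3 mappings covered
for source, target in sorted(uncovered):
    print(f"NOT COVERED: {source} -> {target}")
```

The more actionable output is arguably the `NOT COVERED` list, matching the point above that the uncovered mappings matter most.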

AEW0330 commented 4 years ago

In addition to coverage by unit tests, it would be helpful to pull the logic for unit tests directly from the DQ Dashboard. The idea would be to use the field-to-field mappings defined in RiaH to identify all the data checks in the DQD for the target OMOP field. The logic and values in the unit test might then be built automatically from the pattern defined by the DQD check type and the source and target fields in RiaH.

The implication for the coverage calculation is that multiple unit tests would apply to different DQ attributes of the target field in OMOP. The denominator in the coverage calculation would therefore not be the count of each piece of mapping code; it would be the count of all DQ attributes related to the proper execution of each piece of mapping code: conformance, consistency with other data, etc.
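To make that denominator concrete, here is a small sketch counting (mapping, DQ attribute) pairs instead of mappings alone. The attribute names and mappings are made up; the real check types would come from the DQ Dashboard.

```python
# Hypothetical sketch: coverage denominator = count of (mapping, DQ attribute)
# pairs, not count of mappings. Names are illustrative only.

# DQ attributes (DQD-style check types) relevant to each mapping.
dq_attributes = {
    "person.gender -> gender_concept_id": ["conformance", "plausibility"],
    "person.dob -> birth_datetime": ["conformance", "consistency",
                                     "plausibility"],
}

# (mapping, attribute) pairs exercised by at least one unit test.
tested = {
    ("person.gender -> gender_concept_id", "conformance"),
    ("person.dob -> birth_datetime", "conformance"),
    ("person.dob -> birth_datetime", "consistency"),
}

denominator = sum(len(attrs) for attrs in dq_attributes.values())  # 5 pairs
coverage = len(tested) / denominator
print(f"{len(tested)}/{denominator} DQ attributes covered ({coverage:.0%})")
```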

Not sure if this belongs in a separate issue: A related piece of work in RiaH would be to enable relevant ETL snippets to be retrieved and automatically displayed for failed DQ checks inside the DQ Dashboard. The DQD now shows the DQ check as implemented in SQL. Pulling the relevant ETL snippet automatically would allow ETLers to quickly determine whether the check failed because of a coding error or a characteristic of the source data.

I suspect automation of all this might be greatly facilitated by using GUIDs for the OMOP version, instance, table, field, source concept, standard concept (if different) and DQ check. We don't have those for the CDM version, instance, and DQ check yet. Maybe use checksums or the output of a fingerprinting function for the instances until the community settles on whether/how to uniquely identify OMOP versions and instances? The GUIDs for check types could be a simple enumeration or use some meaningful encoding of the relevant check type, field, and concept. Regardless of whether those are the proper IDs, something that tags the relevant section of ETL code so that it could be pulled would be needed. Making the tags do double work, by enabling them to be used in the ETL coding itself if that's possible, seems like it might be useful because of the extra effort it would save for tagging and documentation.
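As a rough sketch of the tagging idea, the comment-tag format below (`@etl-tag` / `@etl-end`) is invented for illustration, not an existing convention; the snippet shows how a tagged ETL section could be pulled for display next to a failed check.

```python
import re

# Hypothetical ETL source with invented comment tags marking the snippet
# responsible for each target field.
etl_source = """\
-- @etl-tag: person.gender_concept_id
UPDATE person SET gender_concept_id = 8507 WHERE src_gender = 'M';
-- @etl-end
-- @etl-tag: person.birth_datetime
UPDATE person SET birth_datetime = parse_date(src_dob);
-- @etl-end
"""

def snippet_for(tag: str, source: str) -> str:
    """Pull the ETL snippet tagged for a given target field, e.g. to show
    next to a failed DQD check."""
    pattern = re.compile(
        rf"-- @etl-tag: {re.escape(tag)}\n(.*?)-- @etl-end", re.DOTALL)
    match = pattern.search(source)
    return match.group(1).strip() if match else ""

print(snippet_for("person.gender_concept_id", etl_source))
```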

MaximMoinat commented 4 years ago

Hi Andrew, I feel like I need to clarify a few things.

RiaH does not implement the ETL, it only documents the mapping rules in free text. The information we do have is which source fields should map to which target fields. So extracting the relevant ETL code cannot be done by RiaH.

And I do not see how we can pull the logic of the unit tests from the DQD. The input for a unit test is a given source record and the expected omop record (after ETL). As the DQD does not have any knowledge about the source data, it cannot produce the unit tests. Does this make sense?

@schuemie I will work on a simple function that outputs mapping coverage. In past projects we already used a code snippet to output the results of the unit tests.

schuemie commented 4 years ago

Thanks!

I think perhaps what @AEW0330 is looking for is some structured representation of the unit tests to be included in the DQD? So if in the DQD we click on a particular CDM field, we can see what unit tests were specified for getting the data into that field. This might be a bit tricky, since we have to link the source data creation statements (add_...) to specific expect statements.