marcel1334 opened this issue 5 years ago
Suppose you had these percentiles for dates: how would this help you in RaH? What decisions would you make differently with that knowledge? Can you give an example?
Why are trends important when you do an ETL? You need to make sure that all the data ends up in the CDM as it is in the source.
Can you give examples?
The question here is: what exactly is the function of WR?
1) Enabling ETL design by providing input on the source structure, vocabularies, etc., or 2) a profiling tool for the source (and possibly CDM) database to get insight into all kinds of quality issues?
For me it is 1), and 2) belongs to another, validation-type tool. I think it is important to define its scope clearly within the set of tools that exist or will be developed.
When designing an ETL, I first want to have good insight into the underlying database: getting an impression of a source database and its content before starting the ETL design. I don't think many people will start by looking at the table and attribute names and directly begin the ETL design. Getting these insights can also surface quality issues, but that is not necessarily the main goal in this phase.
This is comparable to having a first look at a newly converted CDM database: most people will start by looking at the Achilles profiles to get a feeling for it. But that is only available once all the hard work is done and the database is already converted.
For our own IPCI ETL (and also for the Danish Aarhus database ETL), when Patrick, Martijn and Erica were helping us to get started, the first thing they asked was to run WhiteRabbit on our source database. We then started looking at the WR Excel sheet to talk about the source database. In the discussions for IPCI we often used our own DataProfiler (which also contains histograms) to explain the content of our database to them. But not every source database has profiling tools.
WR only had the top 100 (or 1000?) most frequent values, and this did not give much insight into the content of the database. It told us the data type (numeric, text, date) and sometimes that a field was always empty (and can therefore be ignored in the ETL). It would be helpful to add a couple of extra parameters.
Was this situation 1 or 2? Clearly 2: getting a good insight into the source database. And we used WR for this. We could think of a separate tool for this, but it has a lot of overlap with the current WhiteRabbit features. Another option is to combine the two, since for both the dataset has to be scanned.
Another way to look at this is to consider WR a source database profiler that can export a smaller file to start RabbitInAHat. The input for RabbitInAHat is then a subset of the profiling data.
But if all this will be possible with a specific source database profiler... super!
@PRijnbeek Another example: a standard question during a mapping workshop is when data collection started and when it ended. Or: should we expect events outside the observation period? These min/max dates would answer that.
The quartiles would give some additional information on the spread of dates.
So I would definitely say that this feature is 1): enabling ETL.
@MaximMoinat, @marcel1334 as long as it impacts the ETL itself, we of course agree here.
The example of the data collection start and end, or of events outside of observation time, may indeed be interesting; however, that knowledge will not really impact the ETL. You apply the THEMIS rules anyway and should test them using a testing framework. On the other hand, I agree that adding this does not hurt either...
"WR only had the top 100 (or 1000?) most frequent values, and this did not give much insight in the content of the database. It told us the datatype (numeric, text, date) and sometimes that a field was always empty (and therefore can be ignored in the ETL). It would be helpful to add a couple of extra parameters."
Not sure i see the point here. The insight you need is to understand what type of data is in the column and in what coding system (unless you have a multi-purpose column). Not clear to me why you would need to know more than 100 examples for this. The code mapping process is another step in the process.
Finally, I do think we need to keep away from the Quality assessment of source data in this tool and first focus on getting all the source data in the CDM and apply the newly developed framework there.
To make it a bit clearer what we are talking about, I made a mockup of what this would look like in the scan report. The next step is to do the same for dates and to present this information in RiaH.
@marcel1334 Is this what you were aiming for? Any suggestions?
@MaximMoinat This kind of analysis seems more appropriate to do on the source data... After the conversion to concept IDs, the columns are no longer mapped as one concept per column, where a distribution or trend may be useful to identify.
Sorry, I created a confusing example. The idea is indeed to execute this on the source data. So we will not see the concept_ids.
You do raise a good question. Ideally, we would like to distinguish between numeric values and identifiers (e.g. patient ids). An idea is to hide the analysis for any fields ending in 'id'.
Although the example shows a CDM table, the principle of the mockup looks fine.
In the CDM database we could hide the *_id fields because it's in the naming convention. But since we do this on the source database, we don't know the naming convention; it is even likely to be in another language. So I guess filtering the statistics based on the field name won't work that well. I don't think it's a problem to show the stats for all fields. The maximum numeric value can also be used to determine field sizes.
Besides integer stats, we can do the same stats for floats and dates. And for character types we can show stats on the number of characters.
It would be great if it also calculated the min, 25%, 50%, 75% and max values to enable boxplots; not only for numeric values, but also for dates.
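For illustration, here is a minimal Java sketch of how such a five-number summary could be computed for one numeric field; this is not WhiteRabbit's actual code, and the class and method names are made up. Dates could reuse the same logic by converting them to epoch days first:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: five-number summary (min, Q1, median, Q3, max) for one column.
// Not WhiteRabbit's actual implementation; names are illustrative only.
public class FiveNumberSummary {

    // Percentile by linear interpolation on a sorted array (p in [0, 1]).
    static double percentile(double[] sorted, double p) {
        double rank = p * (sorted.length - 1);
        int lower = (int) Math.floor(rank);
        int upper = (int) Math.ceil(rank);
        double fraction = rank - lower;
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }

    // Returns {min, 25%, 50%, 75%, max} for the given values.
    static double[] summarize(List<Double> values) {
        double[] sorted = values.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        return new double[] {
            sorted[0],
            percentile(sorted, 0.25),
            percentile(sorted, 0.50),
            percentile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    public static void main(String[] args) {
        // Dates would be converted to epoch days (LocalDate.toEpochDay()) and back.
        System.out.println(Arrays.toString(summarize(List.of(3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0))));
        // -> [1.0, 1.5, 3.0, 4.5, 9.0]
    }
}
```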
To get insight into trends and/or gaps in the values and dates, the most frequent values cannot be used, and having all possible values and their counts in the scan report can get huge. Another option is to collect all values internally first and, if there are too many values, group them and make the grouped counts available in the scan result. For example, if there are too many dates, group them by year+month; for numeric values: "0-10", "11-20", "21-30", etc.
For text values we can do a similar thing with the length of the value. This is useful to get an idea of the volume of, for example, discharge letters.
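As a rough sketch of this grouping idea (bin widths, labels and method names are made up, not an existing WhiteRabbit feature): when a field has too many distinct values, the counts could be rolled up into year+month bins for dates, fixed-width ranges for numeric values, and length ranges for text values:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: roll up a field with too many distinct values into coarser bins.
// Names and bin widths are illustrative; assumes non-negative numeric values.
public class ValueBinning {

    // Dates: group by year + month, e.g. "2014-07" -> count.
    static Map<String, Long> binDatesByMonth(List<LocalDate> dates) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM");
        Map<String, Long> counts = new TreeMap<>();
        for (LocalDate d : dates)
            counts.merge(d.format(fmt), 1L, Long::sum);
        return counts;
    }

    // Numeric values: fixed-width bins, e.g. "0-9", "10-19", ...
    // (Labels sort lexicographically in this sketch; a real report would sort numerically.)
    static Map<String, Long> binNumerics(List<Integer> values, int width) {
        Map<String, Long> counts = new TreeMap<>();
        for (int v : values) {
            int lower = (v / width) * width;
            counts.merge(lower + "-" + (lower + width - 1), 1L, Long::sum);
        }
        return counts;
    }

    // Text values: bin on the length of the value, e.g. to gauge the volume of discharge letters.
    static Map<String, Long> binTextLengths(List<String> values, int width) {
        Map<String, Long> counts = new TreeMap<>();
        for (String s : values) {
            int lower = (s.length() / width) * width;
            counts.merge(lower + "-" + (lower + width - 1) + " chars", 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(binNumerics(List.of(3, 7, 15, 42, 44), 10)); // {0-9=2, 10-19=1, 40-49=2}
        System.out.println(binTextLengths(List.of("short", "a much longer text value"), 10));
    }
}
```

Collecting the values internally first, as described above, would let the scanner decide per field whether to report exact counts or these grouped counts.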