Open pmayd opened 1 year ago
Probably would start with some string strip and regex comparison approach to a set of given "expected values" and additional synonym checks
so each column could have a set of allowed values with nested synonyms as well (e.g for "insulin regimen" this might be relevant) represented in a yaml/json structure:
restricted_content_spec.json
{
"column1":
{
"allowed": [
"test1",
"test2"
],
"synonyms": {
"test1": [
"testone",
"testuno"]
}
}
}
so the "restricted_content_checker" could look for any specification based on the column name (here "column1") if it exists it does a str_split/regex approach to test the content against the allowed values and replaces them. In a second step the same checks are repeated for all the manually specified synonyms.
@KonradUdoHannes maybe this is something worth of your time ;)
@pmayd the cleaning step in our pipeline currently makes use of check_allowed_values
with hardcoded configurations for a few columns for patient data. In particular
How final are the configurations for these allowed value vectors? For instance support_from_a4d
has a lot of issues with the string data for combinatorial reasons. Some entries are SMBG, Insulin and hba1c and combinations of these in different order and with different capitalizations. This seems like potentially relevant information to me, which we could clean up in detail. On the other hand non of the three entries or their combinations is currently an allowed value according to the list.
@pmayd Do you have any thoughts/opinions on this or should we maybe discuss this in a larger round in slack? In the end cleaning the strings more has to go hand in hand with adjusting the allowed values.
I will try to find time in the evening
Yea, check_allowed_values
was implemented in the most easiest way and I think @lboel already made a good suggestion to move the hardcoded values out of every single call and into a centralized YAML file under reference data that is easier to maintain. This would be ideal as it means we can simply pass the column name to the function and the function can internally load the YAML mapping and look up the allowed values for this column as well as if any invalid value should be replaced with an error value, and also specify this error value. So that would be my favorite solution.
As of how final these configurations are: not final at all, Tyla has never agreed to anything we do....I don't know how many times I have to ask her and tell her she needs to give us a OK in this regard but I guess she just trusts us so as I said with a YAML we have the best control and also the best documentation about what we are doing here. Maybe list all the problems you see at the moment and lets ask Tyla how we should handle these. Insulin regimen is actually more used than support from A4d, I am not aware of any use case for the latter but the first is important as insulin regimen is likely something they aggregate by to see how users of different regimens fare against each other. So it would be great to improve this BUT the earlier trackers were not selected from a dropdown list in this filed but rather free text so it will be impossible to match these, we can only do so for the latest data like 2020 - 2022, more details are in the excel file I guess
this looks promising: https://statisticsglobe.com/fuzzy-matching-r#stringdist-package for fuzzy matching the text columns to some target values.
@KonradUdoHannes
@KonradUdoHannes here is an example of what Dr May is calculating from the longitudinal data for publications:
from_tyla_KH_Apr'22 Publication Summary Data (1).xlsx
As you can see, there are exactly the two categories for insulin regime we were talking about so hopefully that helps
Right now we do not change the content of a character column like
status
so the actual value can be written in any way. That is why we haveActive
andactive
, among other cases. This is even more true for other character columns that do not have a limited range of allowed values.So maybe we can spend some time aligning different writing cases so that we have more matches and less unique values
Focus on the columns that are important for the reports