CorrelAidSwitzerland / a4d

The project repository for the Data4Good project in collaboration with A4D
Other
1 stars 0 forks source link

Handle different cases like upper and lower case for text columns #76

Open pmayd opened 1 year ago

pmayd commented 1 year ago

Right now we do not change the content of a character column like status so the actual value can be written in any way. That is why we have Active and active, among other cases. This is even more true for other character columns that do not have a limited range of allowed values.

So maybe we can spend some time aligning different writing cases so that we have more matches and less unique values

Focus on the columns that are important for the reports

lboel commented 1 year ago

Probably would start with some string strip and regex comparison approach to a set of given "expected values" and additional synonym checks

so each column could have a set of allowed values with nested synonyms as well (e.g for "insulin regimen" this might be relevant) represented in a yaml/json structure:

restricted_content_spec.json

  {
    "column1": 
      {
        "allowed": [
          "test1", 
          "test2"
        ],
        "synonyms": {
          "test1": [
            "testone", 
            "testuno"]
        }
      }
   }

so the "restricted_content_checker" could look for any specification based on the column name (here "column1") if it exists it does a str_split/regex approach to test the content against the allowed values and replaces them. In a second step the same checks are repeated for all the manually specified synonyms.

pmayd commented 1 year ago

@KonradUdoHannes maybe this is something worth of your time ;)

KonradUdoHannes commented 1 year ago

@pmayd the cleaning step in our pipeline currently makes use of check_allowed_values with hardcoded configurations for a few columns for patient data. In particular

How final are the configurations for these allowed value vectors? For instance support_from_a4d has a lot of issues with the string data for combinatorial reasons. Some entries are SMBG, Insulin and hba1c and combinations of these in different order and with different capitalizations. This seems like potentially relevant information to me, which we could clean up in detail. On the other hand non of the three entries or their combinations is currently an allowed value according to the list.

KonradUdoHannes commented 1 year ago

@pmayd Do you have any thoughts/opinions on this or should we maybe discuss this in a larger round in slack? In the end cleaning the strings more has to go hand in hand with adjusting the allowed values.

pmayd commented 12 months ago

I will try to find time in the evening

pmayd commented 12 months ago

Yea, check_allowed_values was implemented in the most easiest way and I think @lboel already made a good suggestion to move the hardcoded values out of every single call and into a centralized YAML file under reference data that is easier to maintain. This would be ideal as it means we can simply pass the column name to the function and the function can internally load the YAML mapping and look up the allowed values for this column as well as if any invalid value should be replaced with an error value, and also specify this error value. So that would be my favorite solution.

As of how final these configurations are: not final at all, Tyla has never agreed to anything we do....I don't know how many times I have to ask her and tell her she needs to give us a OK in this regard but I guess she just trusts us so as I said with a YAML we have the best control and also the best documentation about what we are doing here. Maybe list all the problems you see at the moment and lets ask Tyla how we should handle these. Insulin regimen is actually more used than support from A4d, I am not aware of any use case for the latter but the first is important as insulin regimen is likely something they aggregate by to see how users of different regimens fare against each other. So it would be great to improve this BUT the earlier trackers were not selected from a dropdown list in this filed but rather free text so it will be impossible to match these, we can only do so for the latest data like 2020 - 2022, more details are in the excel file I guess

lboel commented 11 months ago

this looks promising: https://statisticsglobe.com/fuzzy-matching-r#stringdist-package for fuzzy matching the text columns to some target values.

lboel commented 11 months ago

@KonradUdoHannes

pmayd commented 11 months ago

@KonradUdoHannes here is an example of what Dr May is calculating from the longitudinal data for publications:

from_tyla_KH_Apr'22 Publication Summary Data (1).xlsx

As you can see, there are exactly the two categories for insulin regime we were talking about so hopefully that helps