Open josephjclark opened 1 week ago
Maybe we can use ICD10 codings as a first pass? We can just map disease diagnoses and manually cherry-pick inputs.
That gives us an easier pipeline, which we can then extend to use LOINC and SNOMED.
Ok, this will really help us slice down the basic mappings.
What we're trying to do here is generate the sheets of the lookup-tables xlsx.
Each sheet is tightly contextualised: mapping a list of observations (as commcare strings) to a particular subset of loinc codes.
Our input to this workflow is going to be more like "For this particular commcare form or input, extract the input values, and for each of these, map them to one of these loinc observations (and by the way, map them as fhir codings in this format)".
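To make the "map them as fhir codings" part concrete, here is a minimal sketch of what one row of a generated sheet could look like. The FHIR Coding fields (system, code, display) and the LOINC system URI are standard; the row layout, the function names and the example code value are assumptions for illustration only, not the project's actual format.

```python
from typing import TypedDict

# Standard FHIR system URI for LOINC.
LOINC_SYSTEM = "http://loinc.org"


class Coding(TypedDict):
    system: str
    code: str
    display: str


def to_fhir_coding(code: str, display: str, system: str = LOINC_SYSTEM) -> Coding:
    """Wrap a code and its display text as a FHIR Coding."""
    return {"system": system, "code": code, "display": display}


def build_lookup_rows(matches: dict[str, tuple[str, str]]) -> list[dict]:
    """Turn {commcare string: (code, display)} into lookup-table rows.

    The "source"/"coding" row shape is hypothetical.
    """
    return [
        {"source": source, "coding": to_fhir_coding(code, display)}
        for source, (code, display) in matches.items()
    ]


# Placeholder values, not real LOINC codes.
rows = build_lookup_rows({"example input value": ("EXAMPLE-1", "Example observation")})
```

Passing a different `system` would let the same helper emit ICD10 codings instead.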
CC @hanna-paasivirta
@hanna-paasivirta I've updated the issue based on yesterday's updates. We'll talk about it this morning.
The first mapping service, Aisha's mapping service, is designed to be the simplest form of mapper we can get.
The mapper is designed to generate a single sheet of the lookup tables spreadsheet. To get all the lookup tables needed for the asri-satusehat project, you have to run the mapper multiple times.
It'll work like this:
And that's it.
The payload needs to be like:
(this example maps to ICD10 codes, not fhir)
We will want to add some script or CLI support to build this payload from files (the openfn CLI could help here)
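As a rough idea of what such a script might do, the sketch below reads the source strings from a text file (one per line) and the candidate target codes from a CSV, then assembles them into a single JSON-ready payload. The payload keys (`inputs`, `targets`), the CSV column names and the file names are all assumptions; the real payload shape is whatever the service ends up requiring.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_payload(inputs_path: Path, targets_path: Path) -> dict:
    """Assemble a mapping payload from two files (hypothetical shape)."""
    # One source string per line; blank lines are ignored.
    lines = inputs_path.read_text(encoding="utf-8").splitlines()
    inputs = [line.strip() for line in lines if line.strip()]

    # Target codes come from a CSV with "code" and "display" columns.
    with targets_path.open(newline="", encoding="utf-8") as f:
        targets = [
            {"code": row["code"], "display": row["display"]}
            for row in csv.DictReader(f)
        ]

    return {"inputs": inputs, "targets": targets}


# Small demonstration with placeholder data written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    inputs_file = Path(tmp) / "inputs.txt"
    targets_file = Path(tmp) / "targets.csv"
    inputs_file.write_text("first input\n\nsecond input\n", encoding="utf-8")
    targets_file.write_text(
        "code,display\nCODE-1,First target\nCODE-2,Second target\n",
        encoding="utf-8",
    )
    payload = build_payload(inputs_file, targets_file)

payload_json = json.dumps(payload, indent=2)
```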
Design Notes
In this system, do we even need a database and embeddings? We're asking the human to slice down the dataset for us, so we may not even need RAG. Having said that, a) if we run the model for each individual mapping, then maybe we still want to use RAG even with a small list of inputs, and b) if we build it with RAG now, then we can scale better later.
Ok, so we probably do want to use embeddings. But I think we still need to dynamically embed the inputs from the user, rather than pulling from a larger corpus.
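A minimal sketch of what "dynamically embed the inputs" could mean: the candidate list supplied with the request is embedded on the fly, and each source string is ranked against it, with no persistent database. The character-trigram "embedding" here is only a dependency-free stand-in for a real embedding model, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    normalised = f"  {text.lower().strip()} "
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query: str, candidates: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Embed the user-supplied candidates per request and rank them for one query."""
    index = [(candidate, embed(candidate)) for candidate in candidates]
    query_vec = embed(query)
    scored = sorted(
        ((cosine(query_vec, vec), candidate) for candidate, vec in index),
        reverse=True,
    )
    return [(candidate, round(score, 3)) for score, candidate in scored[:top_k]]


candidates = ["Dengue fever", "Malaria", "Typhoid fever"]
best = rank_candidates("dengue", candidates, top_k=1)
```

The shortlist returned by `rank_candidates` would then be handed to the model, rather than the full target list.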
This is a tricky detail and key to the whole design.
Mapping Notes
Which sheet do we want to start with? They all have problems...
Validation
In Aisha's project specifically, we have the actual answers to the mapping in the spreadsheet.
We should be able to use that information to build formal validations of the results: we can literally diff the output against the known answers and report a success score.
More generally, suppose you're doing some mapping, you get some results out and verify them, but then you want to go back and regenerate the mappings because you've added new inputs (or one turned out to be wrong). You might well want to pass your previous result in as a validation set to make sure things aren't moving.
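Both ideas, scoring against the known answers and checking a regenerated run against a previously verified one, reduce to the same diff. A minimal sketch, assuming mappings are held as plain {source string: code} dictionaries (the real output format may differ, and the codes below are placeholders):

```python
def score_mappings(actual: dict[str, str], expected: dict[str, str]) -> dict:
    """Diff a generated mapping against a reference and report a success score."""
    mismatches = {
        source: {"expected": code, "actual": actual.get(source)}
        for source, code in expected.items()
        if actual.get(source) != code
    }
    total = len(expected)
    correct = total - len(mismatches)
    return {
        "score": correct / total if total else 1.0,
        "correct": correct,
        "total": total,
        "mismatches": mismatches,
    }


# Reference answers (e.g. from the existing spreadsheet, or a verified earlier run).
expected = {"input a": "CODE-1", "input b": "CODE-2", "input c": "CODE-3", "input d": "CODE-4"}
# A regenerated run: one mapping has moved, and a new input has been added.
actual = {"input a": "CODE-1", "input b": "CODE-9", "input c": "CODE-3", "input d": "CODE-4", "input e": "CODE-5"}

report = score_mappings(actual, expected)
```

Inputs that appear only in the new run (like `input e`) are not counted against the score, so adding inputs does not by itself register as drift.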
CLI Integration
I don't know if we want to integrate this into the CLI as part of this work, but I think it's helpful to look ahead and think about what the CLI interface might look like.
Limitations
Obviously this is a very limited system!
The fhir thing isn't quite right, is it? We need to map to ICD10 for diagnosis and FHIR for body sites.
Next Steps
We need to go on to solve the following problems. These are part of the main top level epic really, but I want to be mindful of upcoming problems and possible solutions.
loinc.diseases, and then set the "target dataset" input to loinc.diseases? And if so, how does the user know what datasets are available, and how do they add more?
"Take these commcare strings and map them to disease codes in the attached loinc dataset, and encode them as fhir codings with a display, system and code". I have two concerns about this: unless you know the pattern, is it too hard to provide the text? And what if key information is missing, or the AI fails to extract it?