OpenFn / apollo

GNU Lesser General Public License v2.1

Basic Mapping Pipeline #112

Open josephjclark opened 1 week ago

josephjclark commented 1 week ago

The first mapping service, Aisha's mapping service, is designed to be the simplest form of mapper we can get.

The mapper is designed to generate a single sheet of the lookup tables spreadsheet. To get all the lookup tables needed for the asri-satusehat project, you have to run the mapper multiple times.

It'll work like this:

And that's it.

The payload needs to be like:

{
  keys: ["Typhoid fever", "Dysentery, bacterial", ...],
  vocab: ["A01.0", "A06.0"]
}

(this example maps to ICD10 codes, not fhir)

We will want to add some script or CLI support to build this payload from files (the openfn CLI could help here)
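A minimal sketch of what such a script could look like, assuming the inputs and the vocabulary each live in a plain text file with one entry per line. The `read_lines` and `build_payload` helpers and the file layout are hypothetical; only the `keys`/`vocab` payload shape comes from the example above.

```python
import json
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_payload(keys_path, vocab_path):
    """Build the mapper payload from two one-entry-per-line files (assumed format)."""
    return {"keys": read_lines(keys_path), "vocab": read_lines(vocab_path)}


def payload_json(keys_path, vocab_path):
    """Serialise the payload so it can be posted to the mapping service."""
    return json.dumps(build_payload(keys_path, vocab_path), indent=2)
```

Calling `payload_json("inputs.txt", "vocab.txt")` would return a JSON string ready to send; the file names here are placeholders.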

Design Notes

In this system, do we even need a database and embeddings?? We're asking the human to slice down the dataset for us so we may not even need rag. Having said that, a) If we run the model for each individual mapping, then maybe we still want to use rag even from a small list of inputs, and b) if we build it with rag now, then we can scale better later.

Ok, so we probably do want to use embeddings. But I think we still need to dynamically embed the inputs from the user, rather than pulling from a larger corpus.

This is a tricky detail and key to the whole design.
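To make the "embed the user's inputs on the fly" idea concrete, here is a rough sketch. It assumes each vocab entry carries some descriptive text (not just a bare code), and it uses a toy character n-gram embedding purely as a stand-in for a real embedding model. The `embed`, `cosine` and `top_candidates` names are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: character n-gram counts (a stand-in for a real model)."""
    s = f" {text.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_candidates(key, vocab, k=2):
    """Embed the request's vocab at call time and return the k closest entries."""
    # The index is built per request from the user's list, not from a stored corpus.
    index = [(entry, embed(entry)) for entry in vocab]
    query = embed(key)
    ranked = sorted(index, key=lambda pair: cosine(query, pair[1]), reverse=True)
    return [entry for entry, _ in ranked[:k]]
```

The point of the sketch is the shape of the flow: the vocabulary arrives with the request, gets embedded in memory, and only the closest few candidates per input would be handed to the model.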

Mapping Notes

Which sheet do we want to start with? They all have problems...

Validation

In Aisha's project specifically, we have the actual answers to the mapping in the spreadsheet.

We should be able to use that information to build formal validations of the results: we can literally diff the output against the known answers and report a success score.

More generally, suppose you're doing some mapping, you get some results out and verify them, but then you want to go back and regenerate the mappings because you've added new inputs (or one turned out to be wrong). You might well want to pass your previous result as a validation set to make sure the existing mappings aren't moving.
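A minimal sketch of that diff-and-score idea, assuming both the generated mappings and the known answers are dictionaries keyed by the input string. The `score_mappings` name and the report shape are hypothetical.

```python
def score_mappings(actual, expected):
    """Compare generated mappings to a validation set and report a score."""
    # Collect every expected key whose generated value is missing or different.
    mismatches = {
        key: {"expected": want, "actual": actual.get(key)}
        for key, want in expected.items()
        if actual.get(key) != want
    }
    total = len(expected)
    correct = total - len(mismatches)
    return {
        "total": total,
        "correct": correct,
        # An empty validation set is treated as trivially passing.
        "score": correct / total if total else 1.0,
        "mismatches": mismatches,
    }
```

The same function would cover both cases: `expected` can be the answers already in the spreadsheet, or a previously verified result passed back in as a regression check.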

CLI Integration

I don't know if we want to integrate this into the CLI as part of this work, but I think it's helpful to look ahead and think about what the CLI interface might look like

# --vocab takes a path to a file, or the name of a pre-loaded dataset
openfn map-vocab \
    --vocab path/to/file \
    --inputs path/to/file \
    --prompt "map the inputs to fhir codings like { system, display, code }"

Limitations

Obviously this is a very limited system!

The FHIR thing isn't quite right, is it? We need to map to ICD10 for diagnoses and FHIR for body sites.

Next Steps

We need to go on to solve the following problems. These are part of the main top level epic really, but I want to be mindful of upcoming problems and possible solutions.

josephjclark commented 21 hours ago

Maybe we can do ICD10 codings as a first pass? We can just map disease diagnoses and manually cherry-pick inputs.

That gives us an easier pipeline, which we can then go on to extend to use LOINC and SNOMED.

josephjclark commented 21 hours ago

Ok, this will really help us slice down the basic mappings.

What we're trying to do here is generate the sheets of the lookup-tables xlsx.

Each sheet is tightly contextualised: mapping a list of observations (as commcare strings) to a particular subset of loinc codes.

Our input to this workflow is going to be more like "For this particular commcare form or input, extract the input values, and for each of these, map them to one of these loinc observations (and by the way, map them as fhir codings in this format)".
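As a rough illustration of the output side, here is a sketch that turns a set of chosen codes into FHIR-style codings in the `{ system, display, code }` shape. It assumes the allowed LOINC subset is available as a code-to-display dictionary. The `to_coding` and `build_sheet` helpers are hypothetical; `http://loinc.org` is the standard FHIR system URI for LOINC.

```python
LOINC_SYSTEM = "http://loinc.org"  # standard FHIR system URI for LOINC


def to_coding(code, display, system=LOINC_SYSTEM):
    """Build a FHIR-style coding object."""
    return {"system": system, "code": code, "display": display}


def build_sheet(mappings, subset):
    """Turn {input string: chosen code} into {input string: coding}.

    `subset` is the allowed slice of the vocabulary as {code: display};
    a code outside that slice is rejected rather than silently accepted.
    """
    sheet = {}
    for value, code in mappings.items():
        if code not in subset:
            raise ValueError(f"{code!r} is not in the allowed subset for {value!r}")
        sheet[value] = to_coding(code, subset[code])
    return sheet
```

Rejecting codes outside the subset keeps each sheet tightly scoped, which matches the idea that a human has already sliced the vocabulary down for that sheet.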

CC @hanna-paasivirta

josephjclark commented 7 minutes ago

@hanna-paasivirta I've updated the issue based on yesterday's updates. We'll talk about it this morning.