kuraisle opened this issue 2 months ago
We have a dataset to use as an example for adding non-drug domains to LLettuce's use case: the TwinsUK phenobase. The part of this that's interesting for us is the "Variables" sheet of a spreadsheet I was sent, which has two columns of interest:
| PhenotypeName | PhenotypeDescription |
| --- | --- |
| Sensitivity to Allergens | Score of subjects' sensitivities to 112 allergen components using frozen serum. |
| Clotting factor | Results for VWF (von Willebrand factor) |
| ... | ... |
There are 8144 of these name/description pairs. In OMOP, there's a "CO-CONNECT TWINS" vocabulary, and 4234 of the PhenotypeNames match a CO-CONNECT TWINS concept. I've retrieved the standard concepts for these non-standard concepts, which gives us a nice example for testing versions of LLettuce: the PhenotypeDescription is exactly the kind of long description that LLettuce is well positioned to parse into standard concepts. By following the PhenotypeDescription -> PhenotypeName -> CO-CONNECT TWINS -> OMOP standard concept chain (a sketch of the join follows the table), I've made a table of:
| PhenotypeDescription | OMOP standard concepts |
| --- | --- |
| About how long did you smoke for in total? - months | ((:relationship "Maps to" :concept "Cigarette smoker")) |
| About how long did you smoke for in total? - years | ((:relationship "Maps to value" :concept "Cigarette smoker")(:relationship "Maps to" :concept "History of event")(:relationship "Maps to" :concept "Currently doesn't use tobacco or its derivatives")) |
| ... | ... |
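For reference, here's a minimal sketch of how that join can be reproduced with pandas, assuming the OMOP `CONCEPT` and `CONCEPT_RELATIONSHIP` tables have been exported as tab-separated files (as Athena provides them); the file names and spreadsheet path are placeholders, not the actual ones used.

```python
import pandas as pd

# Placeholder paths: the "Variables" sheet and a tab-separated OMOP vocabulary export.
variables = pd.read_excel("twinsuk_phenobase.xlsx", sheet_name="Variables")
concept = pd.read_csv("CONCEPT.csv", sep="\t")
concept_relationship = pd.read_csv("CONCEPT_RELATIONSHIP.csv", sep="\t")

# PhenotypeName -> non-standard CO-CONNECT TWINS concept (matched on concept_name)
twins = concept[concept["vocabulary_id"] == "CO-CONNECT TWINS"]
matched = variables.merge(twins, left_on="PhenotypeName", right_on="concept_name", how="inner")

# CO-CONNECT TWINS concept -> standard concept(s) via "Maps to" / "Maps to value"
maps = concept_relationship[
    concept_relationship["relationship_id"].isin(["Maps to", "Maps to value"])
]
mapped = matched.merge(maps, left_on="concept_id", right_on="concept_id_1")
standard = mapped.merge(
    concept.add_suffix("_std"), left_on="concept_id_2", right_on="concept_id_std"
)

# One row per (description, relationship, standard concept); grouping by description
# gives the multi-concept mappings shown in the table above.
pairs = standard[["PhenotypeDescription", "relationship_id", "concept_name_std"]]
```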
Getting LLettuce to predict the right-hand column from the left is what we want to test. The exact format in which the OMOP standard concepts are represented isn't important; it could be JSON or anything else, as long as it can be parsed into a set of relationships to concepts. An important thing to note is that a mapping can be made to multiple concepts.
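As one option, here's a minimal sketch assuming a JSON representation with `relationship` and `concept` fields (those field names are illustrative, not a fixed format):

```python
import json

def parse_mapping(text: str) -> set[tuple[str, str]]:
    """Parse a serialised mapping like
    '[{"relationship": "Maps to", "concept": "Cigarette smoker"}]'
    into a set of (relationship, concept) pairs for comparison."""
    return {(item["relationship"], item["concept"]) for item in json.loads(text)}

# parse_mapping('[{"relationship": "Maps to", "concept": "Cigarette smoker"}]')
# -> {("Maps to", "Cigarette smoker")}
```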
I fine-tuned Flan-T5-small on an 80%/10% train/test split of the dataset. It did OK, given the small size of the model. I calculated the precision, recall, and $F_1$ score of its predictions against the remaining 10% validation (holdout) set.
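For concreteness, this is the shape of the set-based precision/recall/$F_1$ calculation (micro-averaged over the holdout set), assuming predictions and references have been parsed into sets of (relationship, concept) pairs as above; a sketch of the idea rather than the exact script used:

```python
def precision_recall_f1(predictions, references):
    """Micro-averaged precision/recall/F1 over paired prediction/reference sets,
    where each element is a (relationship, concept) tuple."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        tp += len(pred & ref)   # correctly predicted (relationship, concept) pairs
        fp += len(pred - ref)   # predicted but not in the reference mapping
        fn += len(ref - pred)   # in the reference mapping but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```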
A useful comparison to make will be between a fine-tuned Flan-T5 model and Llama 3.1. The steps for this will be:
Now that we have shown LLettuce can work for drug concepts, we need to expand it to non-drug concepts.
This could be as simple as removing references to `domain = 'Drug'` from queries. However, the OMOP vocabularies are large, and querying the whole database will be slow, so we could get NLP to help us narrow things down. Here's a rough scheme:
Estimating the concept class will be less useful; the main thing will be to narrow the search down to the right domain. I would guess class prediction is also harder for an NLP system to achieve, so we can test it, but domain is the priority.
For this to work, we will need a new pipeline.
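Something like the sketch below, which is just the shape of the idea rather than LLettuce's actual implementation: an NLP step guesses the OMOP domain for a source description, and the concept query is restricted to that predicted domain instead of the hard-coded `domain = 'Drug'` filter. The classifier stub, SQL, and placeholders are all assumptions.

```python
from dataclasses import dataclass

# A few of the OMOP domains we'd expect to see for TwinsUK phenotypes (illustrative list).
CANDIDATE_DOMAINS = ["Condition", "Measurement", "Observation", "Procedure", "Drug"]

@dataclass
class ConceptQuery:
    sql: str
    params: tuple

def predict_domain(description: str) -> str:
    """Placeholder for the NLP step (e.g. a small fine-tuned classifier or an LLM
    prompt) that narrows a description down to one OMOP domain."""
    return "Measurement"  # stub prediction

def build_concept_query(description: str) -> ConceptQuery:
    """Build a standard-concept search restricted to the predicted domain, replacing
    the previous hard-coded domain = 'Drug' restriction (Postgres-style placeholders)."""
    domain = predict_domain(description)
    sql = (
        "SELECT concept_id, concept_name, domain_id "
        "FROM concept "
        "WHERE standard_concept = 'S' "
        "AND domain_id = %s "
        "AND concept_name ILIKE %s"
    )
    return ConceptQuery(sql=sql, params=(domain, f"%{description}%"))

if __name__ == "__main__":
    q = build_concept_query("Results for VWF (von Willebrand factor)")
    print(q.sql)
    print(q.params)
```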