Health-Informatics-UoN / lettuce


Pipeline for non-drug concepts #40

Open kuraisle opened 2 months ago

kuraisle commented 2 months ago

Now that we have shown that LLettuce can work for drug concepts, we need to expand to non-drug concepts.

This could be as simple as removing references to "domain = 'Drug'" from queries. However, the OMOP vocabularies are large, and querying the whole database will be slow. NLP could help us narrow the search. Here's a rough scheme:

```mermaid
graph TD
    input --> domain[Guess the domain of input]
    domain --> class_m[Guess the class of input]
    class_m --> search[Semantic search]
    search --> search_res_class{Acceptable guess?}
    search_res_class --Yes--> User
    search_res_class --No--> search_no_class{Search - no class}
    search_no_class -- Acceptable --> User
    search_no_class -- No --> search_no_domain{Search - no domain}
    search_no_domain --Acceptable--> User
    search_no_domain --No--> Surrender
```

Estimating the class will be less useful - the main thing will be narrowing the input down to a domain. I would also guess that class prediction is harder for an NLP system to achieve, so we can test it, but it's a lower priority.

For this to work, we will need a new pipeline.
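The cascade in the diagram above could be sketched roughly as below. This is only an illustration, not existing Lettuce code: `guess_domain`, `guess_class`, and the injected `search` callable are all assumed interfaces, and the threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    concept: str
    score: float


def cascade_search(
    query: str,
    domain: Optional[str],
    concept_class: Optional[str],
    search: Callable[..., list],
    threshold: float = 0.8,
) -> Optional[Hit]:
    """Try progressively broader semantic searches, as in the diagram:
    domain+class first, then domain only, then the whole vocabulary."""
    filter_sets = [
        {"domain": domain, "concept_class": concept_class},  # narrowest
        {"domain": domain},                                  # drop the class filter
        {},                                                  # drop the domain too
    ]
    for filters in filter_sets:
        hits = search(query, **filters)
        if hits and hits[0].score >= threshold:
            return hits[0]
    return None  # surrender


# Toy search that only succeeds once the class filter is dropped
def toy_search(query, **filters):
    if "concept_class" in filters:
        return []
    return [Hit("Cigarette smoker", 0.91)]


best = cascade_search("smoking duration", "Observation", "Clinical Finding", toy_search)
```

Injecting the search function keeps the fallback logic testable independently of the vector store behind it.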

kuraisle commented 2 months ago

Using Co-connect twins as an example

The dataset

We have a dataset to use as an example for adding non-drug domains to LLettuce's use case. This is the TwinsUK phenobase. The part of this that's interesting for us is the "Variables" sheet of a spreadsheet I was sent. Within this, there are two columns of interest:

| PhenotypeName | PhenotypeDescription |
| --- | --- |
| Sensitivity to Allergens | Score of subjects' sensitivities to 112 allergen components using frozen serum. |
| Clotting factor | Results for VWF (von Willebrand factor) |
| ... | ... |

There are 8144 of these name/description pairs. In OMOP, there's a "CO-CONNECT TWINS" vocabulary. 4234 of the PhenotypeNames match a CO-CONNECT TWINS concept. I've retrieved the standard concepts for these non-standard concepts. This provides a nice example for us to test versions of LLettuce. The PhenotypeDescription is the kind of long description of something that LLettuce is well positioned to parse into standard concepts. By making this PhenotypeDescription -> PhenotypeName -> CO-CONNECT TWINS -> OMOP standard concept chain, I've made a table of:

| PhenotypeDescription | OMOP standard concepts |
| --- | --- |
| About how long did you smoke for in total? - months | ((:relationship "Maps to" :concept "Cigarette smoker")) |
| About how long did you smoke for in total? - years | ((:relationship "Maps to value" :concept "Cigarette smoker")(:relationship "Maps to" :concept "History of event")(:relationship "Maps to" :concept "Currently doesn't use tobacco or its derivatives")) |
| ... | ... |
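The PhenotypeDescription -> PhenotypeName -> CO-CONNECT TWINS -> standard concept chain could be assembled roughly as follows. All the data here are made up for illustration (the standard concept name is invented); in practice the two lookups would come from the OMOP `concept` and `concept_relationship` tables.

```python
# Toy join over the three links of the chain described above
variables = [
    {"name": "Clotting factor",
     "description": "Results for VWF (von Willebrand factor)"},
]

# non-standard CO-CONNECT TWINS concepts, keyed by concept_name
coconnect = {"Clotting factor": 100}

# "Maps to" edges: non-standard concept_id -> standard concept names
maps_to = {100: ["Von Willebrand factor measurement"]}  # invented name

table = {
    v["description"]: [
        ("Maps to", std)
        for std in maps_to.get(coconnect.get(v["name"], -1), [])
    ]
    for v in variables
}
```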

Getting LLettuce to predict the right-hand column from the left is what we want to test. The exact format in which the OMOP standard concepts are represented isn't important; it could be JSON or whatever, as long as it can be parsed into a set of relationships to concepts. An important thing to note is that a mapping can be made to multiple concepts.

Preliminary test

I fine-tuned Flan-T5-small on an 80%/10% train/test split of the dataset. It did OK, given the small size of the model. I calculated the precision, recall, and $F_1$ score against the remaining 10% validation (or holdout) set.
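Since a mapping can contain multiple concepts, the metrics are naturally set-based per example. A minimal sketch (not the evaluation code actually used):

```python
def prf1(predicted: set, actual: set):
    """Set-based precision, recall, and F1 for one example."""
    if not predicted and not actual:
        return 1.0, 1.0, 1.0
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# One over-predicted concept: precision drops, recall stays perfect
p, r, f = prf1({"Cigarette smoker", "History of event"}, {"Cigarette smoker"})
```

Averaging these per-example scores over the holdout set gives the reported figures.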

Future direction

A useful comparison to make will be between a fine-tuned Flan-T5 model and Llama 3.1. The steps for this will be: