kuraisle opened this issue 2 months ago
We have a dataset to use as an example for adding non-drug domains to LLettuce's use case: the TwinsUK phenobase. The part of this that's interesting for us is the "Variables" sheet of a spreadsheet I was sent, which has two columns of interest:
| PhenotypeName | PhenotypeDescription |
| --- | --- |
| Sensitivity to Allergens | Score of subjects' sensitivities to 112 allergen components using frozen serum. |
| Clotting factor | Results for VWF (von Willebrand factor) |
| ... | ... |
There are 8144 of these name/description pairs. In OMOP, there's a "CO-CONNECT TWINS" vocabulary, and 4234 of the PhenotypeNames match a CO-CONNECT TWINS concept. I've retrieved the standard concepts for these non-standard concepts, which gives us a nice example for testing versions of LLettuce: the PhenotypeDescription is exactly the kind of long description that LLettuce is well positioned to parse into standard concepts. By following the PhenotypeDescription -> PhenotypeName -> CO-CONNECT TWINS -> OMOP standard concept chain (a sketch of the join follows the table), I've made a table of:
| PhenotypeDescription | OMOP standard concepts |
| --- | --- |
| About how long did you smoke for in total? - months | ((:relationship "Maps to" :concept "Cigarette smoker")) |
| About how long did you smoke for in total? - years | ((:relationship "Maps to value" :concept "Cigarette smoker")(:relationship "Maps to" :concept "History of event")(:relationship "Maps to" :concept "Currently doesn't use tobacco or its derivatives")) |
| ... | ... |
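For reference, here's a minimal sketch of how that join can be reproduced with pandas, assuming the OMOP `CONCEPT` and `CONCEPT_RELATIONSHIP` tables have been exported as tab-separated files (as Athena provides them); the file names and spreadsheet path are placeholders, not the actual ones used.

```python
import pandas as pd

# Placeholder paths: the "Variables" sheet and a tab-separated OMOP vocabulary export.
variables = pd.read_excel("twinsuk_phenobase.xlsx", sheet_name="Variables")
concept = pd.read_csv("CONCEPT.csv", sep="\t")
concept_relationship = pd.read_csv("CONCEPT_RELATIONSHIP.csv", sep="\t")

# PhenotypeName -> non-standard CO-CONNECT TWINS concept (matched on concept_name)
twins = concept[concept["vocabulary_id"] == "CO-CONNECT TWINS"]
matched = variables.merge(twins, left_on="PhenotypeName", right_on="concept_name", how="inner")

# CO-CONNECT TWINS concept -> standard concept(s) via "Maps to" / "Maps to value"
maps = concept_relationship[
    concept_relationship["relationship_id"].isin(["Maps to", "Maps to value"])
]
mapped = matched.merge(maps, left_on="concept_id", right_on="concept_id_1")
standard = mapped.merge(
    concept.add_suffix("_std"), left_on="concept_id_2", right_on="concept_id_std"
)

# One row per (description, relationship, standard concept); grouping by description
# gives the multi-concept mappings shown in the table above.
pairs = standard[["PhenotypeDescription", "relationship_id", "concept_name_std"]]
```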
Getting LLettuce to predict the right-hand column from the left is what we want to test. The exact format in which the OMOP standard concepts are represented isn't important; it could be JSON or anything else, as long as it can be parsed into a set of relationships to concepts. An important thing to note is that a mapping can be made to multiple concepts.
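As one option, here's a minimal sketch assuming a JSON representation with `relationship` and `concept` fields (those field names are illustrative, not a fixed format):

```python
import json

def parse_mapping(text: str) -> set[tuple[str, str]]:
    """Parse a serialised mapping like
    '[{"relationship": "Maps to", "concept": "Cigarette smoker"}]'
    into a set of (relationship, concept) pairs for comparison."""
    return {(item["relationship"], item["concept"]) for item in json.loads(text)}

# parse_mapping('[{"relationship": "Maps to", "concept": "Cigarette smoker"}]')
# -> {("Maps to", "Cigarette smoker")}
```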
I fine-tuned Flan-T5-small on an 80%/10% train/test split of the dataset. It did OK, given the small size of the model. I calculated the precision, recall, and $F_1$ score of its predictions against the remaining 10% validation (holdout) set.
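For concreteness, this is the shape of the set-based precision/recall/$F_1$ calculation (micro-averaged over the holdout set), assuming predictions and references have been parsed into sets of (relationship, concept) pairs as above; a sketch of the idea rather than the exact script used:

```python
def precision_recall_f1(predictions, references):
    """Micro-averaged precision/recall/F1 over paired prediction/reference sets,
    where each element is a (relationship, concept) tuple."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        tp += len(pred & ref)   # correctly predicted (relationship, concept) pairs
        fp += len(pred - ref)   # predicted but not in the reference mapping
        fn += len(ref - pred)   # in the reference mapping but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```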
A useful comparison to make will be between a fine-tuned Flan-T5 model and Llama 3.1. The steps for this will be:
Now that we have shown LLettuce can work for drug concepts, we need to expand it to non-drug concepts.
This could be as simple as removing references to `domain = 'Drug'` from queries. However, the OMOP vocabularies are large, and querying the whole database will be slow, so we could get NLP to help us narrow things down. Here's a rough scheme:
Estimating the concept class will be less useful; the main thing will be to narrow the search down to the right domain. I would guess class prediction is also harder for an NLP system to achieve, so we can test it, but domain is the priority.
For this to work, we will need a new pipeline.
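Something like the sketch below, which is just the shape of the idea rather than LLettuce's actual implementation: an NLP step guesses the OMOP domain for a source description, and the concept query is restricted to that predicted domain instead of the hard-coded `domain = 'Drug'` filter. The classifier stub, SQL, and placeholders are all assumptions.

```python
from dataclasses import dataclass

# A few of the OMOP domains we'd expect to see for TwinsUK phenotypes (illustrative list).
CANDIDATE_DOMAINS = ["Condition", "Measurement", "Observation", "Procedure", "Drug"]

@dataclass
class ConceptQuery:
    sql: str
    params: tuple

def predict_domain(description: str) -> str:
    """Placeholder for the NLP step (e.g. a small fine-tuned classifier or an LLM
    prompt) that narrows a description down to one OMOP domain."""
    return "Measurement"  # stub prediction

def build_concept_query(description: str) -> ConceptQuery:
    """Build a standard-concept search restricted to the predicted domain, replacing
    the previous hard-coded domain = 'Drug' restriction (Postgres-style placeholders)."""
    domain = predict_domain(description)
    sql = (
        "SELECT concept_id, concept_name, domain_id "
        "FROM concept "
        "WHERE standard_concept = 'S' "
        "AND domain_id = %s "
        "AND concept_name ILIKE %s"
    )
    return ConceptQuery(sql=sql, params=(domain, f"%{description}%"))

if __name__ == "__main__":
    q = build_concept_query("Results for VWF (von Willebrand factor)")
    print(q.sql)
    print(q.params)
```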