Details for this are in notebooks/graph_extraction.ipynb.
To summarize, I wasn't expecting much from this experiment and was surprised by how well it appears to work. The design for it was very simple:
And that was it. I took the json that comes back from this prompt, converted it to networkx, suffered through the usual network visualization nuisances, and was able to quickly get a picture like this:
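For reference, the JSON -> networkx conversion is roughly the following. This is a minimal sketch: the record field names (`subject`, `predicate`, `object`) and the example values are assumptions, not the prompt's actual output format, and the layout/plotting is just the quickest thing that works.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Illustrative records standing in for the JSON returned by the prompt;
# the real field names/values come from the notebook, not from here.
relations = [
    {"subject": "NGLY1", "predicate": "associated with", "object": "NGLY1 deficiency"},
    {"subject": "NGLY1 deficiency", "predicate": "has symptom", "object": "alacrima"},
]

G = nx.DiGraph()
for rel in relations:
    G.add_edge(rel["subject"], rel["object"], predicate=rel["predicate"])

# Quick-and-dirty visualization (the usual networkx nuisances apply)
pos = nx.spring_layout(G, seed=0)
nx.draw_networkx(G, pos, node_size=500, font_size=8)
nx.draw_networkx_edge_labels(
    G, pos, edge_labels=nx.get_edge_attributes(G, "predicate"), font_size=7
)
plt.show()
```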
I'm not certain if this is the best way to categorize the most relevant therapeutic opportunities based on the input text, but it definitely seems plausible. Here is a table showing how the various concepts were classified (`Interventional` = more likely to be disease-modifying, `Management` = more likely to manage symptoms):
Lastly, I ran this analysis multiple times since the prompts are so unconstrained and the variance in the results between runs is pretty high. I cherry-picked what I would rate as the best result above. You can see all of them though in https://github.com/eric-czech/ngly1-gpt/tree/main/ngly1_gpt/notebooks/exports.
A detailed description of the setup, prompts, results, etc. for this is in notebooks/relation_extraction.ipynb.
This prompt summarizes the intent of the experiment well: https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/relation_extraction_1.txt#L1-L116
The predicates and entities used in that prompt are coming from a combination of those mentioned in https://github.com/SuLab/DrugMechDB#curation and the statements/concepts at https://github.com/SuLab/ngly1-graph/tree/master/neo4j-graphs/ngly1-v3.2/import/ngly1, as well as one or two more I added myself (e.g. clinical trials).
The prompt above was run over chunks of text across the two papers manually curated in PMC7153956, i.e. PMC7477955 and PMC4243708.
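For orientation, running that prompt over the paper chunks looks roughly like the sketch below. Assumptions: the legacy (`<1.0`) openai-python client, a naive character-based chunker standing in for the repo's actual chunking, the chunk simply appended to the prompt, and a JSON array coming back; the repo's real prompt interpolation and output parsing may differ, and the `data/extract/PMC4243708.txt` path is assumed to follow the same pattern as the PMC7477955 extract linked later.

```python
import json
from pathlib import Path

import openai  # legacy (<1.0) openai-python client interface assumed


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 500):
    """Naive fixed-size character chunking with overlap (stand-in for the repo's chunker)."""
    step = max_chars - overlap
    return [text[i : i + max_chars] for i in range(0, len(text), step)]


prompt = Path("ngly1_gpt/resources/prompts/relation_extraction_1.txt").read_text()

relations = []
for pmcid in ["PMC7477955", "PMC4243708"]:
    text = Path(f"data/extract/{pmcid}.txt").read_text()  # path pattern assumed
    for chunk in chunk_text(text):
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{prompt}\n\nText:\n{chunk}"}],
            temperature=0,
        )
        # Assumes the prompt asks for a JSON array of relation records back
        relations.extend(json.loads(response["choices"][0]["message"]["content"]))
```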
Specific examples of input text -> output relations look quite encouraging. Cherry-picking a couple of these (with more in the notebook mentioned above), I see that messy inputs like this (from PMC7477955) are handled as expected:
Same here:
Looking more broadly at the frequency of different co-occurring entities in the relations across both papers, i.e. all extractions like those above, I see a picture like:
This is more or less what I would have expected too. This experiment doesn't include grounding or comparisons to external datasets, so I won't read into this too much, but it does demonstrate that nearly all of the entities suggested in the prompt are recognized to some extent. It also shows that I am not getting back entity types (i.e. subject/object types) that I did not ask for.
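For reference, co-occurrence counts like these can be tallied with a simple groupby over the extracted relations. The sketch below assumes each relation record carries subject/object type fields; the field names and example rows are illustrative, not the actual extraction output.

```python
import pandas as pd

# Illustrative records; the real ones carry the subject/object types returned by GPT-4
relations = pd.DataFrame([
    {"subject_type": "disease", "object_type": "symptom"},
    {"subject_type": "disease", "object_type": "phenotype"},
    {"subject_type": "genetic variant", "object_type": "disease"},
    {"subject_type": "disease", "object_type": "symptom"},
])

# Count co-occurring (subject type, object type) pairs across all extracted relations
cooccurrence = (
    relations.groupby(["subject_type", "object_type"]).size().unstack(fill_value=0)
)
print(cooccurrence)
```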
The resulting predicates, on the other hand, are more unruly. Here are relation counts by predicate and whether or not I was "expecting" them, i.e. expected => explicitly stated in the prompt and unexpected => GPT4 decided to include them on its own:
Here are all of the "unexpected" predicates with their associated concepts:
And here are a few I would say imply useful potential improvements to the prompt:
- `postulated to play a role in` -> Charcot-Marie-Tooth disease
- `not associated with` -> ocular apraxia
  > [PMC7477955] In addition, we could not confirm abnormal storage material in the three liver biopsy specimens we analyzed, nor did we observe ocular apraxia in our individuals, despite performing detailed ophthalmologic evaluations that involved some subjects previously reported to have ocular apraxia.
- `not associated with` -> lipodystrophy
  > [From PMC4243708] Unlike CDGs, NGLY1 deficiency does not appear to be associated with cerebellar atrophy, lipodystrophy, or significant heart manifestations.
I would speculate that one potential explanation for why it keeps giving me predicates I didn't ask for is that I am prompting it to extract ALL relations without telling it what to do when it encounters a subject + object in my list of subject/object types and a predicate relating them that is not in my list of specific predicates. It might be possible to resolve that contention by telling it to drop those cases. They're helpful though IMO, and easy enough to filter out post-hoc.
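For instance, a post-hoc filter along these lines would separate the two groups (the predicate list here is an illustrative subset, not the actual list from the prompt):

```python
# Predicates explicitly listed in the prompt (illustrative subset only)
EXPECTED_PREDICATES = {"associated with", "treats", "causes", "studied in"}


def split_by_expectation(relations):
    """Separate relations whose predicate was requested in the prompt from the
    'unexpected' ones GPT-4 added on its own, so the latter can be reviewed or dropped."""
    expected = [r for r in relations if r["predicate"] in EXPECTED_PREDICATES]
    unexpected = [r for r in relations if r["predicate"] not in EXPECTED_PREDICATES]
    return expected, unexpected
```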
This is probably my favorite result from this experiment because it immediately suggested that it's possible to start answering a number of important questions about genetic disease etiology that I hadn't thought about when designing the prompt. The table below shows a subset of the relations extracted where either the subject or the object is a `genetic variant` or `protein variant`.
Here are all the distinct types of relations in this table, i.e. if you omit the specific subject/object:
I also looked more closely at any relations between `NGLY1 deficiency` and another entity falling into the `phenotype`, `disease`, or `symptom` categories. There are 290 of these terms identified across both papers and, as far as I can tell so far, the vast majority of them are correctly attributed. Grounding them and comparing them quantitatively to the phenotypes from PMC7153956 (i.e. https://github.com/SuLab/ngly1-graph/tree/master/neo4j-graphs/ngly1-v3.2/import/ngly1) should be possible. That would certainly help say more about accuracy.
More qualitatively though, it's clear that these terms are all (or very close to all) correctly being classified as phenotypes/diseases/symptoms by GPT4. The sheer volume and specificity is also encouraging:
These categorizations of the terms (e.g. `Metabolic Disorders`, `Brain & Neurological Conditions`) come from dumping all 290 terms into a prompt, asking GPT4 to define some number of categories for them, and then having it return a mapping from the terms to the categories. You can see the prompt for this here: https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/phenotype_categories_1.txt#L1-L20

I think that's a good use case for LLMs in problems like this -- they are definitely good at building taxonomies/ontologies.
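As a rough sketch of that step (again assuming the legacy openai-python client, that the terms are appended to the prompt as a plain list, and that a JSON term -> category mapping comes back; none of that is guaranteed to match the repo exactly):

```python
import json
from pathlib import Path

import openai  # legacy (<1.0) client interface assumed

prompt = Path("ngly1_gpt/resources/prompts/phenotype_categories_1.txt").read_text()

# `terms` stands in for the 290 phenotype/disease/symptom strings collected earlier
terms = ["developmental delay", "elevated liver transaminases", "hypo- or alacrima"]
message = prompt + "\n\nTerms:\n" + "\n".join(f"- {t}" for t in terms)

response = openai.ChatCompletion.create(
    model="gpt-4", messages=[{"role": "user", "content": message}], temperature=0
)
# Assumes the response is a JSON object mapping each term to a category name
term_to_category = json.loads(response["choices"][0]["message"]["content"])
```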
Anyways, the dropdowns below show the full list of categories and visualizations of more terms for them:
This experiment was a lot more involved than the others and attempted to extract as much detail as possible on an individual patient level from the same 2 papers as PoC 2. If something like this was deployed on a larger scale, it could enable some very interesting use cases like phenotype frequency inference and automated rare disease meta-analysis (among others). I wanted to show that it works on a smaller scale first since there is plenty of complexity inherent to how those 2 papers present patient-specific data.
Overall, my read is that this works quite well and modulo some likely solvable problems, I'm quite happy with how little code this took and the potential it implies.
Here is a visual outline for the pipeline:
Prompts in the outline:
A few other high-level notes:
The extraction runs in two steps: 1) pull patient-level details out of each chunk of text (keeping a consistent `patient_accession` everywhere) and 2) merge those details into a single record for each patient (the bookkeeping for that merge is sketched just below). These logs are a good indicator of how well an LLM can attribute details to individual patients: extract_patients.log.txt.
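A minimal sketch of that grouping step is below. The field names (`patient_accession`, `detail`) and the example details are assumptions about the intermediate format rather than the repo's exact shape, and the merge of each group into one record is itself an LLM step not shown here.

```python
from collections import defaultdict


def group_details_by_patient(details):
    """Group chunk-level details by patient_accession so each group can be merged
    into a single patient record downstream."""
    by_patient = defaultdict(list)
    for d in details:
        by_patient[d["patient_accession"]].append(d["detail"])
    return dict(by_patient)


# Illustrative input shape
details = [
    {"patient_accession": "P1", "detail": "Male, diagnosed in early childhood"},
    {"patient_accession": "P1", "detail": "Seizures reported; EEG performed"},
    {"patient_accession": "P2", "detail": "Bone age delayed"},
]
print(group_details_by_patient(details))
```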
These two papers end up providing a nice contrast because PMC7477955 does a relatively poor job of associating data to specific patients in the main text and favors summaries more like "Bone age was delayed in eight of the 11 subjects". There are a number of supplementary tables with more individual-level data, but by and large PMC8997433 is much better at associating findings with individuals in both a structured and unstructured fashion.
On the structured front, here is an example from the log of a table (Table 1) in PMC8997433 and how it is converted into essentially unstructured text. I'm later using these unstructured `details` to populate the patient records:
I was worried about this round-trip from table -> text -> json, and rightfully so: it does introduce some errors that I think could probably have been avoided. You can see more on that in the Table replication section below. If I did this again, I would definitely redesign this step by looking for a token-efficient output format that better supports a mix of nested, structured, and unstructured data. YAML may be a better choice.
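For example, a single table row rendered as YAML rather than free text might look like the snippet below. The field names and values are made up for illustration, not taken from Table 1.

```python
import yaml  # PyYAML

# Hypothetical row, just to show the nested/structured + unstructured mix YAML handles well
row = {
    "patient_accession": "P1",
    "sex": "M",
    "age_at_diagnosis": "3y",
    "findings": {"alacrima": True, "elevated_transaminases": True},
    "notes": "Developmental delay noted in early childhood",
}
print(yaml.safe_dump(row, sort_keys=False))
```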
Once all the "details" are collected, I am randomly sampling them and providing them to the LLM with a prompt to generate a JSON Schema. This is done with a single completion and here is a log showing the prompt + response: infer_patients_schema.log.txt.
And here is the kind of schema it comes up with:
I think this is a fiddly bit that would be best to do without in the future, in favor of just manually building a schema or drawing on schemas from something like MIMIC2 or phenopacket. The latter is particularly relevant given:
The goal of the phenopacket-schema is to define the phenotypic description of a patient/sample in the context of rare disease, common/complex disease, or cancer. The schema as well as source code in Java, C++, and Python is available from the phenopacket-schema GitHub repository.
Reminder to self: Talk to @pnrobinson about the state of phenopacket again.
All of the extracted records are in here: https://github.com/eric-czech/ngly1-gpt/blob/e66f513f39a3a8d02a63615f9e7f4a57fa8e76ef/data/output/patients.json#L1-L22
A random sample of 4 of them looks like this:
The differences in sparsity between the two studies might look concerning at first glance, but they do make sense when you dig into the papers. PMC7477955 doesn't describe most phenotypes for patients in a way that makes it possible to track them back to individuals. It does, though, give a bunch of supplementary tables with lab panels, EEG findings, nerve conduction studies, etc. Here is one on seizures and EEGs, which is almost certainly why the LLM chose to create a separate `seizure_info` field: https://github.com/eric-czech/ngly1-gpt/blob/e66f513f39a3a8d02a63615f9e7f4a57fa8e76ef/data/extract/PMC7477955.txt#L148-L163
One of the first things I wanted to see is whether or not it was possible to at least get accurate extractions of details like this across studies:
Being able to pull out demographics/mutations like this could obviously speed up writing rare disease review papers a ton, at the very least.
I don't have much more to say on this other than that I've reviewed every piece of information there and didn't find any mistakes for any patients in either study. So it's a good start!
Inferring the frequency of phenotypes also appears to work reasonably well. I'll note that this includes pieces of information like the following, where individual attribution is not possible (from PMC4243708):
Other common findings included hypo- or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8)
This means that there may be multiple estimates of frequencies for the same phenotype if the text and tables incorporate the same information. That's shown below in the `all_frequencies` field, which gives the `patient count / total patients` ratio for any one occurrence of a frequency estimate:
An important omission from this is that I'm using the same denominator (i.e. total patient count across a study) for all frequencies. That's certainly not always the case since some assessments/measurements aren't made for all patients, e.g. in the excerpt I included above. I think that could probably be accounted for in the future without too much trouble though.
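The bookkeeping behind that is roughly the following. The counts are lifted from the excerpts quoted above (hypo-/alacrima and microcephaly from PMC4243708, "eight of the 11 subjects" with delayed bone age from PMC7477955), but the table is still just an illustration of the shared-denominator simplification, not actual pipeline output.

```python
import pandas as pd

# One row per occurrence of a frequency estimate (text or table)
mentions = pd.DataFrame([
    {"study": "PMC4243708", "phenotype": "hypo- or alacrima", "patient_count": 7},
    {"study": "PMC4243708", "phenotype": "microcephaly", "patient_count": 6},
    {"study": "PMC7477955", "phenotype": "delayed bone age", "patient_count": 8},
])
total_patients = {"PMC4243708": 8, "PMC7477955": 11}

# Always divide by the per-study total -- exactly the simplification called out above,
# which ignores that some assessments were only made for a subset of patients.
mentions["frequency"] = mentions["patient_count"] / mentions["study"].map(total_patients)
all_frequencies = mentions.groupby(["study", "phenotype"])["frequency"].apply(list)
print(all_frequencies)
```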
In order to say more about validity/accuracy, I checked some of the tables in the papers to see how well reproductions of them from the individual patient records lined up. I think this figure sums that up pretty well:
See the detailed notes in the dropdown below for what is incorrect for PMC4243708: Table 1.
The two strategies that come to mind first that might mitigate some of the errors are:
There are 3 experiments I want to run in this repo to gather some information on building/managing structured information extraction pipelines with LLMs (i.e. GPT4) over rare disease literature. These are inspired in large part by Structured reviews for data and knowledge-driven research (2020):
Motivation
My current thinking is that relation extraction (RE) and named entity recognition (NER) in biomedical literature are still typically more accurate with the previous generation of fine-tuned, BERT-based models, e.g. Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. That paper compared PubMedBERT to GPT4 on common tasks in this space and showed GPT4 was comparable, but not as good. So why run these experiments? My experience with models like PubMedBERT is that while they do perform well on broad benchmarks, the fidelity just isn't there when you zoom in on what they give you for a highly specific area of biology. My interest is in demonstrating that zero- or few-shot prompts and in-context learning can be tuned to a more specific task, such as knowledge graph construction for rare disease biology, to some net-positive effect. I also suspect that priming these extraction tasks with a downstream application, e.g. identifying candidate genetic disease mechanisms for therapeutic intervention, could be useful since there are so many ways to build a knowledge graph and no one way of doing it solves every problem.
I also believe that the best LLMs now have the capability to accurately extract individual rare disease patient details at scale, which to my knowledge has not been significantly automated before.
That said, the experiments and their results are summarized in separate posts below: