eric-czech / ngly1-gpt


Summarize proof of concept analyses #1

Closed eric-czech closed 9 months ago

eric-czech commented 1 year ago

There are 3 experiments I want to run in this repo to gather some information on building/managing structured information extraction pipelines with LLMs (i.e. GPT4) over rare disease literature. These are inspired in large part by Structured reviews for data and knowledge-driven research (2020):

  1. [PoC 1] NGLY1 deficiency graph extraction: Extract entities and relations from a small chunk of highly informative text with very little direction provided to an LLM
  2. [PoC 2] NGLY1 deficiency relation extraction: Extract specific entities and relations between them from 2 full-text papers with a very detailed prompt
  3. [PoC 3] NGLY1 deficiency patient extraction: Extract individual patient clinical, demographic, genetic, etc. characteristics; also uses this information to estimate phenotype frequencies and compare demographics/mutations across studies

Motivation

My current thinking is that relation extraction (RE) and named entity recognition (NER) in biomedical literature are still typically more accurate with the previous generation of fine-tuned, BERT-based models, e.g. Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. That paper compared PubMedBERT to GPT4 on common tasks in this space and showed GPT4 was comparable, but not as good. So why run these experiments? My experience with results from models like PubMedBERT is that, though they perform well on broad benchmarks, the fidelity just isn't there when you zoom in on what they give you for a highly specific area of biology. My interest is in demonstrating that zero- or few-shot prompts and in-context learning can be tuned to a more specific task, such as knowledge graph construction for rare disease biology, to some net-positive effect. I also suspect that priming these extraction tasks with a downstream application, e.g. identifying candidate genetic disease mechanisms for therapeutic intervention, could be useful since there are so many ways to build a knowledge graph and no one way of doing it solves every problem.

I also believe that great LLMs have the capability now to accurately extract individual rare disease patient details at scale, which to my knowledge has not been significantly automated before.


That said, the experiments and their results are summarized in separate posts below:

eric-czech commented 1 year ago

[PoC 1] NGLY1 deficiency graph extraction

Details for this are in notebooks/graph_extraction.ipynb.

To summarize, I wasn't expecting much from this experiment and was surprised by how well it appears to work. The design for it was very simple:

  1. Pick a single, representative chunk of text that captures a lot of information on the pathophysiology of NGLY1 deficiency:

https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/data/extract/PMC7477955.txt#L119-L128

  2. Write a prompt to solicit a list of node and edge types for a knowledge graph to be used in accomplishing a task, i.e. identifying therapeutic opportunities:

https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/graph_extraction_1.txt#L1-L9

  3. Use this description of the graph along with the original text to create a literal graph as node-link JSON:

https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/graph_extraction_2.txt#L1-L39

And that was it. I took the JSON that comes back from this prompt, converted it to a networkx graph, suffered through the usual network visualization nuisances, and was quickly able to get a picture like this:

[Figure: ngly1_deficiency_graph_annot, the annotated NGLY1 deficiency graph]
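For reference, the JSON -> networkx step is roughly this (a minimal sketch; the file path, node-link field layout, and plotting choices are assumptions rather than the notebook's exact code):

```python
import json

import matplotlib.pyplot as plt
import networkx as nx
from networkx.readwrite import json_graph

# Load the node-link JSON returned by the graph extraction prompt (path is illustrative)
with open("graph.json") as f:
    data = json.load(f)

# Node-link format: {"nodes": [{"id": ...}, ...], "links": [{"source": ..., "target": ...}, ...]}
G = json_graph.node_link_graph(data, directed=True)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# The usual visualization nuisances: layout, label sizes, overlapping edges, etc.
pos = nx.spring_layout(G, seed=0)
nx.draw_networkx(G, pos=pos, node_size=300, font_size=6)
plt.axis("off")
plt.savefig("ngly1_deficiency_graph.png", dpi=200)
```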

I'm not certain that this is the best way to categorize the most relevant therapeutic opportunities based on the input text, but it definitely seems plausible. Here is a table showing how the various concepts were classified (Interventional = more likely to be disease-modifying, Management = more likely to manage symptoms):

[Screenshot: table of concept classifications (Interventional vs. Management)]

Lastly, I ran this analysis multiple times since the prompts are so unconstrained and the variance in the results between runs is pretty high. I cherry-picked what I would rate as the best result above. You can see all of them though in https://github.com/eric-czech/ngly1-gpt/tree/main/ngly1_gpt/notebooks/exports.

eric-czech commented 1 year ago

[PoC 2] NGLY1 deficiency relation extraction

Overview

A detailed description of the setup, prompts, results, etc. for this is in notebooks/relation_extraction.ipynb.

This prompt summarizes the intent of the experiment well: https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/relation_extraction_1.txt#L1-L116

The predicates and entities used in that prompt come from a combination of those mentioned in https://github.com/SuLab/DrugMechDB#curation and the statements/concepts at https://github.com/SuLab/ngly1-graph/tree/master/neo4j-graphs/ngly1-v3.2/import/ngly1, plus one or two more I added myself (e.g. clinical trials).

The prompt above was run over chunks of text from the two papers that were manually curated in PMC7153956, i.e. PMC7477955 and PMC4243708.
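For context, "running the prompt over chunks" amounts to something like the sketch below. This assumes the openai v1 Python client and a naive paragraph-based chunker, and it simply appends each chunk to the prompt; the actual utilities in this repo likely differ.

```python
from pathlib import Path

from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = Path("ngly1_gpt/resources/prompts/relation_extraction_1.txt").read_text()


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Naive chunker: accumulate paragraphs until a character budget is hit."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks


def extract_relations(doc_path: str) -> list[str]:
    """Run the relation extraction prompt over each chunk of a full-text paper."""
    completions = []
    for chunk in chunk_text(Path(doc_path).read_text()):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": PROMPT + "\n\nText:\n" + chunk}],
        )
        completions.append(response.choices[0].message.content)
    return completions
```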

Results

Entities/Relations

Specific examples of input text -> output relations look quite encouraging. Cherry-picking a couple of these (with more in the notebook mentioned above), I see that messy inputs like this (from PMC7477955) are handled as expected:

[Screenshots: example input text from PMC7477955 and the relations extracted from it]

Same here:

[Screenshot: another example of input text and extracted relations]

Looking more broadly at the frequency of different co-occurring entities in the relations across both papers, i.e. all extractions like those above, I see a picture like this:

[Screenshot: frequencies of co-occurring entity types across extracted relations]

This is more or less what I would have expected too. This experiment doesn't include grounding or comparisons to external datasets, so I won't read into this too much, but it does demonstrate that nearly all of the entities suggested in the prompt are recognized to some extent. It also shows that I am not getting back entity types (i.e. subject/object types) that I did not ask for.
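For reference, that co-occurrence tally is just a cross-tabulation once the relations are in a table; a sketch with assumed column names and made-up rows:

```python
import pandas as pd

# Assumed layout: one row per extracted relation with typed subject/object
relations = pd.DataFrame(
    [
        {"subject_type": "disease", "object_type": "phenotype"},
        {"subject_type": "gene", "object_type": "disease"},
        {"subject_type": "disease", "object_type": "phenotype"},
    ]
)

# Frequency of co-occurring entity types across all extracted relations
print(pd.crosstab(relations["subject_type"], relations["object_type"]))
```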

The resulting predicates, on the other hand, are more unruly. Here are relation counts by predicate and whether or not I was "expecting" them, i.e. expected => explicitly stated in the prompt and unexpected => GPT4 decided to include them on its own:

[Screenshot: relation counts by predicate, split by expected vs. unexpected]

Here are all of the "unexpected" predicates with their associated concepts:

[Screenshot: "unexpected" predicates and their associated concepts]

And here are a few I would say imply useful potential improvements to the prompt:

I would speculate that one potential explanation for why it keeps giving me predicates I didn't ask for is that I am prompting it to extract ALL relations without telling it what to do when it encounters a subject and object that are in my list of subject/object types but whose relating predicate is not in my list of specific predicates. It might be possible to resolve that tension by telling it to drop those cases. They're helpful though, IMO, and easy enough to filter out post-hoc.
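That post-hoc filter is trivial if the relations are kept in a DataFrame; a sketch (the predicate list and column names here are illustrative, not the prompt's actual vocabulary):

```python
import pandas as pd

# Subset of predicates that were explicitly requested in the prompt (illustrative)
EXPECTED_PREDICATES = {"causes", "treats", "presents", "associated with"}

relations = pd.DataFrame(
    [
        {"subject": "NGLY1 mutation", "predicate": "causes", "object": "NGLY1 deficiency"},
        {"subject": "NGLY1 deficiency", "predicate": "impairs", "object": "deglycosylation"},
    ]
)

# Flag rather than drop unexpected predicates so they can still be reviewed later
relations["expected"] = relations["predicate"].isin(EXPECTED_PREDICATES)
print(relations[relations["expected"]])
```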

Genetics

This is probably my favorite result from this experiment because it immediately suggested that it's possible to start answering a number of important questions about genetic disease etiology that I hadn't thought about when designing the prompt. The table below shows a subset of the extracted relations where either the subject or the object is a genetic or protein variant.

[Screenshot: extracted relations involving genetic or protein variants]

Here are all the distinct types of relations in this table, i.e. if you omit the specific subject/object:

Phenotypes

I also looked more closely at any relations between NGLY1 deficiency and another entity falling into the phenotype, disease, or symptom categories. There are 290 such terms identified across both papers and, as far as I can tell so far, the vast majority of them are correctly attributed. Grounding them and comparing them quantitatively to the phenotypes from PMC7153956 (i.e. https://github.com/SuLab/ngly1-graph/tree/master/neo4j-graphs/ngly1-v3.2/import/ngly1) should be possible. That would certainly help say more about accuracy.

More qualitatively, though, it's clear that these terms are all (or very nearly all) being correctly classified as phenotypes/diseases/symptoms by GPT4. The sheer volume and specificity are also encouraging:

[Screenshot: extracted phenotype/disease/symptom terms grouped by category]

These categorizations of the terms (e.g. Metabolic Disorders, Brain & Neurological Conditions) come from dumping all 290 terms into a prompt and asking GPT4 to define some number of categories for them and then return a mapping from the terms to the categories. You can see the prompt for this here: https://github.com/eric-czech/ngly1-gpt/blob/5bcaaad4771b70e6f7f0f11c5a03db2f0a44d68d/ngly1_gpt/resources/prompts/phenotype_categories_1.txt#L1-L20

I think that's a good LLM use case in problems like this -- they are definitely good at building taxonomies/ontologies.

Anyways, the dropdowns below show the full list of categories and visualizations of more terms for them:

More term wordclouds: ![ngly1_phenotype_wc](https://github.com/eric-czech/ngly1-gpt/assets/6130352/3d85ca38-e3ae-4d62-8ae4-5b0a2902f573)
Term frequency by category:

category | n_terms
---------|--------
Eye Problems | 44
Brain & Neurological Conditions | 36
Motor Disorders | 34
Liver Disease | 25
Metabolic Disorders | 22
Musculoskeletal Problems | 15
Physical Characteristics | 13
Other | 11
Bone Problems | 11
Blood Conditions | 10
Heart Conditions | 10
Growth Issues | 10
Cognitive & Learning Issues | 8
Communication & Social Skills | 8
Digestive Issues | 6
Global Development Delay | 6
Feeding & Swallowing Disorders | 5
Ear Problems | 3
Respiratory Problems | 3
Immune & Allergy Issues | 3
Motor Skills Development | 3
Endocrine Disorders | 2
Skin Conditions | 1
Kidney Problems | 1

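The term-frequency-by-category counts above fall out directly from the term -> category mapping returned by that prompt; a sketch (the parsed mapping structure is an assumption):

```python
from collections import Counter

# Assumed shape of the parsed LLM response: {term: category}
term_to_category = {
    "retinal pigmentary changes": "Eye Problems",
    "microcephaly": "Brain & Neurological Conditions",
    "elevated liver transaminases": "Liver Disease",
}

# Count terms per category, most common first
for category, n_terms in Counter(term_to_category.values()).most_common():
    print(f"{category}|{n_terms}")
```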
eric-czech commented 1 year ago

[PoC 3] NGLY1 deficiency patient extraction

This experiment was a lot more involved than the others and attempted to extract as much detail as possible at the individual patient level from the same 2 papers as PoC 2. If something like this were deployed at a larger scale, it could enable some very interesting use cases like phenotype frequency inference and automated rare disease meta-analysis (among others). I wanted to show that it works at a smaller scale first, since there is plenty of complexity inherent in how those 2 papers present patient-specific data.

Overall, my read is that this works quite well; modulo some likely solvable problems, I'm quite happy with how little code this took and the potential it implies.

Here is a visual outline for the pipeline:

[Figure: visual outline of the patient extraction pipeline]

Prompts in the outline:

  1. Unstructured patient detail extraction
  2. Schema inference
  3. Record extraction
  4. Phenotype frequencies

A few other high-level notes:

Patient detail extraction

These logs are a good indicator of how well an LLM can attribute details to individual patients: extract_patients.log.txt.

These two papers end up providing a nice contrast because PMC7477955 does a relatively poor job of associating data with specific patients in the main text and favors summaries like "Bone age was delayed in eight of the 11 subjects". There are a number of supplementary tables with more individual-level data, but by and large PMC8997433 is much better at associating findings with individuals in both a structured and unstructured fashion.

On the structured front, here is an example from the log of a table (Table 1) in PMC8997433 and how it is converted into essentially unstructured text. I'm later using these unstructured details to populate the patient records:

https://github.com/eric-czech/ngly1-gpt/blob/e66f513f39a3a8d02a63615f9e7f4a57fa8e76ef/data/logs/extract_patients.log.txt#L291-L363

I was worried about this round-trip from table -> text -> JSON, and rightfully so. It does introduce some errors that I think could probably have been avoided. You can see more on that in the Table replication section below. If I did this again, I would definitely redesign this step by looking for a token-efficient output format that better supports a mixture of nested, structured, and unstructured data. YAML may be a better choice.
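To make the YAML point concrete, here is a quick comparison of the same (made-up) nested patient detail serialized both ways; this is just an illustration with PyYAML, not code from the repo:

```python
import json

import yaml  # PyYAML

# Hypothetical mixed structured/unstructured patient detail
record = {
    "patient_id": "P3",
    "age_years": 7,
    "seizure_info": {"onset": "infancy", "eeg": "abnormal background slowing"},
    "notes": "Bone age was delayed; hypolacrima reported by parents.",
}

# YAML drops most of the quoting/brace overhead of JSON, which matters when
# the intermediate representation has to fit into (and come back out of) prompts.
print(len(json.dumps(record)), "characters as JSON")
print(len(yaml.safe_dump(record, sort_keys=False)), "characters as YAML")
```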

Schema inference

Once all the "details" are collected, I am randomly sampling them and providing them to the LLM with a prompt to generate a JSON Schema. This is done with a single completion and here is a log showing the prompt + response: infer_patients_schema.log.txt.

And here is the kind of schema it comes up with:

https://github.com/eric-czech/ngly1-gpt/blob/5f0fd67a53fc924504a0d7a54276f731db94a6e5/data/output/patients.schema.json#L1-L174
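One nice side effect of having an explicit schema is that the extracted records can be validated mechanically. A sketch with the jsonschema package, assuming the inferred schema is draft-7 compatible and that patients.json holds a list of records:

```python
import json

from jsonschema import Draft7Validator

with open("data/output/patients.schema.json") as f:
    schema = json.load(f)
with open("data/output/patients.json") as f:
    patients = json.load(f)

# Report every violation rather than stopping at the first one
validator = Draft7Validator(schema)
for i, record in enumerate(patients):
    for error in validator.iter_errors(record):
        print(f"record {i}: {list(error.path)}: {error.message}")
```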

I think this is a fiddly bit that would be best avoided in the future, in favor of just manually building a schema or drawing on schemas from something like MIMIC2 or phenopacket. The latter is particularly relevant given:

> The goal of the phenopacket-schema is to define the phenotypic description of a patient/sample in the context of rare disease, common/complex disease, or cancer. The schema as well as source code in Java, C++, and Python is available from the phenopacket-schema GitHub repository.

Reminder to self: Talk to @pnrobinson about the state of phenopacket again.

Example patient records

All of the extracted records are in here: https://github.com/eric-czech/ngly1-gpt/blob/e66f513f39a3a8d02a63615f9e7f4a57fa8e76ef/data/output/patients.json#L1-L22

A random sample of 4 of them looks like this:

[Screenshot: a random sample of 4 extracted patient records]

The differences in sparsity between the two studies might look concerning at first glance, but they do make sense when you dig into the papers. PMC7477955 doesn't describe most phenotypes for patients in a way that makes it possible to trace them back to individuals. It does, though, provide a number of supplementary tables with lab panels, EEG findings, nerve conduction studies, etc. Here is one on seizures and EEGs, which is almost certainly why the LLM chose to create a separate seizure_info field: https://github.com/eric-czech/ngly1-gpt/blob/e66f513f39a3a8d02a63615f9e7f4a57fa8e76ef/data/extract/PMC7477955.txt#L148-L163

Results

Demographics/genetics

One of the first things I wanted to see was whether or not it was possible to at least get accurate extractions of details like this across studies:

[Screenshot: extracted patient demographics and mutations across both studies]

Being able to pull out demographics/mutations like this could obviously speed up writing rare disease review papers a ton, at the very least.

I don't have much more to say on this other than that I've reviewed every piece of information there and didn't find any mistakes for any patients in either study. So it's a good start!

Phenotype frequencies

Inferring phenotype frequencies also appears to work reasonably well. I'll note that this includes pieces of information like the following, where individual attribution is not possible (from PMC4243708):

> Other common findings included hypo- or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8)

This means that there may be multiple estimates of frequencies for the same phenotype if the text and tables incorporate the same information. That's shown below in the all_frequencies field, which gives the patient count / total patients ratio for any one occurrence of a frequency estimate:

[Screenshot: inferred phenotype frequencies, including the all_frequencies field]
Full table: ![screencapture-file-Users-eczech-repos-misc-ngly1-gpt-ngly1-gpt-notebooks-test-html-2023-07-17-00_15_29](https://github.com/eric-czech/ngly1-gpt/assets/6130352/2a46162d-d0dd-4b29-8210-0bc165791639)

An important omission here is that I'm using the same denominator (i.e. the total patient count for a study) for all frequencies. That's not always right, since some assessments/measurements aren't made for all patients, e.g. in the excerpt included above. I think that could probably be accounted for in the future without too much trouble, though.
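For reference, the frequency roll-up with that shared-denominator simplification looks roughly like this (column names and the toy data are assumptions, not the notebook's code):

```python
import pandas as pd

# One row per (study, patient, phenotype) presence call; toy data
observations = pd.DataFrame(
    [
        {"study": "PMC4243708", "patient": 1, "phenotype": "microcephaly"},
        {"study": "PMC4243708", "patient": 2, "phenotype": "microcephaly"},
        {"study": "PMC4243708", "patient": 2, "phenotype": "seizures"},
    ]
)

# Shared-denominator simplification: every frequency uses the study's total
# patient count, even if an assessment was only made for a subset of patients.
totals = observations.groupby("study")["patient"].nunique().rename("n_patients")
counts = observations.groupby(["study", "phenotype"]).size().rename("n_with_phenotype")
frequencies = counts.reset_index().merge(totals.reset_index(), on="study")
frequencies["frequency"] = frequencies["n_with_phenotype"] / frequencies["n_patients"]
print(frequencies)
```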

Table replication

In order to say more about validity/accuracy, I checked some of the tables in the papers to see how well reproductions of them from the individual patient records lined up. I think this figure sums that up pretty well:

[Figure: table_replications, comparing original paper tables to reproductions from the extracted records]

See the detailed notes in the dropdown below for what is incorrect for PMC4243708: Table 1.

Table 1 error summary:

- There are 26 phenotypes in the actual table and 24 in the data (missing "ABR abnormalities" and "Strabismus")
- "ND", for "Not Determined", is treated as absent, although the key for abbreviations wasn't included so that's somewhat expected
- Specific mistakes by phenotype:
  - elevated AFP: patients 6/7/8 are - in the data but ND in the actual table (no NDs seem to be preserved)
  - peripheral neuropathy (1): patient 5 should be +
  - EEG abnormalities (1): patient 7 should be -
  - decreased DTRs (2): patient 5 should be + and patient 6 -
  - alacrima/hypolacrima (1): patient 6 should be -
  - constipation (1): patient 6 should be -
  - neonatal jaundice (2): patient 1 should be +, patient 5 should be -
  - dysmorphic features (2): patients 4 and 6 are wrong
  - lactic acidosis (3): 3 patients are wrong

This means that the total number of mistakes is roughly 2 (missing phenotypes) x 8 (patients) + 13 (errors) = 29. There are 34 (fields including demographics) x 8 (patients) = 272 cells in the actual table, so the **error rate** here is ~29 / 272 = **10.6%**. This error rate is 13 / (272 - 29) = **5.3%** if you don't count missed phenotypes.

The two strategies that come to mind first that might mitigate some of the errors are:

  1. Run the whole process multiple times and look for conflicts/uncertainty (see the sketch after this list)
  2. Don't use unstructured text as an intermediate representation for tables
    • I'm frankly shocked this worked as well as it did, and it could certainly be avoided with more engineering work
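On strategy 1 above, a minimal way to surface conflicts is to repeat the extraction a few times and flag any cell where the runs disagree (toy data below; majority voting or a targeted re-prompt could then resolve the flagged cells):

```python
import pandas as pd

# Three repeated extraction runs of the same patient/phenotype calls (toy data)
runs = [
    {"patient 5 | peripheral neuropathy": "+", "patient 6 | constipation": "-"},
    {"patient 5 | peripheral neuropathy": "-", "patient 6 | constipation": "-"},
    {"patient 5 | peripheral neuropathy": "+", "patient 6 | constipation": "-"},
]

calls = pd.DataFrame(runs)

# Cells where the runs disagree are candidates for manual review or a re-prompt
conflicts = [c for c in calls.columns if calls[c].nunique() > 1]
consensus = calls.mode().iloc[0]  # simple majority vote per cell

print("conflicting cells:", conflicts)
print(consensus)
```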