broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0
175 stars 89 forks source link

Add LIRICAL and Exomiser phenotype-based prioritization scores #2742

Closed bw2 closed 1 year ago

bw2 commented 2 years ago

LIRICAL (described here and here ) takes HPO terms and a single-sample VCF as input and generates a table of ~100 candidate gene-disease pairs. Each candidate has the following fields which will be important to show in seqr: post-test probability (float) example: 86.63% rank (int) example: 3 disease name (str) example: "Myopathy, mitochondrial, and ataxia" gene id (str) the default is Entrez gene id (example: NCBIGene:55154), but I could convert to ENSG ids upstream of seqr

For the initial implementation, seqr could use the gene id to join LIRICAL results to seqr variants, and seqr users would be able to filter and/or sort results by lirical rank.

A more advanced implementation would also give users access to the html page that LIRICAL generates for each affected individual. This would allow users to see plots like:

image

which show which phenotype terms are consistent with or contradict a given gene-disease candidate.

bw2 commented 2 years ago

These are the fields in the TSV that will be uploaded for LIRICAL and Exomiser:

tool - "exomiser" or "lirical" sampleId - the tools will be run for affected individuals rank (integer) - this will be unique for LIRICAL, but not for Exomiser (multiple rows will have the same rank)

geneId (str) - Ensembl gene id diseaseId (str) - examples: OMIM:618460, ORPHA:71517 diseaseName (str) - example: Cerebellar ataxia-areflexia-pes cavus-optic atrophy-sensorineural hearing loss syndrome

scoreName1 (str) - label describing what's in the Score field score1 (float) - this will be the compositeLR or post-test probability for LIRICAL and the exomiser score for Exomiser scoreName2 (str) - optional label describing what's in the 2nd Score field score2 (float) - An optional second score (ie. the variant score from exomiser) scoreName3 (str) - optional label describing what's in the 3rd Score field score3 (float) - An optional third score (ie. the phenotype score from exomiser)

The tool, sampleId, rank and geneId will always be non-null. Other fields may be null.

hanars commented 2 years ago

For MVP: we will add per-gene records similar to RNA seq outlier data, which will show as a flag on the gene in search results.

Future enhancements:

bw2 commented 2 years ago

Variants are an optional input to LIRICAL - it can be run with just a list of HPO terms (eg. phenotype-only mode). There's now interest in adding LIRICAL phenotype-only scores to seqr results so that genes can have a flag indicating that they are relevant to the phenotype. I think the logical way to provide these scores would be to use the same .tsv format as above but just specify the tool column as something like "lirical_phenotype_only".

ShifaSZ commented 2 years ago

For MVP: we will add per-gene records similar to RNA seq outlier data, which will show as a flag on the gene in search results.

I would like to learn a little more about the RNA seq outlier feature. I remember there was a ticket for it, but I can't find it now. I want to understand who and how to add RNA seq outlier data into the seqr, and how the users use the feature.

hanars commented 2 years ago

The initial PR for adding the outlier data is here: https://github.com/broadinstitute/seqr/pull/2345/files

hanars commented 2 years ago
hanars commented 2 years ago

potential for later work (not in MVP): instead of linking to static HTML files we can generate the plots directly in seqr. May require adding additional data to the tsv

ShifaSZ commented 2 years ago

Tasks for this ticket:

  1. Add a new model for the LIRICAL/Exomiser data
    class LiricalOrExomiser(DeletableSampleMetadataModel)
  2. Add a data manager interface for loading TSV files generated by LIRICAL/Exomiser The UI is as below figure. image Question: do we need the mapping sample IDs?
  3. Show a flag in the variant gene area of the variant panel, and show a popup to display the top-ranked gene-disease pairs. The top 10 gene-disease pairs will be displayed.

The flag: image

The popup window: image Questions: I'm unsure how many scores we have for the LIRICAL and Exomiser. I can find two scores for LIRICAL and three for Exomiser from the example data. Are these scores all that we can have? Should we display the LIRICAL and Exomiser data separately or combinedly? Should we also display the disease IDs?

bw2 commented 2 years ago

Are these scores all that we can have?

yes

Also, I'd vote for not hardcoding the "Lirical" and "Exomiser" names in the code and UI. The tools we use may change in the future. So would use

class PhenotypeBasedPrioritization(DeletableSampleMetadataModel)

or

class PhenotypeBasedScores(DeletableSampleMetadataModel)

instead of

class LiricalOrExomiser(DeletableSampleMetadataModel)

Should we display the LIRICAL and Exomiser data separately or combinedly?

The tool column in the input table should be the source of the label for the gene tag. I'd also vote for having a separate tag for each tool.. so

image

and

image

instead of

image

Should we also display the disease IDs?

It would probably be useful to show them and link to the corresponding OMIM page

hanars commented 2 years ago

Question: do we need the mapping sample IDs?

No, there is no reason to think we would need to and we can always manually fix an ID in the file if we really have to. If it turns out we constantly are needing to adjust them then we can go back and add this functionality later

Also, I'd vote for not hardcoding the "Lirical" and "Exomiser" names in the code and UI.

On the one had I agree and I like PhenotypePrioritization for the model call name but either of the names he proposed are also good. If we go with that approach I also agree that there should be a tool column and that the label should reflect the specific tool, although I think it should be an enum so if we did want to add future tools it would be an explicit code change.

Should we display the LIRICAL and Exomiser data separately or combinedly

Separately. I think there should be one label per tool that it is flagged in, and then on hover the table should just show the correct scores for that tool. For the hover, you also do not need to show the gene id, as the label is next to the relevant gene.

Also, I think the label should be a different color than the rna seq labels since it means something pretty different. I would recommend yellow, but if you prefer we could also do orange so it visually looks related to OMIM.

ShifaSZ commented 2 years ago

Summaries of the discussions:

  1. Add a new model for the phenotype-based prioritization data

    class PhenotypePrioritization(DeletableSampleMetadataModel)

    The columns include: tool (enum), rank, disease_id, disease_name, score1, score2, and score3. The gene ID and sample ID are in the base class.

  2. Add a data manager interface for loading the phenotype-based prioritization data (a .tsv file) The UI is as below figure. image

  3. Show a flag on the gene and a popup. The scores are PTP and LR for the LIRICAL flag and exomiser, phenotype, and variant for the Exomiser flag. image

bw2 commented 2 years ago

Are you thinking not to include

scoreName1 (str) - label describing what's in the Score field
scoreName2 (str) - label describing what's in the Score field
..

columns?

hanars commented 2 years ago

I don't really like the ides of like sore1, scoreName1 etc, but I do like the idea of storing this generically so we don't need to update the model if we add a new score in the future. I think maybe a better way to do this would be a json field, with a dictionary where the keys are score names and the vales are scores. So in the example above scores would look like {"PTP": 0.833, "LR": 0.255}

hanars commented 2 years ago

Also, should we be storing disease ID and disease name? Or do we want to just store disease id and get the diseas name from the OMIM models?

bw2 commented 2 years ago

I don't have a preference on this. If you decide not to store the disease name, it would be good to check that you recognize the omim id at loading time.

ShifaSZ commented 2 years ago

I checked the disease IDs in the sample files against the prod reference DB. I found there are many diseases are missing in the database.

hanars commented 2 years ago

that pretty concerning to me as we do a full download of the latest OMIM every week so we shouldn't be missing any of them. Can you make sure you are checking against the current prod db and not using an outdated local one? And if there are still many missing can you give me a few examples?

ShifaSZ commented 2 years ago

After updating my local reference DB, no OMIM IDs in the sample files are missing. We have another problem how we can get the diseases for the "ORPHA" disease IDs like "ORPHA:2131", "ORPHA:71517", etc.?

ShifaSZ commented 2 years ago

@bw2 Are the score names always the same for each tool? If yes, can we define tool-based constants for the score names?

ShifaSZ commented 2 years ago

There are quite some codes for RNA seqr outlier/tpm data loading/validating, which can be reused for the LIRICAL/Exomiser data loading. Is it okay to update the existing codes and make them reusable?

hanars commented 2 years ago

We have another problem how we can get the diseases for the "ORPHA" disease IDs like "ORPHA:2131", "ORPHA:71517", etc.?

I don't really understand this question. Where are these disease IDs coming from and why do we need to get the diseases from them?

Are the score names always the same for each tool? If yes, can we define tool-based constants for the score names

I don't want to do this unless there is a good reason to. What are the advantages for defining tool-specific constants instead of just storing generic json?

Is it okay to update the existing codes and make them reusable?

Yes! If you know in advance which ones you want to do it might be helpful to make the code reusable in a standalone PR first and then add the new functionality branched off of that

ShifaSZ commented 2 years ago

We have another problem how we can get the diseases for the "ORPHA" disease IDs like "ORPHA:2131", "ORPHA:71517", etc.?

I don't really understand this question. Where are these disease IDs coming from and why do we need to get the diseases from them?

If you open the file exomiser_for_seqr.tsv you sent me last week, you can find there are some disease IDs starting with ORPHA: rather than OMIM:.

hanars commented 2 years ago

oh, that sounds like its orphanet probably. In that case it is probably better to just store disease id and disease name for everything. Unless @bw2 has some insight into mapping orphanet to omim

ShifaSZ commented 2 years ago

https://github.com/broadinstitute/seqr/blob/f32860689613ddcb152304411683d43e570c406d/ui/shared/components/panel/variants/VariantGene.jsx#L673-L684 @hanars, I have a question about the above code. In line#675, since the genes are filtered with the condition showDetails(gene) without rnaSeqData in the parameter, the RNA-seqr outlier section of the configurations will be excluded. Therefore, line#682 doesn't need the rnaSeqData parament. Is that correct? Or should line#675 include the rnaSeqData parameter?

hanars commented 2 years ago

to maintain current behavior you are correct that line#682 doesn't need the rnaSeqData, but I think it would be better for line#675 to include it instead

ShifaSZ commented 2 years ago

@hanars It seems that I need to create a new sample type for the LIRICAL/Exomiser data. What's the name for the new sample type do you like? 'PHENOTYPE', 'PHENOTYPE_PRI', or any other? Or, we could re-use the sample type 'RNA' with the RNA-seq outlier and tpm data. But if we re-use the 'RNA' sample type, we'ld better to rename it.

hanars commented 2 years ago

Don't make a sample for them at all, just associate them directly with the individual model. There is no underlying sample being called for these, and we don't want to include them in any of our "data loaded" summaries as they do not, in fact, represent any loaded data

ShifaSZ commented 2 years ago

Don't make a sample for them at all, just associate them directly with the individual model.

Okay, then I won't use the DeletableSampleMetadataModel base class. Instead, I define a new class with an Individual foreign key field.

ShifaSZ commented 2 years ago

The tool, sampleId, rank and geneId will always be non-null. Other fields may be null.

@bw2 The rows are not unique when using the combination of sampleId and geneId as a key. Should we add diseaseId to the key? If yes, then the diseaseId must be non-null as well.

hanars commented 2 years ago

I don't see why you need to enforce a unique key here at all. There will be a database ID for uniqueness

ShifaSZ commented 2 years ago

The issue without a unique key is that we can't update existing data when we load a new data set. Do we add new data entries or remove existing entries and insert new ones?

hanars commented 2 years ago

If we have new data for a given sample and tool, we should delete all the data for that sample and tool and reload it completely based on the new data. trying to compare what we used to have and what is new is too complicated

ShifaSZ commented 2 years ago

As a summary of the discussions:

  1. Create a model for the phenotype-based prioritization data
    class PhenotypePrioritization(models.Model)

    with fields tool (enum), individual, rank (integer), gene_id (string), disease_id (string), disease_name (string), scores (JSON).

  2. Replace the existing entries with the same sample (individual) and tool with new data when loading a new data set

For data loading, there is not much code that can share with RNA-seqr outlier/tpm. We will implement the phenotype-based prioritization scores separately.

lynnpais commented 2 years ago

Will there be an option on the Family Page to quickly view the LIRICAL results (like we have the Outlier plot for RNASeq) in this release? From Hana's earlier comment, it sounds like this is planned as a future enhancement. If so, can we have a link to the static HTML page if it is the easiest to implement for now?

hanars commented 2 years ago

So this ticket is for the MVP (minimum viable product). With what was defined above as the minimal possible functionality, its about 3-4 weeks of work. Everything else discussed, including a family page link or a link to the HTML page, are out of scope for this ticket. The philosophy behind MVP in software development is that it is better to get something released in a month than get everything released in 2 months. Moreover, doing a smaller MVP release can help with relative prioritization - is it better to have lirical with no HTML page and have Iranmoe added to seqr, or is it better to have the lirical HTML page at the expense of other feature work?

"Future enhancement" can include something very fast, so if having these added for lirical would be the highest priority for Shifa to work on after the MVP, we could prioritize doing those immediately and there would not be much gap between go live and when the enhancements are added, it doesn't necessarily mean a long wait. One thing that might be helpful would be to make tickets for the future enhancements for Lirical, and then we can size and prioritize those against the other seqr work, as well as make it clear that they won't be done as part of this ticket

Also, the family page option is actually easier to implement than a link to static HTML, although both are possible

hanars commented 1 year ago

This has been added. I will work with @bw2 to actually get this data added to seqr, although probably we will wait until the family page summary UI is available