exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Add a variant whitelist for known pathogenic variants or use ClinVar data #152

Closed julesjacobsen closed 5 years ago

julesjacobsen commented 8 years ago

There are often cases where a known pathogenic variant may occur more frequently in a population than the cutoff the frequency filter is set to.

Enable the ability to specify variants which should pass the frequency filter even if they are over the frequency threshold.

rdemolgen commented 8 years ago

Two resources that could be used to provide the known pathogenic annotation are HGMD (professional) and ClinVar. HGMD professional doesn't, as far as I know, provide an easy-to-use API even if you do have an account. ClinVar is complicated by the conflicting reports of pathogenicity that often appear; maybe just take the most recent report? Alternatively you could allow users to supply their own list of HGNC approved gene symbols (possibly with an additional field specifying a reference paper).

damiansm commented 8 years ago

A user-supplied list of whitelist variants would be best I think. Then if they have HGMD they can convert the file download to our format and use it. Similarly they could use their own curated version of ClinVar. We could supply the latest ClinVar as a default example file to start with, but most people don't trust ClinVar absolutely. The ability to keep all ClinVar variants and flag them up would maybe be useful though. One complication with this strategy is that we use the MAF in our scoring, so if a variant has a MAF of more than 2% it will score really badly and not be prioritised anyway.


julesjacobsen commented 8 years ago

This would most likely need to be implemented as a new frequency filter which would allow the whitelisted variants through regardless of their frequency.

Another consideration is the pathogenicity score of these. Would we want to give these a score of 1.0 (maximally pathogenic) by default, or just leave that as is?

Are we expecting a curated list of 100-1000s of variants or just a select few? Most likely the former, I'd expect?

damiansm commented 8 years ago

Could the current Frequency filter take an optional whitelist to ignore? Would count on there being 100-1000's in case people use whole of ClinVar.

The scoring is a tricky one. As things stand, if we want them to stand out and appear at the top of the list, or at least not at the bottom, we would maybe need to ignore the frequency and pathogenicity scores for them and just give them all a variant score of 1, as the user is already saying they know it is pathogenic and involved in rare disease. I think in an ideal world you would be able to see the annotated frequencies and pathogenicity scores for these variants but they would not contribute to the variant score. This is tricky! It may be best to whitelist things further up the chain, i.e. something like

if (variant.isWhiteListed()) {
    variant.setStatus("PASSED");
    variant.setScore(1);
} else {
    // run all the usual filtering and scoring
}


julesjacobsen commented 8 years ago

Having had a quick look it looks like the easiest way to make this work is to add logic to the frequency and pathogenicity filters to pass whitelisted variants. That way the main application logic doesn't need to be touched. Other filters such as interval, geneId, quality, variant effect filters and the like should still work as before.

What should the behaviour be for the knownVariantFilter? It somewhat depends on your motivation for supplying the whitelist. If you're adding all of ClinVar to the whitelist and then you apply the knownVariant filter, what would you expect?

This will involve a change to VariantEvaluation, some way of adding this whitelisted data and then presumably a flag in the output files to indicate that this is a special case. Might also need a new PathogenicityScore so that this is added and always considered pathogenic.

pnrobinson commented 8 years ago

Could we not take the ClinVar pathogenic variants? Even if this may change occasionally, that is a GOOD THING (R), especially in comparison to HGMD, where apparently there is little curation of a mutation once it is put into the database (many papers referenced in HGMD have been retracted etc.).

visze commented 8 years ago

Why not use any arbitrary VCF as a whitelist? Every variant that is present in it will be passed through. So the user can use ClinVar or their own generated file (e.g. from HGMD or in-house).

This raises a new question about adding your own population frequencies, e.g. from in-house sequencing data. But I can open another ticket about that.


julesjacobsen commented 8 years ago

Yeah - I wanted the user to have control over what data is used as a whitelist and leave it to them to supply it. They could use whatever data source they wanted - HGMD, ClinVar, in-house data or a mixture of them all, but to do this they would need to munge the data into their own list in the format Exomiser requires.

However I don't want to start thinking about formats as yet. We need to define the actual use-cases and requirements for this. i.e. a user-story.

e.g. using the BDD given-when-then approach:

Case: Variants on the whitelist pass the pathogenicity filter even if they are considered non-pathogenic
Given: A user has defined a variant in a whitelist.
Given: The variant has non-pathogenic PathogenicityScores.
When: The pathogenicity filter is run.
Then: The variant will PASS the pathogenicity filter.

another case might be:

Case: Variants on the whitelist pass the frequency filter even if they are over the frequency threshold
Given: A user has defined a variant in a whitelist.
Given: The variant has a Frequency of 2%.
When: The frequency filter is run with a threshold of 1%.
Then: The variant will PASS the frequency filter.
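
The two cases above can be sketched as executable checks. This is only an illustration under stated assumptions: `BddVariant`, `WhitelistOverride` and the 1% and 0.5 cutoffs are hypothetical names and values invented for the example, not Exomiser classes or defaults, and variants are keyed by a plain "chr-pos-ref-alt" string.

```java
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical, simplified variant: key is "chr-pos-ref-alt", frequency is a percentage. */
final class BddVariant {
    final String key;
    final double frequencyPercent;
    final double pathogenicityScore;

    BddVariant(String key, double frequencyPercent, double pathogenicityScore) {
        this.key = key;
        this.frequencyPercent = frequencyPercent;
        this.pathogenicityScore = pathogenicityScore;
    }
}

final class WhitelistOverride {
    private WhitelistOverride() {
    }

    /** Wraps a filter so that whitelisted variants pass without the wrapped filter being consulted. */
    static Predicate<BddVariant> allowing(Set<String> whitelist, Predicate<BddVariant> filter) {
        return variant -> whitelist.contains(variant.key) || filter.test(variant);
    }
}

final class BddCasesDemo {
    public static void main(String[] args) {
        Set<String> whitelist = Set.of("10-12234-A-G");
        // Illustrative cutoffs only: frequency threshold 1%, pathogenicity threshold 0.5.
        Predicate<BddVariant> frequencyFilter =
                WhitelistOverride.allowing(whitelist, v -> v.frequencyPercent <= 1.0);
        Predicate<BddVariant> pathogenicityFilter =
                WhitelistOverride.allowing(whitelist, v -> v.pathogenicityScore >= 0.5);

        // Both variants are over the frequency threshold and predicted non-pathogenic.
        BddVariant listed = new BddVariant("10-12234-A-G", 2.0, 0.1);
        BddVariant unlisted = new BddVariant("2-43345562-T-G", 2.0, 0.1);

        System.out.println(frequencyFilter.test(listed));       // true
        System.out.println(pathogenicityFilter.test(listed));   // true
        System.out.println(frequencyFilter.test(unlisted));     // false
        System.out.println(pathogenicityFilter.test(unlisted)); // false
    }
}
```

Wrapping each filter keeps the override local to the frequency and pathogenicity filters, so any filter that is not wrapped behaves exactly as before.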

Now we need to define the whitelist - does it just contain a list of variants:

10-12234-A-G
2-43345562-T-G

or does there need to be a frequency component too?

10-12234-A-G
    local: 0.01
2-43345562-T-G
    local: 3.56
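
For the simpler of the two forms (one chr-pos-ref-alt key per line, no frequency component), reading the list could look roughly like the sketch below. `WhitelistParser` is a hypothetical name, and skipping blank lines and `#` comments is an assumption made for the example; no file format has been agreed at this point in the discussion.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class WhitelistParser {
    private WhitelistParser() {
    }

    /**
     * Parses lines such as "10-12234-A-G" into a set of variant keys.
     * Blank lines and lines starting with '#' are skipped (an assumption of this sketch).
     */
    static Set<String> parse(List<String> lines) {
        Set<String> keys = new LinkedHashSet<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("-");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected chr-pos-ref-alt but got: " + line);
            }
            // Validates the position; a non-numeric value throws NumberFormatException,
            // which is itself an IllegalArgumentException.
            Integer.parseInt(fields[1]);
            keys.add(line);
        }
        return keys;
    }
}
```

Calling `WhitelistParser.parse(List.of("10-12234-A-G", "2-43345562-T-G"))` gives a set with both keys, which a filter can then query with `contains`.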

There is another issue #78 which might conflict with this for example.

damiansm commented 8 years ago

Think the user story should be pretty simple: "if on whitelist always PASS any filter". Only tricky one is inheritance. Often you want to see your known pathogenic variants, even if you don't have a variant in the other copy of the gene for recessive conditions so you can look more carefully into your calling, CNVs etc. But some users will want them removed as they are most likely just carrier mutations.

I suspect the simplest to implement is to skip filtering for all VARIANT filters but leave in the GENE level filters like inheritance.

We should keep the whitelist as just a list of variants. The use case for supplying in-house variant frequency data is separate and usually the opposite, i.e. you are looking to filter on these.


damiansm commented 6 years ago

Note we have already brought ClinVar into the database and now flag whether the variants that have passed the filters are in ClinVar, along with their pathogenicity status.

damiansm commented 6 years ago

Some further investigation of the GEL missed top 5 diagnoses suggests that scoring known ClinVar pathogenic variants as variantScore = 1 may be sensible and worth investigating. See https://github.com/exomiser/Exomiser/issues/296

pnrobinson commented 5 years ago

According to the analysis I did of Exomiser scores of ClinVar variants, they have a 100% score the great majority of the time. I would therefore also take a look and see if we have cases where a ClinVar-pathogenic variant was the cause of disease in cases with bad phenotype matches.

damiansm commented 5 years ago

@pnrobinson As I just discovered with Jules, we need to be careful when we are talking about pathogenicityScore, frequencyScore and variantScore (a combination of both) in Exomiser to avoid confusion! For the GEL cases I also see that most of the ClinVar variants get a high pathogenicityScore ~ 1, with the exception of some splice region variants which we score 0.8 and which it would be nice to bump up to 1. But some are not super-rare so end up with lowish overall variantScores. Jules and I have been debating whether a fairly common known pathogenic ClinVar variant should get a variantScore of 1 or the frequency-weighted score. We are leaning towards the latter as it is safer, and I have already seen one example of a 0.5% known pathogenic ClinVar variant being re-classified in the latest release to conflicting reports.

Jules has prepared a test release which scores all known and likely pathogenic ClinVar variants with a pathogenicityScore=1. On my GEL verification set it does not change overall performance, and I can see it has improved the rank of some of the cases that were outside the top 5. Some more detail:

julesjacobsen commented 5 years ago

I've put some changes in for this, so ClinVar variant data is now reported a lot more consistently. Previously it would only be reported for non-synonymous, coding region variants, but now it will be reported for all variants.

Secondly, variants with a ClinVar primary interpretation of Pathogenic, Pathogenic_or_likely_pathogenic and Likely_pathogenic, irrespective of their review status, are scored as maximally pathogenic.
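
As a rough illustration of that rule (not the code in the commit), the mapping from primary interpretation to pathogenicity score could be sketched like this. The enum `ClinSigSketch` and the class `ClinVarScoring` are hypothetical names made up for the example, and review status is deliberately ignored, as described above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical subset of ClinVar primary interpretations, for illustration only. */
enum ClinSigSketch {
    BENIGN,
    LIKELY_BENIGN,
    UNCERTAIN_SIGNIFICANCE,
    CONFLICTING_INTERPRETATIONS,
    LIKELY_PATHOGENIC,
    PATHOGENIC_OR_LIKELY_PATHOGENIC,
    PATHOGENIC
}

final class ClinVarScoring {
    private static final Set<ClinSigSketch> MAXIMALLY_PATHOGENIC = EnumSet.of(
            ClinSigSketch.PATHOGENIC,
            ClinSigSketch.PATHOGENIC_OR_LIKELY_PATHOGENIC,
            ClinSigSketch.LIKELY_PATHOGENIC);

    private ClinVarScoring() {
    }

    /**
     * Returns 1.0 for the three pathogenic interpretations,
     * otherwise leaves the predicted score untouched.
     */
    static double pathogenicityScore(ClinSigSketch primaryInterpretation, double predictedScore) {
        return MAXIMALLY_PATHOGENIC.contains(primaryInterpretation) ? 1.0 : predictedScore;
    }
}
```

Under this sketch, a splice region variant predicted at 0.8 would be bumped to 1.0 if ClinVar lists it as Likely_pathogenic, while a variant with conflicting interpretations keeps its predicted score.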

Note that commit 7aa8485 doesn't override the frequency filtering for pathogenic ClinVar variants or implement a user-definable whitelist. Shall we wait for the performance testing/next release to re-visit those ideas?

damiansm commented 5 years ago

I think from all the discussion detailed above, what we have now is good to go, i.e. a pathogenic/likely pathogenic ClinVar variant gets pathogenicityScore=1 and a proper frequency-weighted, overall variantScore.
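
To keep the three scores apart, here is a tiny worked sketch. It assumes the combined variantScore is simply the product of frequencyScore and pathogenicityScore; the thread only says it is a "combination of both", so treat the product as an assumption of this example, and `VariantScoreSketch` as a hypothetical name.

```java
final class VariantScoreSketch {
    private VariantScoreSketch() {
    }

    /** Assumed combination: the pathogenicity score weighted by the frequency score. */
    static double variantScore(double frequencyScore, double pathogenicityScore) {
        return frequencyScore * pathogenicityScore;
    }
}
```

With this assumption a ClinVar pathogenic variant with pathogenicityScore=1 and a frequencyScore of 0.5 ends up with a variantScore of 0.5, which is why a fairly common known pathogenic variant can still rank low.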

williakd17 commented 5 years ago

Thought I'd offer my current perspective on this issue. I am currently evaluating the gene ranking performance against that of a commercial software package, and here's what I've found. I've analyzed the gene ranks for 100 positive exome VCFs. For the positive reported genes and their associated variants, 40% were a top candidate, 70% were in the top 5, 83% were in the top 10, and 93% were in the top 21.

However, from a clinical standpoint, the limitations of filtering out common pathogenic/likely pathogenic etc. genes/variants are a large barricade. I also evaluated the results of loosening the parameters (expanding frequency filters and minimum quality scores), and it was a pretty significant reduction in performance. The ranking of the genes has exceeded my expectations and it blows the other software out of the water (would need to review around 150 genes and their associated variants to cover 95% of the known positives in this sample pool); however, missing variants in genes known to be associated with phenotype is a major issue.

I would absolutely love a whitelist implementation over implementing ClinVar data, the reason being that Exomiser performs extraordinarily well in instances outside of common variants. A couple of examples of common "smoking gun" variants that are filtered out are as follows:

g.29825015 PRRT2 NM_001256443 c.649dupC p.Arg217Profs*8 Heterozygous

Implementation of a whitelist so specific variants could avoid filtering on these paths would be incredibly useful. It would essentially remove the only current limitation of Exomiser for me.

pnrobinson commented 5 years ago

This is a great suggestion, thanks.

damiansm commented 5 years ago

+1 from me as well. As well as being one of the main Exomiser developers, I am using it for the 100,000 Genomes Project and missing those known pathogenic but common variants is definitely one of the sources of missed diagnoses by Exomiser for us. We also see great performance on the 100KGP set, even higher than you report but we have some trios as well, and would be great to not miss any!

Let's revisit this one in the new year.


julesjacobsen commented 5 years ago

Trying to scope out the extent of the change here - would it suffice to override the frequency and pathogenicity filters or would you want to see these even when other filters might also have failed, e.g. quality?

It looks like what you actually want isn't strictly a catch-all whitelist but more of a frequency filter override list? Secondly would you want to give all these an automatic variant (i.e. path and freq) score of 1?

damiansm commented 5 years ago

I would not override all filters for these known pathogenic variants e.g. we want it to segregate in the family properly with the expected mode of inheritance, pass quality filters etc.

The use-case in my mind is to allow a 3% frequency, known pathogenic variant that is typically seen in conjunction with a second, much rarer allele through the autosomal recessive filter, even if the freq cutoff is set to 2% and give that variant a variantScore=1.

A simpler solution may be to force the user to up the frequency filters to 5% and score the variants as usual, except give these known pathogenic variants a score of 1.

julesjacobsen commented 5 years ago

I was thinking of a different approach - the user provides a whitelist file for each assembly just like the local frequency file. Any variant appearing in the sample which is on this list is marked as whitelisted and is allowed through the frequency and pathogenicity filters and has a variant score of 1.

This will allow correct segregation, quality filtering etc. but overcome the issues mentioned by @williakd17 and will keep the sensitivity for other variants.
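
A minimal sketch of that behaviour, under stated assumptions: `CalledVariant` and `WhitelistPipelineSketch` are hypothetical stand-ins (not Exomiser's VariantEvaluation or its filter classes), the cutoffs are illustrative, and the non-whitelisted score is assumed to be the product of the frequency and pathogenicity scores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified called variant; key is "chr-pos-ref-alt". */
final class CalledVariant {
    final String key;
    final double quality;
    final double frequencyPercent;
    final double frequencyScore;
    final double pathogenicityScore;

    CalledVariant(String key, double quality, double frequencyPercent,
                  double frequencyScore, double pathogenicityScore) {
        this.key = key;
        this.quality = quality;
        this.frequencyPercent = frequencyPercent;
        this.frequencyScore = frequencyScore;
        this.pathogenicityScore = pathogenicityScore;
    }
}

final class WhitelistPipelineSketch {
    private final Set<String> whitelist;
    private final double minQuality;
    private final double maxFrequencyPercent;
    private final double minPathogenicityScore;

    WhitelistPipelineSketch(Set<String> whitelist, double minQuality,
                            double maxFrequencyPercent, double minPathogenicityScore) {
        this.whitelist = whitelist;
        this.minQuality = minQuality;
        this.maxFrequencyPercent = maxFrequencyPercent;
        this.minPathogenicityScore = minPathogenicityScore;
    }

    /** Names of the filters the variant fails; only frequency and pathogenicity are overridden. */
    List<String> failedFilters(CalledVariant variant) {
        boolean whitelisted = whitelist.contains(variant.key);
        List<String> failed = new ArrayList<>();
        if (variant.quality < minQuality) {
            failed.add("QUALITY"); // the whitelist never rescues a low-quality call
        }
        if (!whitelisted && variant.frequencyPercent > maxFrequencyPercent) {
            failed.add("FREQUENCY");
        }
        if (!whitelisted && variant.pathogenicityScore < minPathogenicityScore) {
            failed.add("PATHOGENICITY");
        }
        return failed;
    }

    /** Whitelisted variants score 1; others use the assumed frequency * pathogenicity product. */
    double variantScore(CalledVariant variant) {
        return whitelist.contains(variant.key)
                ? 1.0
                : variant.frequencyScore * variant.pathogenicityScore;
    }
}
```

With a 2% frequency cutoff, a whitelisted variant at 3% fails no filters and scores 1, the same variant with a quality below the cutoff still fails QUALITY, and a non-whitelisted variant at 3% fails FREQUENCY as before.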

This is a pretty simple solution to implement as well.

damiansm commented 5 years ago

That def works, and cool if it is the easiest one to implement. People can then use a file of ClinVar known pathogenic variants as a default whitelist if they want.

julesjacobsen commented 5 years ago

@williakd17 @pnrobinson would you say that the size of the list is only going to be in the range of a few thousand, potentially tens of thousands if including the ClinVar Path/Likely_path?

On a separate note, these variants ought to be kept up to date by the user otherwise there is a danger of false positives.

damiansm commented 5 years ago

Now implemented and performing well. We will supply a suggested whitelist file of all non-conflicting pathogenic and likely pathogenic variants in the new variant dbs, and this can be turned on/off in application.properties. Users can modify the whitelist as required, e.g. adding known diagnoses they have seen in their own projects.