exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Can dbNSFP provide all the required variant annotations? #134

Closed julesjacobsen closed 7 years ago

julesjacobsen commented 8 years ago

dbNSFP has grown a lot since Exomiser started. Currently we're parsing a lot of separate resources and only using dbNSFP to extract SIFT, Polyphen and MutationTaster scores. There are a lot of other variant resources in there, including all the frequency sets we use, so it looks like it probably can - dbNSFP version 3.1a has:

Positions:

  • hg38
  • hg37

Frequencies:

  • dbSNP
  • 1000 Genomes
  • ESP
  • ExAC
  • UK10K

Pathogenicity scores:

  • CADD
  • SIFT
  • Polyphen
  • Clinvar

The gene annotations map the HGNC gene ID to UCSC, ENSEMBL, MGI, ZFIN, MIM gene, MIM disease and MIM phenotype IDs.

Using this single resource would massively simplify the data build procedure. @damiansm @pnrobinson - can you remember the reasons for using the current, non-integrated set of data sources? Other than becoming completely dependent on a single source of data and relinquishing the ability to update each data source independently, can you think of any other reason not to use this?

pnrobinson commented 8 years ago

Not sure about the choice of data sources anymore, but dbNSFP is probably a great option and I do not see too many reasons not to use it. We could also use the NCBI Entrez Gene ID; I did not see it in the list but perhaps it is also there? -Peter

damiansm commented 8 years ago

I guess it only has data for the non-synonymous variants, though, and we were getting some information on the other variant types from other sources?

On Tue, May 31, 2016, Jules Jacobsen wrote:

dbNSFP version 3.1a has:

Positions:

  • hg38
  • hg37

Frequencies:

  • dbSNP
  • 1000 Genomes
  • ESP
  • ExAC
  • UK10K

Pathogenicity scores:

  • CADD
  • SIFT
  • Polyphen
  • Clinvar

julesjacobsen commented 8 years ago

@damiansm hmm, good point. Will check that.

@pnrobinson Kind of - the Entrez gene ID is in the dbNSFP_gene file rather than the variants file.

julesjacobsen commented 8 years ago

Yep, dbNSFP only contains non-synonymous variants (http://www.ncbi.nlm.nih.gov/pubmed/26555599). Bummer - should have read the manual :(

damiansm commented 8 years ago

Or even just the acronym ;-)

visze commented 8 years ago

Do they provide ExAC for GRCh38? I can only find a GRCh37 ExAC release on their website.

martenj commented 8 years ago

No, there is still no dataset for GRCh38. I have looked for it several times since I wanted it for ASDPex.

julesjacobsen commented 8 years ago

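Kaviar http://db.systemsbiology.net/kaviar (ref http://bioinformatics.oxfordjournals.org/content/27/22/3216.full) looks like a decent contender for variant frequencies. The data sources are as follows: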

[Table of Kaviar data sources; columns: Dataset, SNVs, Unique (%), Novel (%), References]

Using Kaviar and dbNSFP together with REMM and CADD should mean that we can use tabix directly to query the sequence data sources. People could then update their data by simply replacing the files, so we wouldn't need to build these into the database. At worst we'll have to combine Kaviar and dbNSFP into a single tabix data source containing all the variant information. This will get us much improved variant frequencies and a simpler data build.
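To make the "single tabix data source" idea concrete, here is a minimal Python sketch of merging two per-variant tables into one position-sorted, tab-separated file. Everything in it is an assumption for illustration: the column names (`kaviar_af`, `sift`, `polyphen`), the tiny inline extracts and the helper names are hypothetical, and the real Kaviar and dbNSFP files have different layouts. It is not how Exomiser builds its data.

```python
import csv
import io

# Hypothetical, heavily simplified extracts. Real Kaviar and dbNSFP files
# use different layouts and many more columns.
KAVIAR_TSV = (
    "chrom\tpos\tref\talt\tkaviar_af\n"
    "1\t12345\tA\tG\t0.0012\n"
    "1\t20000\tC\tT\t0.2500\n"
    "2\t500\tG\tA\t0.0100\n"
)
DBNSFP_TSV = (
    "chrom\tpos\tref\talt\tsift\tpolyphen\n"
    "1\t12345\tA\tG\t0.01\t0.98\n"
    "2\t500\tG\tA\t0.40\t0.10\n"
    "2\t900\tT\tC\t0.00\t1.00\n"
)

# Header starts with '#' so that an indexer can treat it as a comment line.
HEADER = ["#chrom", "pos", "ref", "alt", "kaviar_af", "sift", "polyphen"]


def read_keyed(text):
    """Index the rows of a TSV string by (chrom, pos, ref, alt)."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
        rows[key] = row
    return rows


def merge_sources(kaviar_text, dbnsfp_text):
    """Outer-join both sources on the variant key; '.' marks a missing value."""
    kaviar = read_keyed(kaviar_text)
    dbnsfp = read_keyed(dbnsfp_text)
    merged = []
    # Sorting the keys groups rows by chromosome and orders them by position,
    # which is what a position-based index needs.
    for key in sorted(set(kaviar) | set(dbnsfp)):
        chrom, pos, ref, alt = key
        k = kaviar.get(key, {})
        d = dbnsfp.get(key, {})
        merged.append([
            chrom, str(pos), ref, alt,
            k.get("kaviar_af", "."),
            d.get("sift", "."),
            d.get("polyphen", "."),
        ])
    return merged


def to_tsv(rows):
    """Render the merged rows as tab-separated text with a header line."""
    lines = ["\t".join(HEADER)] + ["\t".join(r) for r in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_tsv(merge_sources(KAVIAR_TSV, DBNSFP_TSV)), end="")
```

A file written this way could then be compressed with `bgzip` and indexed with something like `tabix -s 1 -b 2 -e 2 merged.tsv.gz`, so that updating the data would amount to swapping the file and its index.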

pnrobinson commented 8 years ago

Kaviar seems a really useful resource (btw the dbSNP in Kaviar is Version 146 not 132), and I agree it would be great to explore this option. I suspect that the majority of our users would be comfortable with this, and we will also have demo versions online etc.

visze commented 8 years ago

What about ExAC?

julesjacobsen commented 8 years ago

@pnrobinson Cool - I'll put it on the list then. @visze ExAC is in dbNSFP. I plan to merge the relevant data in dbNSFP and Kaviar.

damiansm commented 8 years ago

But ExAC is the no. 1 source everyone wants - we will always need that?

damiansm commented 8 years ago

Looks like it does not have ExAC though?

On Thu, Oct 27, 2016 at 12:06 PM, Jules Jacobsen wrote:

Kaviar http://db.systemsbiology.net/kaviar (ref http://bioinformatics.oxfordjournals.org/content/27/22/3216.full) looks like a decent contender for variant frequencies.

drseb commented 8 years ago

Looks like it does not have ExAC though?

But http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl?show=sources shows the item "63000exomes". Isn't this ExAC?

julesjacobsen commented 8 years ago

@damiansm Kaviar or dbNSFP? dbNSFP has ExAC.

damiansm commented 8 years ago

But only for non-synonymous SNPs. We need the frequencies for all the other types.

julesjacobsen commented 8 years ago

Sorry, keep forgetting that.

So going to use Kaviar, dbNSFP and ExAC

damiansm commented 8 years ago

Hold on - Seb may be right that Kaviar has ExAC. It just had a daft label.

The big question is whether Kaviar is respected in the community, trusted etc - especially if they are doing the hg38 liftover for everything.

julesjacobsen commented 8 years ago

Kaviar sources: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl?show=sources

63000exomes links to ExAC (http://exac.broadinstitute.org/), so yes, it does have it.

damiansm commented 8 years ago

So we just need to decide if Kaviar is good. What have people heard about it at ASHG etc.? Do people use it?

pnrobinson commented 8 years ago

Kaviar seems to be much less well known in the community than ExAC, but it actually provides even more data (especially for the bits of the genome outside the exome). I have heard that ExAC will be providing WGS data next year. However, I think that Kaviar has sufficient "brand recognition" (e.g., the Institute for Systems Biology and the authors) that we can use it without too many worries. I had been wanting to suggest this for some time now actually.

damiansm commented 8 years ago

Cool - let's go for it then, Jules. I guess we can do some QC of the ExAC variants to check they are all included and lifted over accurately.

Cheers Damian
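One way the QC mentioned above might look, as a rough Python sketch rather than anything the thread settled on. It assumes both variant sets have already been loaded as `(chrom, pos, ref, alt)` tuples on the same genome build; the function name and the example variants are hypothetical. It only checks that each ExAC variant is present in Kaviar. Verifying the liftover itself would need an independent set of lifted coordinates to compare against.

```python
def check_exac_coverage(exac_variants, kaviar_variants):
    """Return the ExAC variants absent from Kaviar and the fraction covered.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples that are
    assumed to be on the same genome build.
    """
    exac = set(exac_variants)
    kaviar = set(kaviar_variants)
    missing = sorted(exac - kaviar)
    coverage = 1.0 if not exac else 1 - len(missing) / len(exac)
    return missing, coverage


# Hypothetical example data, not real ExAC or Kaviar records.
EXAC = [
    ("1", 12345, "A", "G"),
    ("1", 20000, "C", "T"),
    ("2", 500, "G", "A"),
    ("2", 900, "T", "C"),
]
KAVIAR = [
    ("1", 12345, "A", "G"),
    ("1", 20000, "C", "T"),
    ("2", 500, "G", "A"),
    ("3", 42, "G", "C"),
]

if __name__ == "__main__":
    missing, coverage = check_exac_coverage(EXAC, KAVIAR)
    print(f"coverage: {coverage:.2%}, missing: {missing}")
```

Any variants reported as missing would be the ones to inspect by hand, since they could point to either a dropped record or a coordinate that moved during liftover.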

damiansm commented 7 years ago

It appears that Kaviar doesn't retain the source of the frequency data, though, so we would lose the ability to do population filtering. We decided to just keep the existing sources for now.
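To illustrate why losing the per-source breakdown matters, here is a small Python sketch of a frequency filter that needs to know which source each frequency came from. The source labels (`ExAC_AFR`, `ExAC_NFE`, `ESP_EA`), the frequencies and the function name are all made up for the example, and this is not Exomiser's actual filter.

```python
def passes_frequency_filter(freqs_by_source, max_af, sources=None):
    """Keep a variant only if it is rare in every selected source.

    freqs_by_source maps a source/population label to an allele frequency.
    If sources is given, only those labels are considered. A variant with
    no frequency data in the selected sources passes.
    """
    if sources is None:
        selected = freqs_by_source
    else:
        selected = {s: af for s, af in freqs_by_source.items() if s in sources}
    return all(af <= max_af for af in selected.values())


# Hypothetical variant: common in one population, rare in the others.
per_source = {"ExAC_AFR": 0.05, "ExAC_NFE": 0.0001, "ESP_EA": 0.0002}
# The same variant if only a single pooled frequency were available
# (the pooled value is invented for illustration).
pooled_only = {"pooled": 0.004}

if __name__ == "__main__":
    print(passes_frequency_filter(per_source, max_af=0.01))
    print(passes_frequency_filter(per_source, max_af=0.01,
                                  sources={"ExAC_NFE", "ESP_EA"}))
    print(passes_frequency_filter(pooled_only, max_af=0.01))
```

With per-source frequencies the variant can be rejected because it is common in one population, or kept when the filter is restricted to selected populations. With only a pooled value that distinction is gone, which is the limitation described above.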