exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Enable more frequent ClinVar data updates in Exomiser database #501

Closed julesjacobsen closed 3 months ago

julesjacobsen commented 1 year ago

Monthly / quarterly frequency? File format?

Related to #462 , #473

julesjacobsen commented 11 months ago

This has been implemented as a new H2 MVStore index created from the ClinVar clinvar.vcf.gz file. Previously this file was parsed twice during the variant data build - once to include the ClinVarData in the variants.mv.db file and a second time to produce the clinvar_whitelist.tsv.gz file.

The downside of this was that users had to hack the provided clinvar_whitelist with their own data each time there was a new release. Also, the variants.mv.db file was 99.9% identical from one data release to the next, as only the ClinVar data included in it was actually updated; the rest was mostly static data.

By separating out the ClinVar data (and providing easy downloads), it can be updated monthly following a ClinVar release, with users only needing to download a ~55MB clinvar.mv.db file and update the version in the application.properties. Users who don't need the latest ClinVar data can carry on with the current workflow; no action is required. The whitelist is now loaded from a user-supplied whitelist, should they have one, and this is merged with a set loaded dynamically by filtering the clinvar.mv.db file. Users can disable using ClinVar as a whitelist source using exomiser.hg38.use-clinvar-whitelist=false (or the hg19 equivalent). By default this is set to true and is hidden.
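As a rough illustration of the merge behaviour described above, here is a minimal Python sketch. It is not Exomiser's implementation (which is Java): the `VariantKey` tuple and the `build_whitelist` function are hypothetical names invented for this example, and the ClinVar-derived set is assumed to have already been filtered from the clinvar.mv.db file.

```python
from typing import Iterable, Set, Tuple

# Hypothetical variant key: (contig, position, ref, alt).
# Exomiser's real key type is not shown in this thread.
VariantKey = Tuple[str, int, str, str]


def build_whitelist(user_variants: Iterable[VariantKey],
                    clinvar_variants: Iterable[VariantKey],
                    use_clinvar_whitelist: bool = True) -> Set[VariantKey]:
    """Merge a user-supplied whitelist with the ClinVar-derived set.

    `use_clinvar_whitelist` mirrors exomiser.hg38.use-clinvar-whitelist:
    when False, only the user-supplied variants are kept.
    """
    whitelist = set(user_variants)
    if use_clinvar_whitelist:
        whitelist.update(clinvar_variants)
    return whitelist


# Made-up example variants, for illustration only.
user = [("1", 12345, "A", "G")]
clinvar = [("1", 12345, "A", "G"), ("2", 67890, "C", "T")]

print(sorted(build_whitelist(user, clinvar)))
print(sorted(build_whitelist(user, clinvar, use_clinvar_whitelist=False)))
```

A variant present in both sources appears once in the merged set, so a user whitelist that overlaps with ClinVar needs no special handling.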

The latest 2302 release hg38 data directory looks like this:

2302_hg38
├── 2302_hg38_clinvar_whitelist.tsv.gz
├── 2302_hg38_clinvar_whitelist.tsv.gz.tbi
├── 2302_hg38_genome.h2.db
├── 2302_hg38_transcripts_ensembl.ser
├── 2302_hg38_transcripts_refseq.ser
├── 2302_hg38_transcripts_ucsc.ser
└── 2302_hg38_variants.mv.db

The updated hypothetical 2307_hg38 data release directory looks like this:

2307_hg38
├── 2307_hg38_clinvar.mv.db  # just the one file now, although it's a binary blob
├── 2307_hg38_genome.h2.db
├── 2307_hg38_transcripts_ensembl.ser
├── 2307_hg38_transcripts_refseq.ser
├── 2307_hg38_transcripts_ucsc.ser
└── 2307_hg38_variants.mv.db

The next month, when a new ClinVar release is built...

2307_hg38
├── 2308_hg38_clinvar.mv.db  # replaced the 2307 with the 2308 version
├── 2307_hg38_genome.h2.db
├── 2307_hg38_transcripts_ensembl.ser
├── 2307_hg38_transcripts_refseq.ser
├── 2307_hg38_transcripts_ucsc.ser
└── 2307_hg38_variants.mv.db

Important! The application.properties file or ENV should be updated to use the new version!

exomiser.hg38.clinvar-data-version=2308

Logged on startup:

2023-07-14T17:50:01.343+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.data-directory: /data/exomiser-data
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg19.data-version: -
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg38.data-version: 2307
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.hg38.clinvar-data-version: 2308
2023-07-14T17:50:01.344+01:00  INFO 274856 --- [           main] o.m.e.a.ExomiserConfigReporter           : exomiser.phenotype.data-version: 2307
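Since the ClinVar version can now differ from the main data version, a quick check that the configured file actually exists may save a failed startup. The sketch below is Python and purely illustrative: the `clinvar_file` helper is a hypothetical name, and the path layout (`<data-version>_<assembly>/<clinvar-version>_<assembly>_clinvar.mv.db`) is inferred from the directory listings above rather than taken from Exomiser's code.

```python
from pathlib import Path


def clinvar_file(data_dir: Path, data_version: str, clinvar_version: str,
                 assembly: str = "hg38") -> Path:
    """Return the path where the ClinVar store is expected to live.

    Assumed layout, inferred from the listings in this thread:
    <data_dir>/<data_version>_<assembly>/<clinvar_version>_<assembly>_clinvar.mv.db
    """
    release_dir = data_dir / f"{data_version}_{assembly}"
    return release_dir / f"{clinvar_version}_{assembly}_clinvar.mv.db"


# Values taken from the startup log above.
path = clinvar_file(Path("/data/exomiser-data"), "2307", "2308")
print(path, "exists" if path.is_file() else "is missing")
```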

@pnrobinson note that this will require changes to LIRICAL

julesjacobsen commented 11 months ago

Still need to set up a cron and provide the data.

wsstoregene commented 8 months ago

Hello Jules, Could you please provide a more detailed guideline about generating the 2308_hg38_clinvar.mv.db file from "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20231021.vcf.gz"? To summarise what you mentioned above: after generating this file, we can just replace the whitelist file with this new one and change the following parameters in application.properties?

Thank you in advance for your help.

Kind Regards

julesjacobsen commented 8 months ago

Hi @wsstoregene, this will be delivered in the next major version, so it's not ready yet unless you're building your own exomiser and running it from the development branch.

To generate the new file you'll need to use this command:

$ java -jar exomiser-data-genome-14.0.0-SNAPSHOT.jar --build-dir=. --assembly hg38 --version 2311 --clinvar

This will create a directory called 2311_hg38 containing the file 2311_hg38_clinvar.mv.db. You'll need to move this into your current exomiser hg38 data directory and then update the version to use in the application.properties:

exomiser.hg38.clinvar-data-version=2311

You should always run with exomiser.hg38.use-clinvar-whitelist=true, as otherwise you're losing the benefit of having the ClinVar annotations used for scoring known, high-quality P/LP variants.
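If the monthly update is going to be scripted, the version bump in application.properties can be automated too. The following Python sketch shows one way to do it under simple assumptions: `set_property` is a hypothetical helper, it only handles plain `key=value` lines (no line continuations or escaped keys), and the sample properties text is made up for illustration.

```python
import re


def set_property(properties_text: str, key: str, value: str) -> str:
    """Return properties_text with `key` set to `value`.

    Replaces the first uncommented `key=...` line, or appends the
    property if the key is not present.
    """
    pattern = re.compile(rf"^{re.escape(key)}\s*=.*$", flags=re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(properties_text):
        # A callable replacement avoids backslash-escape surprises in `value`.
        return pattern.sub(lambda _: new_line, properties_text, count=1)
    separator = "" if properties_text.endswith("\n") or not properties_text else "\n"
    return f"{properties_text}{separator}{new_line}\n"


# Illustrative properties content; 2311 matches the --version used in the build command.
original = (
    "exomiser.hg38.data-version=2307\n"
    "exomiser.hg38.clinvar-data-version=2308\n"
)
updated = set_property(original, "exomiser.hg38.clinvar-data-version", "2311")
print(updated)
```

Only the ClinVar version line changes; exomiser.hg38.data-version is left alone because the rest of the data directory has not been rebuilt.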

Be aware that this is still subject to change as we're also doing work on adding more ACMG categories (#473) and it is likely that the data will need to be annotated for the variant effect as well.

julesjacobsen commented 7 months ago

This now requires a transcript data file too, in order to annotate the variant consequence, i.e.:

$ java -jar exomiser-data-genome-14.0.0-SNAPSHOT.jar --assembly hg38 --version 2311 --clinvar path/to/2309_hg38/2309_hg38_transcripts_ensembl.ser
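For anyone wrapping this in a monthly job, the command can be assembled programmatically. This Python sketch only builds the argument list shown above and does not run it; `clinvar_build_command` is a hypothetical helper, and the flags are copied from the command in this comment, so they may change before the release.

```python
from typing import List


def clinvar_build_command(jar: str, assembly: str, version: str,
                          transcripts: str) -> List[str]:
    """Build the argument list for the ClinVar data build shown above.

    Mirrors the command in this comment; flags are not verified against
    any released version of the data build jar.
    """
    return [
        "java", "-jar", jar,
        "--assembly", assembly,
        "--version", version,
        "--clinvar", transcripts,
    ]


cmd = clinvar_build_command(
    "exomiser-data-genome-14.0.0-SNAPSHOT.jar",
    "hg38",
    "2311",
    "path/to/2309_hg38/2309_hg38_transcripts_ensembl.ser",
)
print(" ".join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```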