ding-lab / CharGer

Characterization of Germline variants
https://ding-lab.github.io/CharGer/
GNU General Public License v3.0

All Uncertain Significance #18

Closed ekofman closed 5 years ago

ekofman commented 5 years ago

Hi -- I was able to successfully run CharGer, but all variants are classified as "Uncertain Significance." Is there a setting that must be tweaked for variant annotation to work correctly? I'm guessing at least some of the variants we have must not be "Uncertain Significance." I'm using the docker image on DockerHub and the command "charger -f /mount/sample_id.vt2_normalized_spanning_alleles.vcf -o /mount/charger_annotated.tsv"

fernanda-rodrigues commented 5 years ago

Hi @ekofman

Thank you for your question and for using our tool.

The reason all your variants are being classified as uncertain significance is that you're not passing any additional parameters to CharGer. Put simply: the more information about your variants you give CharGer, the better your variant classification will be. For more information on the ACMG guidelines implemented in CharGer, please read: https://www.nature.com/articles/gim201530.pdf

Is your input vcf file annotated with VEP? If not, you can VEP annotate your file within CharGer (please refer to README). This should improve your results a bit.

Adding some of the different parameters described in our README file should also make your analysis more precise.

For example, you can have CharGer access the ClinVar database by using the -l flag together with the --mac-clinvar-tsv file, which you can download from the MacArthur lab GitHub page (https://github.com/macarthur-lab/clinvar/tree/5b04ade4fb4d2f13ffd39e4a8d9ade9af28fdaf9). This will allow CharGer to gather information for your variants from the ClinVar database and improve variant classification. CharGer will soon allow input files downloaded directly from ClinVar, but you can use the MacArthur lab file for now.
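As a rough sketch of what that looks like in practice, the snippet below only assembles the argument list and prints it; it does not run CharGer. The -f, -o, -l and --mac-clinvar-tsv options are the ones mentioned in this thread; the ClinVar file path is a hypothetical placeholder, and the README remains the reference for exact option spelling in your version.

```python
import shlex


def build_charger_command(vcf_path, out_path, clinvar_tsv=None):
    """Assemble a CharGer argument list; nothing is executed here."""
    cmd = ["charger", "-f", vcf_path, "-o", out_path]
    if clinvar_tsv is not None:
        # -l enables the ClinVar lookup; --mac-clinvar-tsv points at the
        # MacArthur lab table downloaded beforehand.
        cmd += ["-l", "--mac-clinvar-tsv", clinvar_tsv]
    return cmd


cmd = build_charger_command(
    "/mount/sample_id.vt2_normalized_spanning_alleles.vcf",
    "/mount/charger_annotated.tsv",
    clinvar_tsv="/mount/clinvar_alleles.tsv.gz",  # hypothetical path
)
print(shlex.join(cmd))
```

The printed line can be pasted into the shell inside the docker container, or the list can be handed to `subprocess.run` if you prefer to drive CharGer from a script.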

You can also input some of the cross-reference data files described, or an allele frequency threshold for rarity (please refer to README). For an example of the CharGer tool being applied to one of our studies, please refer to our PanCan Atlas germline paper: https://www.sciencedirect.com/science/article/pii/S0092867418303635?via%3Dihub#sec4 The cross-reference data files used in this study (a pathogenic variants .vcf file, an inheritanceGeneList with 152 known cancer predisposition genes, and a HotSpot3D clusters file) are available here: https://github.com/ding-lab/CharGer/tree/master/PanCanAtlasData These files should give you a good example of their expected formats. For a more in-depth description of some of the cross-reference data files you can use as input, please read below:

-z pathogenic variants, .vcf : this is a .vcf file with known pathogenic variants that you may compile yourself. This list is taken into account by CharGer when implementing the PS1 and PM5 ACMG evidence levels. Depending on your study, you may compile a list of known pathogenic variants (confirmed in the literature and/or ClinVar) that are specific and/or relevant to your disease.

-e expression matrix file, .tsv : this is a .tsv file with a column for each sample and a row for each gene. If you have expression data for the genes you’re targeting or genes of interest, you can generate a matrix like this using RSEM, for example. If you do not input an expression matrix, CharGer will allow eligible truncations in your data set without expression data in the PVS1 evidence level. If you provide expression data, a threshold of 0.2 is used. If expression is lower than the threshold, truncation is allowed in the PVS1 evidence level. Note that the PVS1 evidence level requires the mode of inheritance to be dominant (assuming heterozygosity) and co-occurrence with reduced gene expression if expression data is provided.
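To make the expression rule concrete, here is a minimal sketch of the decision as described above. It is an illustration only, not CharGer's actual code: the function name is hypothetical, and the scale on which the 0.2 threshold is measured is not stated in this thread, so the values passed in are assumed to already be on that scale.

```python
EXPRESSION_THRESHOLD = 0.2  # value quoted in the description above


def truncation_allowed(expression=None, threshold=EXPRESSION_THRESHOLD):
    """Return True if a truncating variant may count toward PVS1.

    expression is None when no expression matrix was supplied.
    """
    if expression is None:
        # No expression data: eligible truncations are allowed as-is.
        return True
    # Expression data present: require expression below the threshold.
    return expression < threshold


print(truncation_allowed())      # no matrix supplied
print(truncation_allowed(0.05))  # low expression
print(truncation_allowed(0.80))  # expression above the threshold
```

The dominant-inheritance requirement mentioned above is a separate condition and is deliberately left out of this sketch.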

--inheritanceGeneList: this is a tab-delimited file that should contain three columns: gene, disease, and mode of inheritance (autosomal dominant, autosomal recessive). Make sure to use approved HUGO symbols. This file should be used when you have a list of known predisposition genes you would like to input to CharGer. This list is taken into account by several evidence levels (PVS1, PSC1, PM4, PP2, and PPC1).
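A small sketch of producing such a three-column, tab-delimited file is shown below. The rows are illustrative placeholders, and the exact spelling CharGer expects for the mode of inheritance (and whether a header line is wanted) is an assumption here; comparing against the example file in PanCanAtlasData is the safest check.

```python
import csv
import io

# Assumed spellings, taken from the description above.
ALLOWED_MODES = {"autosomal dominant", "autosomal recessive"}


def write_inheritance_gene_list(rows, handle):
    """Write (gene, disease, mode of inheritance) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, disease, mode in rows:
        if mode not in ALLOWED_MODES:
            raise ValueError(f"unexpected mode of inheritance for {gene}: {mode!r}")
        writer.writerow([gene, disease, mode])


# Illustrative rows only; build your own list with approved HUGO symbols.
rows = [
    ("BRCA1", "breast cancer", "autosomal dominant"),
    ("MUTYH", "colorectal cancer", "autosomal recessive"),
]

buffer = io.StringIO()
write_inheritance_gene_list(rows, buffer)
print(buffer.getvalue(), end="")
```

Swapping the `io.StringIO` buffer for `open("inheritanceGeneList.txt", "w", newline="")` writes the same content to disk; that file name is also just a placeholder.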

--PP2 Gene list: this is just a file with one gene per line (be sure to use approved HUGO symbols). Following the ACMG guidelines description, this list should include susceptibility genes that have a low rate of benign missense variation and in which missense variants are a common mechanism of disease. Missense variants in any of these genes will fall into the PP2 evidence level.

--BP1 Gene list: same format as the PP2 Gene list. Following the ACMG guidelines description, this list should include genes for which primarily truncating variants are known to cause disease. Missense variants falling in any of these genes will fall into the BP1 evidence level.
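The two gene lists share the same one-gene-per-line format, so a short sketch can cover both. The code below is an illustration of the rule as described, not CharGer's implementation; the function names and the gene symbols are hypothetical placeholders.

```python
def load_gene_list(text):
    """Parse one-gene-per-line text into a set, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def missense_gene_evidence(gene, pp2_genes, bp1_genes):
    """Return the gene-list evidence tags a missense variant in `gene` would get."""
    tags = []
    if gene in pp2_genes:
        tags.append("PP2")  # missense is a common disease mechanism in this gene
    if gene in bp1_genes:
        tags.append("BP1")  # disease in this gene is mainly caused by truncations
    return tags


# Placeholder symbols; real lists should use approved HUGO symbols.
pp2_genes = load_gene_list("GENE_A\nGENE_B\n")
bp1_genes = load_gene_list("GENE_C\n\n")

print(missense_gene_evidence("GENE_A", pp2_genes, bp1_genes))
print(missense_gene_evidence("GENE_C", pp2_genes, bp1_genes))
print(missense_gene_evidence("GENE_Z", pp2_genes, bp1_genes))
```

A gene would normally belong to at most one of the two lists, since PP2 counts toward pathogenicity and BP1 toward a benign call.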

-n de novo file: this is a standard maf (mutation annotation format) file; this file should contain de novo variants with maternity and paternity confirmation and no family history. This file, if provided, is taken into account in the PS2 ACMG evidence level. If you have this information for your dataset, please provide it using this argument.

-a assumed de novo file: this is a standard maf file as above; this file should contain assumed de novo variants from your dataset, i.e. variants for which you have evidence that they are de novo but no maternity or paternity confirmation. This file, if provided, is taken into account in the PM6 ACMG evidence level.

-c co-segregation file: this is also a standard maf file; this file should include variants cosegregating with disease in multiple affected family members in a gene definitively known to cause the disease (according to ACMG guidelines).

-H HotSpot3D clusters file: this file can be generated by our HotSpot3D tool (https://github.com/ding-lab/hotspot3d), which identifies mutation hotspots from linear protein sequence and correlates the hotspots with known or potentially interacting domains, mutations, or drugs. If provided, this file is taken into account in the PM1 evidence level. If a germline variant is located in a mutational hot spot and/or critical and well-established functional domain (e.g. active site of an enzyme) without benign variation, then the variant is flagged with a pathogenic characterization of PM1. An example of this file, which was used in our PanCan study, is present here: https://github.com/ding-lab/CharGer/tree/master/PanCanAtlasData

Applying some of these parameters and files should improve your results. Hope this helps. Please let us know if you have any additional questions.