awilfert / PSAP-pipeline

14 stars 9 forks source link

Error during PSAP annotation #4

Closed jpfeil closed 7 years ago

jpfeil commented 7 years ago

Hello,

I am trying to run the PSAP pipeline using a somatic variant callset, but I'm getting an error during the PSAP annotation step. Here is the traceback:

Read 19714 items [1] 5794 26 Read 162 items [1] "data cleaned, beginning annotation" [1] 0 13 [1] "calculating summary scores" [1] "AD model complete" [1] "AR-hom model complete" [1] "AR-het model complete" [1] 0 0 [1] "processing data" Error in [<-.data.frame(*tmp*, which(out$popScore == 0), "popScore", : replacement has 1 row, data has 0 Calls: [<- -> [<-.data.frame Execution halted

Is the script filtering out all of my variants? Does this pipeline work for somatic variants?

Thank you!

awilfert commented 7 years ago

Hi @jpfeil!

Based on the output you've posted it looks like the pipeline is removing all of your variants. We implement a few filtering steps to ensure the variants included in the analyses are comparable to those used to generate our null distributions. Specifically, we remove 1) variants in genes that are in a blacklist that was generated and represent genes with poorly calibrated PSAPs, 2) variants with discrepant allele frequencies between ExAC and 1000 Genomes or ESP (eg. missing in ExAC but common in ESP or 1000 Genomes), 3) variants in genes that are not represented in our lookup tables, 4) variants that are not present in any affected individuals, and variants in genes/genomic regions that are not covered by ExAC (eg. intergenic, intronic, UTRs). We also remove variants that don't have PASS in the FILTER column of the VCF. These filtering steps should be compatible with somatic variants, although I will note that our null models are generated from germline variants, so somatic variants may look more interesting than they actually are.

When you generate your pedigree make sure you are coding the column with your somatic variant calls as affected in column 6 of the pedigree file. You may also want to check the ANNOVAR output file (your_file.avinput_hg19.multianno.txt) to make sure you have variants that meet all of the criteria above. Finally, you can try checking the outout in the FILTER column of your VCF file. We assume that the VCF input will be generated by GATK or similar with a FILTER column that marks variants as "PASS" or not. The pipelinr currently excludes variants that do not have "PASS" in the FILTER column. You can change the FILTER criteria for the pipeline by either altering line 162 in the (generic_apply_popStat.R for family analyses, or apply_popStat_individual.R for individual analyses), or by changing whatever is currently in the FILTER column to PASS.

Hope this helps!