KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0
68 stars 16 forks source link

number of snps before and after first out of imputation prior to phasing #49

Closed tbrunetti closed 3 months ago

tbrunetti commented 3 months ago

I have a question regarding the first imputation step using Beagle. My experience with imputation is strictly from a GWAS perspective where imputation from a large DNA array panel will result in the imputation of many, many more SNPs. However, in the case with scRNAseq, I actually get much much fewer snps post imputation than from the input germline file. For example, the number of snps in the germline calls after samtools mpileup commands results in 8,039,600 SNPs for chromosome 10. However after imputing against TGP, I get 275,752 SNPs (not yet phased).

Is this because we are calling all variants from mpileup on RNA so the variant calls may be inaccurate or too low of a minor allele frequency that after TGP imputation or strand discrepencies that exist, they are getting filtered out? Or is this an anomaly and there is something wrong?

The exact command I used for chr 10 is the following (in case it helps I have ~320 samples in my vcf files):

samtools mpileup -b chr10.filter.bam.lst -f genome.fa -r chr10 -q 20 -Q 20 -t DP,DPR,DV,DP4,INFO/DPR,SP -d 10000000 -v  | }bcftools view  | bcftools norm -m-both -f genome.fa| grep -v '<X>' | grep -v INDEL | $bgzip -c -@ 1 > chr10.gl.vcf.gz

java -Xmx20g -jar beagle.27Jul16.86a.jar gl=chr10.gl.vcf.gz ref=CCDG_14151_B01_GRM_WGS_2020-08-05_chr10.filtered.shapeit2-duohmm-phased.vcf.gz chrom=chr10 out=chr10.gp impute=false modelscale=2 nthreads=24 gprobs=true niterations=0

I added a few extra tag calculations to the samtools command but really that is it...

jinzhuangdou commented 3 months ago

We set the imputation option with impute=false. In such option, beagle only performs genotyping refinement on variants overlapped between scRNA-seq with 1KG3. Although we have many variants in *.gl.vcf.gz, most of them are sequencing error/RNA splicing etc. Restricting variants with 1KG3 can ensure the genotyping quality.

In addition, impute=false means we only perform genotype refinement based on LD from 1KG3. The variants listed in 1KG3 but not detected in scRNA-seq sample will not be genotyped.

tbrunetti commented 3 months ago

Thank you that makes perfect sense! I had a feeling that was the case but I wanted to make sure I was interpreting that correctly.