KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0

Questions about genotype phasing for each SNV in each cell #75

Open Li-Chengyu opened 2 weeks ago

Li-Chengyu commented 2 weeks ago

Hi! Dr. Dou,

Thank you for developing Monopogen, a great tool for detecting both germline and somatic SNVs in single-cell sequencing data. We found it very efficient at calling somatic mutations in human brain snATAC-seq data, and we plan to modify the Monopogen scripts to better suit our own data. While studying your scripts, I ran into a few parts that I found hard to understand.

  1. In the script somatic.py, line 298: _mat=mat.groupby(by=['snvIndex','cellIndex'], as_index=False).first()

    You keep only the first allele record when scanning read coverage for each SNV in each cell, given the widespread allelic dropout in single-cell sequencing data. But in our snATAC-seq data, 8% of SNVs are still covered by both reference and alternative reads within a single cell. Is this proportion small enough to ignore in the downstream analysis, or should we assign the value 1 to an SNV in a cell when both reference and alternative reads are observed?

  2. In the script somatic.py, line 309 to 337:

    You phase the genotype of each germline SNV in each cell, but why is the phased genotype flipped when only reference reads (value 0) are observed in the cell? As I understand it, the phased genotype of a germline SNV should be the same across all cells.
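
To make question 1 concrete, here is a minimal pandas sketch on a toy table. It is not taken from somatic.py: only the column names snvIndex and cellIndex come from the quoted line, while the allele column (0 = reference read, 1 = alternative read) and the toy values are assumptions for illustration. It shows what `groupby(...).first()` keeps, and one possible way to count reads per allele so that (SNV, cell) pairs covered by both alleles can be flagged instead of dropped.

```python
import pandas as pd

# Toy read-level table. "allele" is a hypothetical column name:
# 0 = read supports the reference allele, 1 = read supports the alternative.
mat = pd.DataFrame({
    "snvIndex":  [0, 0, 0, 1, 1],
    "cellIndex": [0, 0, 1, 0, 0],
    "allele":    [0, 1, 1, 1, 1],
})

# Behaviour of the quoted line: one row per (SNV, cell); the first record wins,
# so the alt read of SNV 0 in cell 0 is discarded.
first_only = mat.groupby(by=["snvIndex", "cellIndex"], as_index=False).first()

# Alternative sketch: count ref and alt reads per (SNV, cell) and flag pairs
# where both alleles were observed.
counts = (
    mat.groupby(["snvIndex", "cellIndex"])["allele"]
       .agg(n_alt="sum", n_reads="count")
       .reset_index()
)
counts["n_ref"] = counts["n_reads"] - counts["n_alt"]
counts["both_alleles"] = (counts["n_ref"] > 0) & (counts["n_alt"] > 0)

# Fraction of covered (SNV, cell) pairs that carry both alleles.
frac_both = counts["both_alleles"].mean()
```

Running the same count on real data would show how the 8% figure is distributed across SNVs and cells before deciding whether to ignore it.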

Looking forward to your reply!

Sincerely, Chengyu

jinzhuangdou commented 1 week ago

1) We usually observe only one allele in one cell. If 8% of the SNVs in your data are covered by both, you may keep both when converting the BAM files to the matrices. We will upgrade this function in the future.

2) Yes, the phased genotype is the same across cells for a given SNV. In the element phase_info, x|x, the left value denotes the number of reads supporting the reference allele and the right value the number supporting the alternative allele. If your phased genotype is 1(alt)|0(ref) and you observe one ref allele in a cell, it can be written as 0 (alt allele number)|1 (ref allele number).
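
The worked example in point 2 can be sketched as a small function. This is only one reading of that example (phased genotype 1|0 plus one ref read gives 0|1), under the assumption that the two sides of the string are read counts for the left and right haplotypes; `haplotype_counts` is a hypothetical helper and is not part of Monopogen.

```python
def haplotype_counts(phased_gt: str, observed_allele: int, n_reads: int = 1) -> str:
    """Return 'left|right' read counts per haplotype for one cell.

    Hypothetical helper: phased_gt is a string such as '1|0' (0 = ref, 1 = alt),
    observed_allele is the allele seen in the cell (0 or 1).
    """
    if observed_allele not in (0, 1):
        raise ValueError("observed_allele must be 0 (ref) or 1 (alt)")
    left, right = (int(a) for a in phased_gt.split("|"))
    if left == right:
        # A homozygous site cannot tell the two haplotypes apart.
        raise ValueError("reads cannot be assigned to a haplotype at a homozygous site")
    # Reads are credited to whichever haplotype carries the observed allele.
    left_count = n_reads if observed_allele == left else 0
    right_count = n_reads if observed_allele == right else 0
    return f"{left_count}|{right_count}"


ref_read = haplotype_counts("1|0", 0)  # one ref read lands on the right haplotype
alt_read = haplotype_counts("1|0", 1)  # one alt read lands on the left haplotype
```

Under this reading the genotype itself never changes between cells; only the per-cell counts differ depending on which allele was observed.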