PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License
29 stars 4 forks source link

Understanding CNV Distribution and Counts Across Haplotypes #11

Closed sloth-eat-pudding closed 3 months ago

sloth-eat-pudding commented 10 months ago

I have recently begun studying CNVs (Copy Number Variations) and am utilizing the hg002 dataset for paraphase analysis. Upon comparing my paraphase results with the data available on your GitHub page, I noticed that the phase results for SMN1 vary, showing 1 to 3 haplotypes (haps). Specifically, in the case of HG00673, I am intrigued to understand whether the presence of only one haplotype corresponds to a situation of "(0+1)", which I interpret as one copy of SMN1 on one chromosome and none on the other. I am somewhat uncertain about this interpretation and would greatly appreciate your insights to clarify my understanding.

  | HG002 | HG00673 | HG01175 | HG01884 | HG02300 | NA19238 -- | -- | -- | -- | -- | -- | -- Smn1hap1 | √ | √ | √ | √ | √ | √ Smn1hap2 | √ |   | √ | √ | √ | √ Smn1hap3 |   |   |   |   | √ | √ Copies | 1+1 | 1+0 | 1+1 | 1+1 | 1+2 | 1+2   |   |   |   |   |   |   Smn2del78hap1 |   |   | √ |   |   |   Smn2hap1 | √ | √ | √ | √ | √ | √ Smn2hap2 | √ | √ |   | √ |   |   Copies | 1+1 | 1+1 | 1+0 | 1+1 | 1+0 | 1+0

If my understanding is not accurate, could you please explain how I should interpret the results to determine the corresponding number of copies and the positions associated with the haplotypes? For instance, how can I distinguish between a 0+2 and a 1+1 situation?

Thank you very much for your support and guidance.

xiao-chen-xc commented 10 months ago

Hi @sloth-eat-pudding the copy numbers are reported in the smn1_cn, smn2_cn and smn2_del78_cn fields in the json file. Even if there is only SMN1 haplotype found, as in the case of HG00673, the SMN1 copy number is 2, because there are two identical SMN1 haplotypes based on depth (you can find this info in the two_copy_haplotypes field). So HG00673 is not a 0+1 carrier. It's only when smn1_cn is called as one that a sample is a carrier.

Please also refer to this page for interpreting SMN1/SMN2.

Paraphase reports summed copy numbers of two alleles, that is, it does not phase SMN1/2 copies into two alleles at the moment. So it does not directly distinguish between 1+1 vs. 0+2. In our paper, we showed that SMN1/2 copies can be assigned to haplogroups (reported in the json file under the haplotype_details field), which can be used for statistical phasing when more population data are summarized into a database in the future. For example, S1-8 and S1-9d are very likely to be on the same allele and they could suggest a silent carrier when a sample only has these two SMN1 copies. Our paper provides an estimate of the probabilities using a limited set of samples, and these calculations can be improved as more and more population samples are sequenced and analyzed.

sloth-eat-pudding commented 10 months ago

After running the Pacbio HG00673, I carefully reviewed the output JSON file and noticed that the smn1_cn value is listed as 1. Additionally, I observed that the two_copy_haplotypes field is empty. I am concerned about these findings and wonder if I might have made an error in the process.

If providing additional information would facilitate a more comprehensive understanding of the situation, please let me know. I am more than willing to supply any necessary details to resolve this issue.

xiao-chen-xc commented 10 months ago

Hi @sloth-eat-pudding, what is the input bam file you were using for Paraphase, and could you share with me the command you ran?

xiao-chen-xc commented 10 months ago

Could you share with me the json output by Paraphase?

sloth-eat-pudding commented 10 months ago

I have rechecked the downloaded unmapped BAM file and found that it is corrupted. This has caused issues with my results post realignment. The discrepancy in the results was due to my oversight, and I apologize for any inconvenience caused. I will now restart using …

This is my output JSON. You might not need it anymore.