PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License
29 stars 4 forks source link

Invalid value encountered in scalar divide? low or highly variable genome coverage #10

Closed evayfang2019 closed 8 months ago

evayfang2019 commented 11 months ago

Hi Dr. Chen,

When I ran the Paraphase, some errors occurred.

paraphase -b ./S3/SMN1B/01.1st.map.cal.phs/S3.bam \
    -o ./parapahse \
    -r hg38.fa \
    -g smn1 \
    -t 32 

Report as follows:

INFO:root:Processing sample S3 at 2023-10-24 20:36:37.726141... INFO:root:Getting genome depth for sample S3 at 2023-10-24 20:36:37.726362... /share/home/xyz/miniconda3/envs/pacbio/lib/python3.10/site-packages/paraphase/genome_depth.py:32: RuntimeWarning: invalid value encountered in scalar divide self.mad = np.median([abs(a - self.mdepth) for a in depth]) / self.mdepth WARNING:root:For sample S3, due to low or highly variable genome coverage, genome coverage is not used for depth correction. INFO:root:Running analysis for smn1 for sample S3 at 2023-10-24 20:36:37.811012... INFO:root:Realigning reads for smn1 for sample S3 at 2023-10-24 20:36:37.811390... INFO:root:Phasing haplotypes for smn1 for sample S3 at 2023-10-24 20:36:38.399611... WARNING:root:This region does not appear to have coverage. Will not attempt to phase haplotypes. INFO:root:Tagging reads for smn1 for sample S3 at 2023-10-24 20:36:38.572544... INFO:root:Generating VCFs for smn1 for sample S3 at 2023-10-24 20:36:38.573307... INFO:root:Merging all bams for sample S3 at 2023-10-24 20:36:38.621357... INFO:root:Writing to json for sample S3 at 2023-10-24 20:36:38.716192... INFO:root:Completed Paraphase analysis at 2023-10-24 20:36:38.723153... Tue Oct 24 20:36:38 CST 2023

The outputs: S3.json, S3_realigned_tagged.bam.bai, S3_realigned_tagged.bam, and a folder named "S3_vcfs" without file.

S3.json

{
    "smn1": {
        "smn1_cn": null,
        "smn2_cn": null,
        "smn2_del78_cn": null,
        "smn1_read_number": null,
        "smn2_read_number": null,
        "smn2_del78_read_number": null,
        "highest_total_cn": null,
        "smn1_haplotypes": null,
        "smn2_haplotypes": null,
        "smn2_del78_haplotypes": null,
        "assembled_haplotypes": null,
        "two_copy_haplotypes": null,
        "sites_for_phasing": null,
        "unique_supporting_reads": null,
        "het_sites_not_used_in_phasing": null,
        "homozygous_sites": null,
        "haplotype_details": null,
        "variant_genotypes": null,
        "nonunique_supporting_reads": null,
        "read_details": null,
        "final_haplotypes": null,
        "genome_depth": null
    }
}

The input bam file from:

PacBio sequel II generated the sequence data using target sequencing. Next, I used the pbccs, lima, pbmarkdup, and got S3.bam finally.

Could you please help me solve this problem? If possible, I can send you the .bam file.

THANK YOU!

Evayfang

xiao-chen-xc commented 11 months ago

Hi @evayfang2019, based on the output message, your bam has no or little coverage in the smn1/smn2 region. I'd be happy to take a look at your bam file if you could share it with me (either here or you could reach me at xchen@pacificbiosciences.com).

xiao-chen-xc commented 11 months ago

Hi @evayfang2019, thanks for sending the data over email. The reason that Paraphase didn’t produce results was that it’s expecting coverage across the whole gene but your data only has coverage in Exon2-6. The biggest problem with your data is that it’s missing coverage in Exon7, which contains the critical site to distinguish between SMN1 and SMN2. So even if we bypass that coverage check in Paraphase, it won’t give you any meaningful result. I would suggest that you add/optimize probes for Exon7-8 to get enough coverage there.

evayfang2019 commented 11 months ago

Hi Dr. Chen, thanks for your help and suggestions. By the way, the probes we designed also cover the CYP21A2 region. However, the same problem occurred when we ran the paraphrase. Could you help me again and check the CYP21A2 region?

image

xiao-chen-xc commented 11 months ago

Paraphase was originally designed to work with WGS data, so it expects coverage over the entire RCCX module (chr6:31980000-32046800). This includes C4A/C4B, CYP21A2/CYP21A1P and part of TNXB. It is very useful to examine the entire RCCX module because we can phase the different copies of the module into two alleles, especially when copy number changes are very common this region. Your data is lacking coverage in other genes of RCCX, so Paraphase thought there is not enough coverage for analysis. I changed Paraphase to bypass the coverage check and this is what I got.

s3_rccx

We can see that there are three haplotypes of (partial) RCCX here. The first one contains CYP21A1P and the other two contain CYP21A2. The third haplotype has a pathogenic variant in CYP21A2, Q319X. So it looks like we can recover CYP21A2 copies in your data. Paraphase V3 will do a more careful check of coverage before analysis and with the new version you will be getting this result. V3 will be released next week, so please check it out.

xiao-chen-xc commented 11 months ago

Including Paraphase result for a WGS dataset here so you can see the difference. If we have coverage over the entire RCCX, we can do better phasing, and we can further phase haplotypes into alleles. Green and purple represent two alleles here. This sample has three copies of RCCX on one allele (purple) and two of them are CYP21A2.

Screen Shot 2023-10-27 at 9 38 53 AM

evayfang2019 commented 11 months ago

@xiao-chen-xc, thanks for your enthusiastic help. Excellent and meaningful work! Next, we will add probes to get enough coverage as you suggested.

evayfang2019 commented 11 months ago

Paraphase was originally designed to work with WGS data, so it expects coverage over the entire RCCX module (chr6:31980000-32046800). This includes C4A/C4B, CYP21A2/CYP21A1P and part of TNXB. It is very useful to examine the entire RCCX module because we can phase the different copies of the module into two alleles, especially when copy number changes are very common this region. Your data is lacking coverage in other genes of RCCX, so Paraphase thought there is not enough coverage for analysis. I changed Paraphase to bypass the coverage check and this is what I got.

s3_rccx

We can see that there are three haplotypes of (partial) RCCX here. The first one contains CYP21A1P and the other two contain CYP21A2. The third haplotype has a pathogenic variant in CYP21A2, Q319X. So it looks like we can recover CYP21A2 copies in your data. Paraphase V3 will do a more careful check of coverage before analysis and with the new version you will be getting this result. V3 will be released next week, so please check it out.

Hi, @xiao-chen-xc. I am so excited that you updated the paraphrase today. However, the same problem appeared again. And there are no HP and YC tags in _sample_realignedtagged.bam. So, I can't visualize it as you showed last time.

xiao-chen-xc commented 11 months ago

Hi @evayfang2019, Paraphase V3 is not updated on conda yet. I'm still waiting for the recipe update PR to be approved. So what you were running is probably an earlier version (you can check with paraphase -v). Perhaps you could wait another day, update Paraphase and try again.

Update: Paraphase V3 is on conda now.