PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License

Dealing with uneven coverage #15

Open keremozdel opened 5 months ago

keremozdel commented 5 months ago

Hello,

Thank you for this great tool! I'm working with amplicon sequencing data capturing the SMN gene. However, my data show uneven coverage along the SMN1 gene, which has resulted in questionable phasing output. This is my first phasing experiment, and I was wondering whether you have any suggestions. For example, would downsampling the regions with higher coverage help? Also, could you briefly explain why alignment to the entire genome is required for phasing, rather than alignment to a targeted reference sequence? I'm quite new to this field, and it would help me understand the subject better.

Thank you very much for your guidance and insight.

xiao-chen-xc commented 5 months ago

Hi @keremozdel, Paraphase is designed to work with shotgun-type data such as WGS and hybrid capture data. Amplicon data is very different and should be analyzed differently. Instead of phasing reads into haplotypes, with amplicon data you can simply cluster reads into consensus groups, since they all start and end at the same positions. Are you capturing the SMN genes in just one amplicon? You can try the HiFi amplicon workflow (https://github.com/PacificBiosciences/hifi-amplicon-workflow). The clustering tool it uses is pbaa (https://github.com/PacificBiosciences/pbAA).
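To make the idea of "clustering reads into consensus groups" concrete, here is a toy Python sketch. It is not pbaa's algorithm; it assumes all reads from one amplicon have the same length (no indels) and uses a simple greedy Hamming-distance grouping. The function names and the `max_mismatches` parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatches=1):
    """Greedy clustering: a read joins the first cluster whose seed
    (its first read) is within max_mismatches; otherwise it starts
    a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            seed = cluster[0]
            if len(seed) == len(read) and hamming(seed, read) <= max_mismatches:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def consensus(cluster):
    """Column-wise majority vote over the reads in one cluster."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


# Two made-up haplotypes that differ at three positions,
# plus one read carrying a single sequencing error.
hap_a = "ACGTACGTAC"
hap_b = "ACCTACTTAG"
reads = [hap_a, hap_b, hap_a, "ACGTACGTAT", hap_b]

clusters = cluster_reads(reads)
consensuses = [consensus(c) for c in clusters]
```

In this toy input the read with the error is absorbed into the first cluster and outvoted, so the two consensus sequences equal the two haplotypes. Real amplicon data would need alignment-aware clustering, which is what a dedicated tool such as pbaa is for.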

keremozdel commented 5 months ago

Hi again @xiao-chen-xc ,

I'm working with multiplex PCR, which includes several amplicons. Would it be correct to say that amplicon data does not provide sufficient resolution for accurate phasing? I will look into the clustering as you suggested.

Thank you for your help.

xiao-chen-xc commented 5 months ago

> Would it be correct to say that amplicon data does not provide sufficient resolution for accurate phasing?

No, you can get accurate haplotypes out of amplicon data through clustering. With multiple amplicons, you can cluster each amplicon first and then piece together the consensus sequences from several amplicons.
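As a rough illustration of piecing consensus sequences together, the Python sketch below joins per-amplicon consensus sequences by exact suffix/prefix overlap. It rests on an assumption not stated in this thread: that adjacent amplicons overlap and that the overlap contains a variant distinguishing the haplotypes. The function names and `min_overlap` are hypothetical.

```python
def stitch(left, right, min_overlap=4):
    """Merge two sequences if a suffix of `left` exactly matches a
    prefix of `right` (longest overlap tried first); else return None."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


def stitch_amplicons(left_consensuses, right_consensuses, min_overlap=4):
    """Try every pairing of consensus sequences from two adjacent
    amplicons and keep the pairs whose overlapping ends agree."""
    merged = []
    for left in left_consensuses:
        for right in right_consensuses:
            joined = stitch(left, right, min_overlap)
            if joined is not None:
                merged.append(joined)
    return merged


# Made-up consensus sequences: the overlap (GTTA vs GTCA) carries
# one variant that links each left haplotype to one right haplotype.
amplicon_1 = ["AAACCGTTA", "AAACCGTCA"]
amplicon_2 = ["GTTAGGGA", "GTCAGGGT"]

haplotypes = stitch_amplicons(amplicon_1, amplicon_2)
```

If the overlap carries no distinguishing variant, every pairing would stitch and the linkage between amplicons stays ambiguous, so a result with more sequences than expected haplotypes should be treated as unresolved rather than as extra haplotypes.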

We have some experience working with SMN amplicon data internally. I'd be happy to take a look at your data if you like. Feel free to email me at xchen@pacb.com.