PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License

Most pseudogenes have no read alignment from Paraphase output #26

Open minw2828 opened 3 weeks ago

minw2828 commented 3 weeks ago

Hello,

Thank you for developing the tool. :)

It is stated that:

Paraphase takes all reads from a gene family, realigns to one representative gene of the family and then phases them into haplotypes.

Long reads have trouble aligning to PMS2 pseudogenes other than PMS2CL, yet in the Paraphase outputs no reads align to most PMS2 pseudogenes; only PMS2CL has alignments.

Is this because Paraphase was designed to consider PMS2 and PMS2CL jointly, but not jointly with the other pseudogenes?

Many thanks, Min

xiao-chen-xc commented 2 weeks ago

Hi Min,

Paraphase is designed to consider PMS2 and PMS2CL jointly, because there is no misalignment between PMS2 and other pseudogenes. If you see misalignments to other pseudogenes, could you give an example?

Thanks, Xiao

minw2828 commented 1 week ago

Hi Xiao,

Thank you for responding to me.

May I ask how you identify misalignments please?

Shown below are read alignments to the PMS2P2 and PMS2P5 genes. The alignments were sorted by mapping quality and shaded by mapping quality. In each figure, the top panel is the pbmm2 alignment, and the bottom panel is the Paraphase alignment.

We can see that pbmm2 mapped reads with low mapping qualities to the PMS2P2 and PMS2P5 genes. Reads mapping to the PMS2P2 gene have a lower mapping quality than those mapping to the PMS2P5 gene. Reads mapping downstream of the PMS2P5 gene, especially those spanning the deletion, have a mapping quality of 60.

Would you consider reads with lower mapping qualities to be misalignments? Perhaps those reads could have been realigned to other PMS2 pseudogenes?
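To put numbers on the mapping-quality observation above, one could tally MAPQ values for reads in a region instead of judging them by eye in IGV. The following is a minimal sketch, assuming the alignments are available as SAM text (for example, exported from the pbmm2 BAM). The function name `mapq_counts`, the `low_mapq` cut-off of 10, and the toy records with made-up coordinates are all illustrative assumptions, not part of pbmm2 or Paraphase. It only looks at each read's leftmost position (SAM column 4), not at full overlap with the region.

```python
from collections import Counter

def mapq_counts(sam_lines, chrom, start, end, low_mapq=10):
    """Count low- and high-MAPQ reads whose leftmost position (1-based)
    falls inside chrom:start-end. The low_mapq cut-off is an assumption."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # a SAM record has 11 mandatory fields
            continue
        flag = int(fields[1])
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if flag & 0x4:                    # 0x4 = read is unmapped
            continue
        if rname != chrom or not (start <= pos <= end):
            continue
        counts["low" if mapq < low_mapq else "high"] += 1
    return counts

# Toy SAM records with made-up names and coordinates, for illustration only.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr7\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t0\tchr7\t150\t3\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t16\tchr7\t180\t0\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r4\t0\tchr7\t900\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
]

counts = mapq_counts(toy_sam, "chr7", 100, 200)
print(dict(counts))
```

A large share of low-MAPQ reads in a region suggests the aligner could not place them uniquely, which is the question raised here.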

Screenshot 2024-11-21 at 9 51 45 am

Screenshot 2024-11-21 at 9 52 31 am

Many thanks, Min

xiao-chen-xc commented 1 week ago

Hi Min,

Yes, the low MAPQs reflect that there are mapping issues. This is because those pseudogenes (those named PMS2P#) have high sequence similarity to each other. This is a separate problem from PMS2-PMS2CL, as PMS2-PMS2CL and PMS2P# are very different in sequence.

Paraphase is centered on genes so far, so we haven't included those pseudogene-only families. Are you interested in studying these pseudogenes even when they are not homologous to PMS2?

Thanks, Xiao

minw2828 commented 1 week ago

Hi Xiao,

Could you take a look at Table 2 in this publication please? https://www.mdpi.com/1422-0067/24/2/1398

Many thanks, Min

xiao-chen-xc commented 1 week ago

Hi Min,

A HiFi read is 100 times longer than a short read, so it provides much more information in alignment. Therefore, a region that has alignment problems due to sequence homology in short-read data may not have any alignment problem in long-read data. If you align PMS2 to the entire genome, aside from matches to PMS2CL, the remaining matches are all shorter than 4 kb at a sequence similarity of 91% or lower. These are different enough and short enough that they would not create any alignment issues for HiFi reads.
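The reasoning above (a homologous match matters only if it is both long and similar) can be sketched as a simple filter over self-alignment hits. This is an illustration only: the function `hits_of_concern` is hypothetical, the 4 kb and 91% cut-offs are borrowed from the figures quoted in this comment rather than from any criterion Paraphase is known to use, requiring both conditions at once is an assumption, and the hit names and values are made up.

```python
def hits_of_concern(hits, min_length_bp=4000, min_identity=0.91):
    """Return names of hits that are at least min_length_bp long AND more
    similar than min_identity. Both thresholds are illustrative assumptions."""
    return [
        name
        for name, length_bp, identity in hits
        if length_bp >= min_length_bp and identity > min_identity
    ]

# Made-up hits (name, length in bp, fractional identity), for illustration only.
toy_hits = [
    ("hit_long_similar", 20000, 0.98),
    ("hit_short_divergent", 3500, 0.90),
    ("hit_short_similar", 1200, 0.99),
]

flagged = hits_of_concern(toy_hits)
print(flagged)
```

Under these assumed thresholds, only the long, highly similar hit is flagged; short or divergent matches are treated as unlikely to confuse HiFi-length reads.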

If you do see any misalignment between PMS2 and PMS2P# genes, please share them here and I'd be happy to look into it.

Thanks, Xiao

minw2828 commented 1 week ago

Hi Xiao,

Thank you for developing Paraphase. It is a great tool, and I would love to see HiFi reads applied to more applications. Text is all we have here, though it is not always the best way to communicate, since tone and non-verbal cues are lost. If my words came across as picky or harsh, I apologise; that was not my intention.

The IGV screenshots that I showed 5 days ago are HiFi reads aligning to the PMS2P2 and PMS2P5 genes, using the default settings of pbmm2. PMS2 and its pseudogenes are called challenging medically relevant genes due to their high sequence similarities to each other. In this case, HiFi reads also had difficulty finding their unique mapping location.

I think it would be really helpful if Paraphase could expand its joint consideration from PMS2 and PMS2CL to PMS2 and all its pseudogenes. That's the only reason why I submitted this ticket in the first place.

Many thanks, Min

xiao-chen-xc commented 1 week ago

Hi Min,

If your goal is to call variants in PMS2, you can use the current setup in Paraphase. It’s only necessary to consider PMS2 and PMS2CL. Other pseudogenes (PMS2P#) will not cause alignment problems in PMS2, i.e. no PMS2P# reads will align to PMS2 and no PMS2 reads will align to PMS2P#. This is because the sequence similarity between PMS2 and PMS2P# genes is low enough for HiFi reads to align correctly.

If your goal is to instead call variants in those PMS2P# genes, then we would need to add a new region definition in Paraphase so that PMS2P# genes can be considered together as a group (this group does not include PMS2/PMS2CL). This is because PMS2P# genes are highly similar in sequence to each other and there can be misalignments among PMS2P# genes, as seen in the examples you shared. Again, we do not expect misalignments between PMS2 and PMS2P# genes.
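The grouping described above, where PMS2/PMS2CL form one family and the PMS2P# genes form a separate one, amounts to finding connected components over "highly similar" gene pairs. Below is a minimal sketch using union-find. It is not Paraphase's region-definition format or its implementation; the function `group_families` is hypothetical, and the pair list simply encodes the relationships stated in this thread, with PMS2P2 and PMS2P5 standing in for the PMS2P# genes.

```python
def group_families(genes, similar_pairs):
    """Group genes into families: two genes share a family if they are
    linked, directly or transitively, by a 'highly similar' pair."""
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Walk up to the root, halving the path along the way.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in similar_pairs:
        parent[find(a)] = find(b)         # merge the two families

    families = {}
    for gene in genes:
        families.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in families.values())

# Relationships as described in this thread (illustrative encoding).
genes = ["PMS2", "PMS2CL", "PMS2P2", "PMS2P5"]
similar_pairs = [("PMS2", "PMS2CL"), ("PMS2P2", "PMS2P5")]

families = group_families(genes, similar_pairs)
print(families)
```

Because no pair links PMS2 to a PMS2P# gene, the two families stay separate, which matches the expectation that reads are not misaligned between them.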

Please note that all my statements above are based on HiFi data. It’s a different story for short reads.

Thanks, Xiao