PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
62 stars 5 forks source link

A pipeline for known parental haplotypes? #52

Closed gokalpyildirir closed 1 year ago

gokalpyildirir commented 1 year ago

Hello,

I'm currently working on a project involving a heterokaryotic fungus that contains two known parental haplotypes within the same strain. I'm working on studying the differential methylation frequencies on these haplotypes using HiFi reads with kinetic information. I'm seeking suggestions for a pipeline that would be suitable for this analysis.

Here's an overview of my current pipeline:

  1. We possess HiFi reads with kinetic information for our strains, and I've utilized Jasmine for methylation calling on these reads.
  2. To study the respective methylations, I need to separate the reads based on their origins. To accomplish this, I aligned the reads to BOTH haplotypes using pbmm2 with the --preset HiFi parameter. I verified the mapping using IGV and observed that the reads are correctly mapped to each haplotype with minimal SNPs, and the coverage is consistent throughout the assemblies. At this stage, the reads are successfully separated between the two haplotypes.
  3. The final step involves utilizing pb-CpG-tools to calculate the methylation frequencies from the resulting BAM files. Using the latest model (pileup_calling_model.v1.tflite), I generated the .bed and .bw files.

However, I have encountered an issue during the mapping step. Upon examining IGV, I noticed that some alignments are displayed as white reads (MAPQ=0). These reads map to multiple locations with same scores, making it impossible to determine their actual origin. These regions typically consist of highly repetitive regions or areas with shared genes between the haplotypes. As a result, pb-CpG-tools disregards these reads and does not calculate the final methylation frequency for these regions. This poses a challenge for further analysis, as our primary goal is to compare the two haplotypes, and the majority of methylation events are observed in these repetitive elements. I need to obtain information on these repetitive regions.

I would greatly appreciate your feedback on my pipeline and any suggestions for modifications. Any assistance you can provide would be highly valuable.

Thank you for your support!

Best, Goko

ctsa commented 1 year ago

Hi Goko, You should be able to add the option --min-mapq 0 to work around this.

gokalpyildirir commented 1 year ago

Thank you for the suggestion @ctsa ! By doing that, I'm afraid I will introduce false positives. I imagine most of the MAPQ=0 reads are coming from identical regions between haplotypes since I am mapping the reads onto both haplotypes at the same time. If I force pb-cpg-tools to analyze these reads, it would produce the methylation frequencies but I still couldn't tell if these methylations belong to one haplotype or the other. That is why I think I need to modify my pipeline, maybe starting from the mapping step?

ctsa commented 1 year ago

Hi @gokalpyildirir , Thanks for explaining this in more detail. If the reads will not uniquely map to one of the two haplotypes, it would seem that you won't be able to use them for this analysis. Perhaps within the ambiguous regions you could also find regions with bimodal methylation that clustered into two methylation haplotypes, and attempt to assign these to the corresponding parental haplotypes.

gokalpyildirir commented 1 year ago

Thanks for the input! This is almost like trying to differentiate two faces by only focusing on the similar features... I will now try to map my reads onto one haplotype only, go through the haplotagging process and see what happens. Take care!