PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License
23 stars 4 forks source link

Understanding the exact workings of paraphase #18

Closed youpze closed 2 months ago

youpze commented 2 months ago

Hi Dr.Chen,

I’m writing this issue because I want to clear up some confusion about the exact working of paraphase.

Firstly, I’m using amplified PacBio Hifi data to analyse the PMS2 region. Even though the input requirements clearly state the use of “whole-genome sequencing or hybrid capture-based enrichment data”, I found in this article in the discussion that amplified data can be used. Can phasing be influenced using amplicon data instead of WGS or hybrid capture data?

Secondly, I wondered how the separation of PMS2 and PMS2CL works. In the code pms2_phaser line 72 to 82, PMS2 and PMS2CL are separated on the length of alternative bases (indicated by "2") at the start of the haplotype. Is my concept correct and why is it like this?

Thirdly, the config.yaml file contains a predefined variable “noisy_regions”.
noisy_region: [[5980880, 5980980](?), [5979663, 5979669](T-stretch), [5975950, 5976080](GG and CC repeats)] I looked at the regions of the human genome (38) in IGV and found that the second region had a T-stretch, the third had some kind of dinucleotide GG repeat, but I couldn’t identify the “noisy” part of the first region. Why is the first region predefined as “noisy” and why are these exact regions predefined?

Finally, I have a question about the function run_without_realign(). The program was taking a rather long time (+- 5-6 hours) to analyse a subset of around 2800 amplified PacBio Hifi reads. Using Cprofiler, I found that this function takes most of the time (+- 97%). To my knowledge this is due to the next 2 for loops located in another for loop to extract the pileup columns:

for read in pileupcolumn.pileups: 
  read_tag = read.alignment.get_tag("HP") 
  this_position_hps.append(read_tag) 
  read_name = read.alignment.query_name 
  new_read_name = read_name 
  if self.use_supplementary and read.alignment.is_supplementary: 
    new_read_name = ( 
    read_name + f"_sup_{read.alignment.reference_start}" 
    ) 
    this_pos_read_names.append(read_name) 
    this_pos_read_names_sup.append(new_read_name) 

for read_num, read_hap in enumerate(this_position_hps): 
  if read_hap == hap_name: 
    pileups_raw.setdefault(pos, []).append(this_pos_bases[read_num]) 
    if self.use_supplementary: 
    read_names.setdefault(pos, []).append( 
    this_pos_read_names_sup[read_num] 
    ) 
  else: 
    read_names.setdefault(pos, []).append( 
    this_pos_read_names[read_num] 

Could this part take longer due the used amplicon data or is there another explanation?

Thanks for your consideration and time.

xiao-chen-xc commented 2 months ago

Hi @youpze Paraphase does not work for amplicon data currently. This is because amplicon data does not require the phasing approach implemented in Paraphase, and it can simply be analyzed with a consensus approach (we have a HiFi amplicon workflow that you can try). The paper mentioned amplicon data because Paraphase could potentially be useful if adapted for tiled amplicons for a long and complicated region like SMN1/SMN2.

PMS2 and PMS2CL are differentiated by their sequences in Exon10-11 region, where the sequence similarity is already very low. The noisy regions mark some regions that may have some alignment errors so that Paraphase does not use those sites during phasing.

The program is taking a long time because it's not designed for amplicon data where the depth is high. Again the consensus approach would be sufficient here, and it would be fast.

youpze commented 2 months ago

@xiao-chen-xc Thanks for your fast response and clear explanation!