ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

How to keep large AMBIGUOUS contig in pangenome graph #1389

Open Han-Cao opened 1 month ago

Han-Cao commented 1 month ago

Hi,

After running cactus-graphmap-split, I found that a few samples have a very large _AMBIGUOUS_ contig. I manually checked some of these samples and found that the cause is a large contig that maps to two chromosomes.

For example, one sample has a 94 Mb contig mapped to chr9 (56 Mb) and chr11 (38 Mb):

Query contig is ambiguous: id=GA22003.1|h1tg000012l  len=94724277 cov=0.594907 (vs 0.25) uf=1.47363 (vs 2)
 Reference contig mappings:
  chr9: 56352107
  chr11: 38240300

After the split, one assembly of this sample (GA22003.1) lacks a large proportion of chr9 and chr11:

Contig      _AMBIGUOUS_  chr11      chr9
CHM13       0            135127769  150617247
GA22003.1   128307255    94200234   66460248
GA22003.2   24030743     138503865  128887431

According to the cactus-graphmap-split config, I can keep this contig by adjusting the uf threshold. However, that would affect other contigs as well, and it still could not keep the sequence for both chromosomes.
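To make the numbers in the log above concrete, here is a minimal sketch that reproduces them. It assumes that `cov` is the best reference mapping divided by the contig length and that `uf` is the ratio of the best to the second-best mapping. These formulas are inferred from the logged values only, not taken from the Cactus source, and the function name `classify` and the direction of the `cov` comparison are hypothetical.

```python
def classify(contig_len, mappings, min_cov=0.25, min_uf=2.0):
    """Return (cov, uf, ambiguous) for one query contig.

    Assumption (inferred from the log, not from Cactus itself):
      cov = best mapping / contig length
      uf  = best mapping / second-best mapping
    The defaults 0.25 and 2 are the "(vs ...)" values printed in the log.
    """
    ranked = sorted(mappings.values(), reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    cov = best / contig_len
    uf = best / second if second else float("inf")
    ambiguous = cov < min_cov or uf < min_uf
    return cov, uf, ambiguous


# Values copied from the log for GA22003.1|h1tg000012l
cov, uf, ambiguous = classify(94724277, {"chr9": 56352107, "chr11": 38240300})
print(f"cov={cov:.6f} uf={uf:.5f} ambiguous={ambiguous}")
```

Under these assumptions the contig passes the `cov` check and fails only on `uf`. Lowering the `uf` threshold below about 1.47 would assign the whole contig to chr9, which is why it cannot keep the sequence for both chromosomes.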

So, my questions are:

  1. For such a large contig mapped to two chromosomes, is it more likely to be an interchromosomal event or an assembly/sequencing error? This human genome assembly was generated from 30X PacBio HiFi-only data using hifiasm.
  2. If it is caused by an interchromosomal event, can I manually split the large contig into two contigs before running the pangenome pipeline? My plan is to first align the individual assemblies to CHM13 and manually split large contigs (e.g., >10 Mb) that contain a misjoin of two chromosomes (e.g., as reported by paftools.js misjoin) into two contigs, then run the MC pipeline on the processed assemblies. Will this process introduce new issues in the final pangenome graph?
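As a rough illustration of the screening step described in question 2, the sketch below scans PAF alignments of an assembly against CHM13 and reports large contigs with substantial alignment to more than one chromosome. It relies only on the standard PAF column layout. The function name and all thresholds are hypothetical, and it is not a replacement for `paftools.js misjoin`, only a quick way to shortlist contigs for manual inspection.

```python
from collections import defaultdict


def find_multi_chrom_contigs(paf_lines, min_len=10_000_000,
                             min_bases=5_000_000, min_mapq=0):
    """Return {query: {target: aligned_query_bases}} for suspicious contigs.

    min_len, min_bases and min_mapq are illustrative thresholds only.
    Aligned bases are summed as (query end - query start) per target,
    so overlapping alignments may be counted more than once.
    """
    per_target = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        tname, mapq = fields[5], int(fields[11])
        if qlen < min_len or mapq < min_mapq:
            continue
        per_target[qname][tname] += qend - qstart

    flagged = {}
    for qname, targets in per_target.items():
        large = {t: n for t, n in targets.items() if n >= min_bases}
        if len(large) >= 2:
            flagged[qname] = large
    return flagged
```

Called on the lines of a PAF file (for example `find_multi_chrom_contigs(open("sample.paf"))`, where the file name is a placeholder), it returns the per-chromosome aligned bases for each flagged contig. The breakpoint itself still has to be located from the alignments.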

Thank you!

glennhickey commented 1 month ago

I can't say for sure from here, but I would suspect a misjoin in your assembly. We ran into this issue in the original HPRC paper for HG02080#1#JAHEOW010000073.1 and I manually split it as described here.

Whether it's an artifact or somehow a real event, your best bet is probably to split it. The alternative is to use --noSplit and just let Cactus align all the different chromosomes together. The problem with doing this is that you're potentially letting data artifacts add a lot of complexity to your graph.
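For the splitting option, a minimal sketch of cutting one contig at a known breakpoint might look like the following. This is not the procedure from the linked instructions, which are not reproduced in this thread. The function name and the `_part1`/`_part2` suffixes are hypothetical, and the breakpoint (0-based offset into the contig) is assumed to have been chosen by inspecting the alignments.

```python
def split_contig(records, target, breakpoint):
    """Yield (name, sequence) pairs, splitting `target` at `breakpoint`.

    records    : iterable of (name, sequence) pairs from a FASTA file
    target     : name of the contig to split
    breakpoint : 0-based offset; bases before it go to the first part
    The "_part1"/"_part2" suffixes are an illustrative naming choice.
    """
    for name, seq in records:
        if name != target:
            yield name, seq
            continue
        if not 0 < breakpoint < len(seq):
            raise ValueError(f"breakpoint {breakpoint} outside contig {name}")
        yield f"{name}_part1", seq[:breakpoint]
        yield f"{name}_part2", seq[breakpoint:]


# Toy example: split "ctgA" after its first 4 bases
toy = [("ctgA", "ACGTTTGA"), ("ctgB", "GGCC")]
for name, seq in split_contig(toy, "ctgA", 4):
    print(f">{name}\n{seq}")
```

All other contigs pass through unchanged, so the output can be written back out as the processed assembly before rerunning the pipeline.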

Han-Cao commented 1 month ago

Thank you very much! I will follow the tutorial to manually split them.

Also, according to Figure 1B of the HPRC paper, there are many interchromosomal joins. May I ask why you decided to manually split only the HG02080#1#JAHEOW010000073.1 contig?

glennhickey commented 1 month ago

This was the only one whose alignment we were confident enough in. The others were all in very repetitive regions such as centromeres and rDNA arrays.