KolmogorovLab / hapdup

Pipeline to convert a haploid assembly into diploid

Contigs without assigned phaseblocks in their name #38

Open sivico26 opened 10 months ago

sivico26 commented 10 months ago

Hi @fenderglass,

Thanks for developing Hapdup. I am trying to phase some loci of an allopolyploid plant into what should be the 2 subgenomes of its parents. After checking the output, I have some questions. I will use one of the assemblies as an example.

For one of my loci, if I look into the hapdup_phased_* assemblies, I see the following contig names and lengths for hap1:

contig_10_phaseblock_0  33510
contig_12       5153
contig_14_phaseblock_0  11146
contig_16_phaseblock_0  9374
contig_20       34460
contig_23_phaseblock_0  35265
contig_7_phaseblock_0   8359
contig_7_phaseblock_1   66335

While for hap2 it is:

contig_10_phaseblock_0  31793
contig_12       5153
contig_14_phaseblock_0  9818
contig_16       13178
contig_20_phaseblock_0  38831
contig_23_phaseblock_0  36404
contig_7_phaseblock_0   7893
contig_7_phaseblock_1   66913
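
For reference, the two listings can also be compared mechanically. The following is a minimal sketch and not part of hapdup: it assumes the names follow the `contig_N` / `contig_N_phaseblock_M` pattern seen above, and `parse_listing` and `classify` are hypothetical helper names chosen for illustration.

```python
import re
from collections import defaultdict

# Contig names and lengths copied from the two listings above.
HAP1 = """\
contig_10_phaseblock_0  33510
contig_12       5153
contig_14_phaseblock_0  11146
contig_16_phaseblock_0  9374
contig_20       34460
contig_23_phaseblock_0  35265
contig_7_phaseblock_0   8359
contig_7_phaseblock_1   66335
"""

HAP2 = """\
contig_10_phaseblock_0  31793
contig_12       5153
contig_14_phaseblock_0  9818
contig_16       13178
contig_20_phaseblock_0  38831
contig_23_phaseblock_0  36404
contig_7_phaseblock_0   7893
contig_7_phaseblock_1   66913
"""

# Assumed naming pattern: "contig_N" with an optional "_phaseblock_M" suffix.
NAME_RE = re.compile(r"(contig_\d+)(?:_phaseblock_(\d+))?")


def parse_listing(text):
    """Map base contig name -> list of (phaseblock index or None, length)."""
    contigs = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length = line.split()
        match = NAME_RE.fullmatch(name)
        if match is None:
            raise ValueError(f"unexpected contig name: {name}")
        base, block = match.group(1), match.group(2)
        contigs[base].append((None if block is None else int(block), int(length)))
    return dict(contigs)


def classify(hap1, hap2):
    """Label each base contig by where a phaseblock suffix appears."""
    labels = {}
    bases = sorted(set(hap1) | set(hap2), key=lambda n: int(n.split("_")[1]))
    for base in bases:
        in1 = any(block is not None for block, _ in hap1.get(base, []))
        in2 = any(block is not None for block, _ in hap2.get(base, []))
        if in1 and in2:
            labels[base] = "suffix_in_both"
        elif in1 or in2:
            labels[base] = "suffix_in_one"
        else:
            labels[base] = "suffix_in_neither"
    return labels


if __name__ == "__main__":
    for base, label in classify(parse_listing(HAP1), parse_listing(HAP2)).items():
        print(base, label)
```

On the listings above, this labels contigs 7, 10, 14, and 23 as `suffix_in_both`, contig_12 as `suffix_in_neither`, and contig_16 and contig_20 as `suffix_in_one`. The labels describe the names only; they say nothing about why hapdup named the contigs that way.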

As you can see, most of the contigs have their homolog in both haplotypes (contigs 7, 10, 14, and 23). But there are two other categories that confuse me:

mikolmogorov commented 10 months ago

Hi,

That's unexpected. I think it probably represents an error in hapdup rather than something meaningful. It is likely some kind of edge case where a phasing block boundary is very close to a contig end, but the coordinates are shifted slightly between haplotypes. As a result, hapdup split contig_16 in HP1, but not in HP2.

In dual assembly mode this should not happen, but for the phasing mode I'll try to fix it in a future release.

sivico26 commented 10 months ago

All right, if you need some data to debug this, let me know.

I am wondering: if the assembly has some redundancy, do you think that could cause or facilitate this problem? I am working with Flye assemblies, but I have not checked whether they contain any redundancy.

mikolmogorov commented 10 months ago

How big is your dataset? If you could send it somehow, that would be helpful! Feel free to email mikolmogorov@gmail.com

I don't think this is specific to the genome, just a borderline case.