phased haplotigs - GitHub Issues

pjm43 commented 1 year ago

I'd like to use Inspector on phased haplotigs (from hifiasm, i.e., hap1.fa and hap2.fa) from a highly heterozygous plant species (the phasing looks good). I'm assuming I would run Inspector against each of the hap.fa assemblies separately. Are there any caveats I should understand in doing this? Since only one haplotig assembly is given, but all of the HiFi reads (from both haplotypes) are given, do I need to worry about Inspector erroneously identifying heterozygosity as an error? Hopefully this makes sense. Maybe I should instead combine both haplotypes into a single file and run it as a single assembly?

Any help would be greatly appreciated,

Jeff Maughan
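
For reference, the two setups described above can be sketched in Python. This is only an illustration under stated assumptions: the file names `hap1.fa`, `hap2.fa` and `reads.fastq.gz`, the output directory names, and the `hap1_`/`hap2_` contig prefixes are all hypothetical, and the `inspector.py` options shown (`-c`, `-r`, `-o`, `--datatype hifi`) follow my reading of the Inspector README, so they should be checked against `inspector.py -h` before use.

```python
from pathlib import Path


def concat_haplotypes(hap_paths, out_path):
    """Write several haplotype FASTA files into one, prefixing contig names.

    The hap1_/hap2_ prefix is an assumed convention that keeps contig names
    unique and records which haplotype each contig came from.
    Returns the number of contigs written.
    """
    n_contigs = 0
    with open(out_path, "w") as out:
        for i, path in enumerate(hap_paths, start=1):
            with open(path) as fasta:
                for line in fasta:
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    if line.startswith(">"):
                        out.write(f">hap{i}_{line[1:]}")
                        n_contigs += 1
                    else:
                        out.write(line)
    return n_contigs


def inspector_command(assembly, reads, outdir, datatype="hifi"):
    """Build (but do not run) an Inspector command line as an argument list."""
    return ["inspector.py", "-c", str(assembly), "-r", str(reads),
            "-o", str(outdir), "--datatype", datatype]


if __name__ == "__main__":
    haps = [Path("hap1.fa"), Path("hap2.fa")]
    reads = Path("reads.fastq.gz")  # assumed name for the HiFi reads

    # Setup 1: each haplotype evaluated on its own, with all reads.
    for hap in haps:
        print(" ".join(inspector_command(hap, reads, f"inspector_{hap.stem}")))

    # Setup 2: both haplotypes combined into one assembly, with all reads.
    if all(h.exists() for h in haps):
        concat_haplotypes(haps, "hap1_hap2.fa")
    print(" ".join(inspector_command("hap1_hap2.fa", reads, "inspector_hap1_hap2")))
```

The script only prints the commands; running them is left to the user so the options can be verified first.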

Maggi-Chen commented 1 year ago

Hello Jeff,

By design, Inspector is able to distinguish true assembly errors from heterozygous regions based on the ratio of error-supporting reads, so it is okay to use all reads from both haplotypes to evaluate phased assemblies. Combining the two haplotypes into a single assembly is an alternative, and it will in fact be more accurate because reads can be assigned to the correct haplotype. However, you will get only one quality report for the merged assembly instead of one for each phased assembly. I would suggest running Inspector with all reads on the two phased assemblies separately. Thanks!

Best, Maggi
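
To make the "ratio of error-supporting reads" idea concrete, here is a toy sketch in Python. It is not Inspector's actual code: the function name `classify_site` and the 0.2 and 0.8 cutoffs are invented for illustration. The intuition is that when reads from both haplotypes are aligned to one haplotype assembly, a heterozygous site should be contradicted by roughly half of the reads, while a true assembly error should be contradicted by nearly all of them.

```python
def classify_site(error_reads, total_reads, low=0.2, high=0.8):
    """Label a candidate site by the fraction of reads that disagree with the assembly.

    error_reads: reads supporting a difference from the assembled sequence
    total_reads: all reads covering the site
    low, high:   hypothetical cutoffs, NOT Inspector's real thresholds
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    ratio = error_reads / total_reads
    if ratio >= high:
        return "assembly error"   # almost every read disagrees with the assembly
    if ratio >= low:
        return "heterozygous"     # about one haplotype's worth of reads disagrees
    return "noise"                # too few reads to support any call


if __name__ == "__main__":
    # Example counts at ~14x depth, similar to the depth reported in this thread.
    for error_reads in (14, 7, 1):
        print(error_reads, classify_site(error_reads, 14))
```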

pjm43 commented 1 year ago

Hi Maggi,

Thanks for the feedback. I wanted to share the results of running Inspector on just one of the haplotypes compared with running it on both haplotypes together (hap1 & hap2):

Just one of the haplotypes (hap2):

Statics of contigs:
Number of contigs       2274
Number of contigs > 10000 bp    2274
Number of contigs >1000000 bp   549
Total length    9586505548
Total length of contigs > 10000 bp      9586505548
Total length of contigs >1000000bp      9410063142
Longest contig  164128040
Second longest contig length    134502060
N50     37417219
N50 of contigs >1Mbp    37417219

Read to Contig alignment:
Mapping rate /% 99.96
Split-read rate /%      17.25
Depth   14.1365
Mapping rate in large contigs /%        97.73
Split-read rate in large contigs /%     17.32
Depth in large conigs   14.0832

Structural error        1700
Expansion       842
Collapse        585
Haplotype switch        265
Inversion       8

Small-scale assembly error /per Mbp     246.691350478
Total small-scale assembly error        2364908
Base substitution       2272362
Small-scale expansion   45447
Small-scale collapse    47099
QV      33.7263739746

When I concatenated them together (Hap1 + Hap2):

Statics of contigs:
Number of contigs       5733
Number of contigs > 10000 bp    5732
Number of contigs >1000000 bp   1195
Total length    19003011697
Total length of contigs > 10000 bp      19003002123
Total length of contigs >1000000bp      18586561184
Longest contig  283643808
Second longest contig length    175008206
N50     31402244
N50 of contigs >1Mbp    31402244

Read to Contig alignment:
Mapping rate /% 99.98
Split-read rate /%      0.42
Depth   13.6995
Mapping rate in large contigs /%        97.84
Split-read rate in large contigs /%     0.41
Depth in large conigs   13.7106

Structural error        18
Expansion       10
Collapse        8
Haplotype switch        0
Inversion       0

Small-scale assembly error /per Mbp     0.697047756679
Total small-scale assembly error        13246
Base substitution       8179
Small-scale expansion   2392
Small-scale collapse    2675
QV      60.9174897428

I'm not sure why there is such a dramatic difference between the two approaches in terms of structural errors and small-scale errors. Any ideas? Hifiasm produced two equally and correctly sized haplotype assemblies, which suggested that the primary draft assembly (i.e., the one we are working with here) was good. We are now producing Hi-C data, which should hopefully let us scaffold the genome to chromosome scale.

Thanks in advance for any insight,

Jeff
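
As a sanity check on the two reports, the per-Mbp error rates can be recomputed from the counts in the summaries. The short Python sketch below uses only numbers copied from the two runs; the dictionary layout and function names are my own. Dividing by the total length of contigs > 10000 bp appears to reproduce the reported "Small-scale assembly error /per Mbp" values, which is an observation from these two reports rather than a statement about Inspector's internals.

```python
# Values copied from the two summaries in this thread.
reports = {
    "hap2": {
        "length_gt_10kb": 9586505548,
        "small_scale_errors": 2364908,
        "structural_errors": 1700,
        "split_read_rate_pct": 17.25,
    },
    "hap1+hap2": {
        "length_gt_10kb": 19003002123,
        "small_scale_errors": 13246,
        "structural_errors": 18,
        "split_read_rate_pct": 0.42,
    },
}


def errors_per_mbp(n_errors, length_bp):
    """Number of errors per million assembled bases."""
    return n_errors / (length_bp / 1e6)


def fold_change(single, combined):
    """How many times larger the single-haplotype value is than the combined one."""
    return single / combined


if __name__ == "__main__":
    for name, r in reports.items():
        rate = errors_per_mbp(r["small_scale_errors"], r["length_gt_10kb"])
        print(f"{name}: {rate:.4f} small-scale errors per Mbp")

    single, combined = reports["hap2"], reports["hap1+hap2"]
    for key in ("small_scale_errors", "structural_errors", "split_read_rate_pct"):
        print(f"{key}: {fold_change(single[key], combined[key]):.1f}x higher for hap2 alone")
```

The drop in split-read rate (17.25% to 0.42%) moves in the same direction as the drop in reported errors, which is consistent with reads aligning more cleanly when both haplotypes are present.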

Maggi-Chen / Inspector

phased haplotigs #18