Open pjm43 opened 1 year ago
Hello Jeff,
By design, Inspector distinguishes true assembly errors from heterozygous regions based on the ratio of error-supporting reads, so it is fine to use all reads from both haplotypes when evaluating phased assemblies. Combining the two haplotypes into a single assembly is an alternative and will in fact be more accurate, since reads can be assigned to the correct haplotype. However, you will then get only one quality report for the merged assembly rather than one per phased assembly. I would suggest running Inspector with all reads on the two phased assemblies separately. Thanks!
Best, Maggi
Hi Maggi,
Thanks for the feedback. I wanted to share the results of running Inspector on just one of the haplotypes compared with running it on both haplotypes concatenated (hap1 + hap2):
Just one of the haplotypes (hap2):
Statics of contigs:
Number of contigs 2274
Number of contigs > 10000 bp 2274
Number of contigs >1000000 bp 549
Total length 9586505548
Total length of contigs > 10000 bp 9586505548
Total length of contigs >1000000bp 9410063142
Longest contig 164128040
Second longest contig length 134502060
N50 37417219
N50 of contigs >1Mbp 37417219
Read to Contig alignment:
Mapping rate /% 99.96
Split-read rate /% 17.25
Depth 14.1365
Mapping rate in large contigs /% 97.73
Split-read rate in large contigs /% 17.32
Depth in large conigs 14.0832
Structural error 1700
Expansion 842
Collapse 585
Haplotype switch 265
Inversion 8
Small-scale assembly error /per Mbp 246.691350478
Total small-scale assembly error 2364908
Base substitution 2272362
Small-scale expansion 45447
Small-scale collapse 47099
QV 33.7263739746
When I concatenated them together (Hap1 + Hap2):
Statics of contigs:
Number of contigs 5733
Number of contigs > 10000 bp 5732
Number of contigs >1000000 bp 1195
Total length 19003011697
Total length of contigs > 10000 bp 19003002123
Total length of contigs >1000000bp 18586561184
Longest contig 283643808
Second longest contig length 175008206
N50 31402244
N50 of contigs >1Mbp 31402244
Read to Contig alignment:
Mapping rate /% 99.98
Split-read rate /% 0.42
Depth 13.6995
Mapping rate in large contigs /% 97.84
Split-read rate in large contigs /% 0.41
Depth in large conigs 13.7106
Structural error 18
Expansion 10
Collapse 8
Haplotype switch 0
Inversion 0
Small-scale assembly error /per Mbp 0.697047756679
Total small-scale assembly error 13246
Base substitution 8179
Small-scale expansion 2392
Small-scale collapse 2675
QV 60.9174897428
I'm not sure why there is such a dramatic difference between the two approaches in terms of structural and small-scale errors. Any ideas? Hifiasm produced two haplotype assemblies of similar, expected size, which suggests that the primary draft assembly (the one we are working with here) is good. We are now producing Hi-C data, which should hopefully let us scaffold the genome to chromosome scale.
Thanks in advance for any insight,
Jeff
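For anyone comparing the two reports above, here is a minimal Python sketch that re-derives the "Small-scale assembly error /per Mbp" values from the raw counts. Both reported values are reproduced when the error count is divided by the total length of contigs > 10000 bp. The Phred-style value at the end is my own illustration computed from small-scale errors only. It is not Inspector's QV definition, and it comes out higher than the reported QV, presumably because Inspector also counts bases in structural errors (an assumption on my part). The function and dictionary names are hypothetical.

```python
import math

# Counts copied from the two Inspector summaries in this thread.
reports = {
    "hap2 only": {
        "length_gt_10kb": 9586505548,   # Total length of contigs > 10000 bp
        "small_errors": 2364908,        # Total small-scale assembly error
    },
    "hap1 + hap2": {
        "length_gt_10kb": 19003002123,
        "small_errors": 13246,
    },
}


def errors_per_mbp(small_errors, length_bp):
    """Small-scale errors per megabase of assembly."""
    return small_errors / (length_bp / 1e6)


def phred_small_errors_only(small_errors, length_bp):
    """Phred-scaled error rate from small-scale errors only.

    Assumption: this is NOT Inspector's QV formula, just an
    upper-bound style estimate that ignores structural errors.
    """
    return -10 * math.log10(small_errors / length_bp)


for name, r in reports.items():
    rate = errors_per_mbp(r["small_errors"], r["length_gt_10kb"])
    phred = phred_small_errors_only(r["small_errors"], r["length_gt_10kb"])
    print(f"{name}: {rate:.6f} errors/Mbp, small-error-only Phred {phred:.2f}")
```

Running this makes it easy to see how much of the gap between the two QV values comes from small-scale errors alone.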
I'd like to use Inspector on phased haplotigs (from hifiasm; i.e., hap1.fa, hap2.fa) from a highly heterozygous plant species (the phasing looks good). I'm assuming I would run Inspector against each of the hap.fa assemblies. Are there any caveats that I should understand in doing this - i.e., since only one haplotig assembly is given, but all of the HiFi reads are given (from both haplotypes), do I need to worry about Inspector erroneously identifying heterozygosity as an error? Hopefully this makes sense. Maybe what I should do is combine both haplotypes into a single assembly and run it as a single assembly?
Any help would be greatly appreciated,
Jeff Maughan
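For the combined-assembly option mentioned above, here is a minimal sketch of concatenating the two haplotype FASTA files into one. It prefixes each contig name with a haplotype tag so that names stay unique and each contig can be traced back to its haplotype in the merged report. The function name, the tag format, and the output file name are hypothetical, and the prefixing is my own precaution rather than something Inspector requires.

```python
def merge_haplotypes(haplotype_fastas, out_path):
    """Concatenate FASTA files, prefixing contig names with a haplotype tag.

    haplotype_fastas: dict mapping a tag (e.g. "hap1") to a FASTA path.
    Returns the number of contigs written.
    """
    n_contigs = 0
    with open(out_path, "w") as out:
        for tag, fasta in haplotype_fastas.items():
            with open(fasta) as handle:
                for line in handle:
                    # Guard against a missing trailing newline at end of file,
                    # which would otherwise glue the next header onto sequence.
                    if not line.endswith("\n"):
                        line += "\n"
                    if line.startswith(">"):
                        out.write(f">{tag}_{line[1:]}")
                        n_contigs += 1
                    else:
                        out.write(line)
    return n_contigs


# Example call (file names taken from the post above; output name is made up):
# merge_haplotypes({"hap1": "hap1.fa", "hap2": "hap2.fa"}, "hap1_hap2.fa")
```

The merged file can then be passed to Inspector as a single assembly, with the caveat already noted in this thread that only one quality report is produced for both haplotypes.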