High duplication from BUSCO using x15 cov

rob123king commented 4 years ago

Hi,

Very nice tool as easy to run and quick which can be a major headache sometimes. I posted below and just realised I put both smrt cells fasta and fastq in so duplicated the data. I'll try that again and hopefully this helps.

Results look significantly better than HiCanu. I have 10,000 contigs. My problem is the longest contig I have is 3.5Mbp and the BUSCO has high duplication which corresponds to the genome size being 80% too large too. I think I only have x15 coverage, I have requested more data and might be able to push it to 20x but will have to wait. Is there a parameter that I could tweak to get this duplication down rather than use additional tool to do that. It is early days, and I need to play around more and read more but in case I am missing anything obvious. C:99.9%[S:19.7%,D:80.2%],F:0.1%,M:-0.0%,n:1658

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      24233 sequences, total length 2033386837 bp (including 439 repeats of total length 7159438 bp).
--   bubbles:      6428 sequences, total length 192094085 bp.
--   unassembled:  94184 sequences, total length 1000046846 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      968422           104   130308813
--     20      660064           270   260179731
--     30      498917           500   390143923
--     40      405200           791   520051788
--     50      328828          1148   650274462
--     60      275039          1581   780008382
--     70      228439          2100   910187095
--     80      188891          2723  1040077841
--     90      154826          3482  1170033072
--    100      126343          4413  1300104055
--    110      100700          5565  1430064823
--    120       75672          7052  1560023035
--    130       53416          9090  1690051034
--    140       31413         12252  1820020140
--    150       17660         17857  1950017255
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      24233 sequences, total length 3030398075 bp (including 439 repeats of total length 10703713 bp).
--   bubbles:      6428 sequences, total length 286330754 bp.
--   unassembled:  94184 sequences, total length 1482274716 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1636484            63   131210376
--     20     1228508           155   260856458
--     30      973050           274   390584764
--     40      800740           422   520250313
--     50      677931           599   650558423
--     60      598845           802   780360444
--     70      520751          1034   910162938
--     80      454585          1302  1040427748
--     90      406316          1604  1170167952
--    100      359785          1944  1300091512
--    110      317659          2330  1430305495
--    120      277794          2766  1560173378
--    130      244161          3264  1690067561
--    140      213205          3835  1820172826
--    150      185735          4490  1950080750
--    160      158965          5246  2080102254
--    170      133649          6135  2210046641
--    180      109905          7205  2340006684
--    190       87483          8527  2470045576
--    200       65031         10236  2600029303
--    210       44371         12664  2730012789
--    220       30249         16250  2860018091
--    230       18978         21591  2990006743
--

lh3 commented 4 years ago

Is this a polyploid plant genome? What file are you looking at? Is it *.p_ctg.gfa?

Note that BUSCO is not always reliable. Edger et al observed that for strawberry, BUSCO falsely regards ~90% of genes as duplicates. It may be possible that hifiasm is resolving allopolyploid haplotypes.

rob123king commented 4 years ago

Its a diploid beetle insect. I have 10X genomic illumina data for it so estimated that should be 1.3gbp and not 2gbp. I have converted *.p_ctg.gfa to fa file using gfatools. I reran without the duplicated files and made little difference C:99.8%[S:30.6%,D:69.2%],F:0.2%,M:0.0%,n:1658, so still seeing the additional BUSCO. I'll do some overlaps of the duplicate BUSCOs and see if the contigs are present in duplicate. The job completed in 25 minutes which seems quite quick, maybe the purge step didn't work, although I don't see any errors.

rob123king commented 4 years ago

Seems like BUSCO are there and very heterozygous haplotype contigs, maybe that's why not purged as so different?

lh3 commented 4 years ago

I don't know then. @chhylp123 can chime in. You should also use the more recent v0.9 which improves purging. You may also try the standalone purge_dups. It sometimes purges more.

chhylp123 commented 4 years ago

Yean, I recommend you to try our latest version (v0.10), which can avoid mis-assemblies. There are two ways:

Run hifiasm v0.10 with the option '--high-het'. Please note that this option will increase assembly time.

Run hifiasm v0.10 without the option '--high-het', and postprocess the p_ctg with Sanger's Purge_dups: https://academic.oup.com/bioinformatics/article/36/9/2896/5714742.

We are also curious what's difference between these two ways. We are still improving '--high-het'.

rob123king commented 4 years ago

Moving from version 9 to 10 made some improvement. Using highhet setting reduced the duplicated busco only marginally and a much more fragmented genome so the trade-off is not worth using the setting because you still have to purge the duplicates and the genome is more fragmented. I used two datasets from two species and had a similar result. Generally I am a bit disappointed with HiFi vs CLR data which I used to get much better continuous assemblies and from what others are showing they seem to get amazing results, although i'm not sure if that is low repeat and small genomes species with high coverage. I'm not sure if that's a limited coverage issue or still more settings to run but early days.

Moth without highhet C:97.4%[S:84.9%,D:12.5%],F:0.5%,M:2.1%,n:1658 1,438 scaffolds

Moth with highhet C:95.1%[S:84.1%,D:11.0%],F:0.7%,M:4.2%,n:1658 1,805 scaffolds

beetle without highhet C:99.8%[S:30.7%,D:69.1%],F:0.2%,M:0.0%,n:1658 9,887 scaffolds

beetle with highhet C:99.9%[S:32.7%,D:67.2%],F:0.1%,M:-0.0%,n:1658 20,624 scaffolds

chhylp123 commented 4 years ago

Thanks a lot. The results looks interesting. May I ask what's the N50 of CLR assemblies and HiFi assemblies? And what's the het rate of these two samples?

rob123king commented 4 years ago

I'll need a week or two to do something a bit more comprehensive and get back to you, so if I'm quiet that will be why. But I will post when I've run the het rate and worked a few assemblies. It's normally when progressed things further that it becomes clearer.

Early data.. kmer=31 for the beetle puts het rate at 1.9 although trying a lower kmer and adjusting settings, using the HiFi data. Changing settings made little difference. Adding a bit more data from a 3rd smrt cell at half loading gave me a better assembly at 6565 scaffolds so I think partially it's a coverage issue and finding settings and tools that give the best results. I have all the data I can get now from one individual so I just have to make the best of it now. coverage is likely x20ish. I also now have a 16Mbp contig versus longest at 3.5Mbp so big improvement with some more coverage for that. BUSCO still has high duplication. C:99.9%[S:27.8%,D:72.1%],F:0.0%,M:0.1%,n:1658

rob123king commented 4 years ago

I've had a bit more of a look of my best coverage data for one species and I can see one or two issues. Something I observed with masurca before, was assembled of large parts 500kbp-3mbp duplicated to different contigs usually showing one copy within a sequence and the other tagged on the end of another. This is not haplotype but assembly error. I usually cut off the duplicate sequence at the with the contig that has it at the end. This will explain in part my high busco duplication at least in part. Also I have 100+ small duplicated contig sequences representing the mitochondria just with different copies and from different start positions. I think there is an issue with repeats that needs looking at. I don't see this typically with Flye and Falcon but common to masurca too.

chhylp123 commented 4 years ago

Thanks a lot. May I ask how do you determine " showing one copy within a sequence and the other tagged on the end of another"?

lh3 commented 4 years ago

Just to make sure: the sample comes from one beetle. Is that right? At this point, you may consider to run the standalone purge_dups on p_ctg+a_ctg. Purge_dups can be more aggressive, which might be what you need here. I guess part of the problem is due to fragmented assembly (the shortest N50 we have seen is 5.4Mb for redwood, but yours has ~300kb). Some hifiasm heuristics might have failed.

I don't see this typically with Flye and Falcon but common to masurca too.

Flye and Falcon are designed for noisy reads. They don't intend to separate haplotypes and rely less on purge_dups. Nonetheless, HiCanu and hifiasm tend to produce better assembly overall for hifi data.

chhylp123 commented 4 years ago

I assume you have ~15x coverage, which might be a little bit low. In some cases, one haplotype has enough coverage while another doesn't have. We observed similar examples on a milkfish dataset. HiFi-specific assemblers will separate two haplotypes so the the haplotype without enough coverage is broken. Non-HiFi-specific assemblers like Falcon or Flye (I don't know how does current Flye work for HiFi) will collapse two haplotypes so if there is no enough coverage for only one haplotype, these assemblers can go through it. However, this is not what we want. This is what I guess, maybe I was wrong. You can wait more data to see the new result.

rob123king commented 4 years ago

I have 27x coverage or thereabouts but I'll check that again. I do lastz alignments in geneious which gives a nice visualisation of the sequence and works the same on the command line but you I would then look at a tab file. It's one beetle. I only have this amount of coverage from one insect as only 500ng that sequencer could get and 1x smrt cell. Genome is around 450-500Mbp size.

rob123king commented 4 years ago

I do have a high BUSCO of 72% using the same dataset with Flye so maybe just a HiFi problem rather than a software one. I assume its the usual, for an easy life you need 60x coverage or higher. I'm used to CLR and don't see high BUSCO duplication.

chhylp123 commented 4 years ago

I don't know either. Thanks for pointing out problems. If you can release the data in the future, we will have a look. Overcoming these challenges is important no matter they are caused by HiFi data itself or hifiasm.

chhylp123 / hifiasm

High duplication from BUSCO using x15 cov #22