Closed rob123king closed 3 years ago
Is this a polyploid plant genome? What file are you looking at? Is it *.p_ctg.gfa?
Note that BUSCO is not always reliable. Edger et al observed that for strawberry, BUSCO falsely regards ~90% of genes as duplicates. It may be possible that hifiasm is resolving allopolyploid haplotypes.
It's a diploid beetle. I have 10x Genomics Illumina data for it, from which I estimated the genome should be 1.3 Gbp and not 2 Gbp. I converted *.p_ctg.gfa to a FASTA file using gfatools. I reran without the duplicated files and it made little difference: C:99.8%[S:30.6%,D:69.2%],F:0.2%,M:0.0%,n:1658, so I'm still seeing the additional BUSCOs. I'll do some overlaps of the duplicate BUSCOs and see if the contigs are present in duplicate. The job completed in 25 minutes, which seems quite quick; maybe the purge step didn't work, although I don't see any errors.
It seems the BUSCOs are really there, on very heterozygous haplotype contigs. Maybe that's why they were not purged, because they are so different?
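For readers unfamiliar with the GFA-to-FASTA step mentioned above, here is a minimal Python sketch of what that conversion does. It is not gfatools itself, only an illustration assuming the usual GFA layout in which each segment line starts with `S`, followed by the contig name and the sequence, tab-separated. The function name `gfa_to_fasta` and the example contig names are hypothetical.

```python
import io

def gfa_to_fasta(gfa_lines, width=60):
    """Yield FASTA text lines for each segment (S) record in a GFA stream."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip links (L), headers (H), and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # the GFA does not store a sequence for this segment
        yield f">{name}"
        # Wrap the sequence at a fixed width, as FASTA files commonly do
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

# Hypothetical two-segment GFA, for illustration only
example = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tctgA\tACGTACGT\tLN:i:8\n"
    "L\tctgA\t+\tctgB\t+\t0M\n"
    "S\tctgB\tGGCC\tLN:i:4\n"
)
print("\n".join(gfa_to_fasta(example)))
```

For a real assembly you would pass an open file handle for the *.p_ctg.gfa file instead of the in-memory example.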
I don't know then. @chhylp123 can chime in. You should also use the more recent v0.9 which improves purging. You may also try the standalone purge_dups. It sometimes purges more.
Yeah, I recommend trying our latest version (v0.10), which can avoid mis-assemblies. There are two ways:
- Run hifiasm v0.10 with the option '--high-het'. Please note that this option will increase assembly time.
- Run hifiasm v0.10 without the option '--high-het', and postprocess the p_ctg with Sanger's Purge_dups: https://academic.oup.com/bioinformatics/article/36/9/2896/5714742.
We are also curious about the difference between these two ways. We are still improving '--high-het'.
Moving from v0.9 to v0.10 made some improvement. Using the '--high-het' setting reduced the duplicated BUSCOs only marginally and gave a much more fragmented genome, so the trade-off is not worth it: you still have to purge the duplicates and the assembly is more fragmented. I used two datasets from two species and had a similar result. Generally I am a bit disappointed with HiFi versus CLR data, with which I used to get much more contiguous assemblies. From what others are showing they seem to get amazing results, although I'm not sure if those are small, low-repeat genomes with high coverage. I'm not sure if this is a limited-coverage issue or whether there are still more settings to try, but it's early days.
Moth without highhet C:97.4%[S:84.9%,D:12.5%],F:0.5%,M:2.1%,n:1658 1,438 scaffolds
Moth with highhet C:95.1%[S:84.1%,D:11.0%],F:0.7%,M:4.2%,n:1658 1,805 scaffolds
beetle without highhet C:99.8%[S:30.7%,D:69.1%],F:0.2%,M:0.0%,n:1658 9,887 scaffolds
beetle with highhet C:99.9%[S:32.7%,D:67.2%],F:0.1%,M:-0.0%,n:1658 20,624 scaffolds
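To compare runs like the four above side by side, the one-line BUSCO summaries can be parsed into numbers. The sketch below is only an illustration: the function name `parse_busco` and the regular expression are my own, written against the summary format shown in this thread (including the `-0.0%` that BUSCO sometimes prints). The summaries and scaffold counts are copied from the results above.

```python
import re

# Matches strings like C:99.8%[S:30.7%,D:69.1%],F:0.2%,M:0.0%,n:1658
BUSCO_RE = re.compile(
    r"C:(?P<C>-?[\d.]+)%\[S:(?P<S>-?[\d.]+)%,D:(?P<D>-?[\d.]+)%\],"
    r"F:(?P<F>-?[\d.]+)%,M:(?P<M>-?[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary):
    """Return the BUSCO percentages as floats and n as an int."""
    m = BUSCO_RE.search(summary)
    if m is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    out = {k: float(v) for k, v in m.groupdict().items() if k != "n"}
    out["n"] = int(m.group("n"))
    return out

# Summaries and scaffold counts as posted in the thread
runs = {
    "moth, default": ("C:97.4%[S:84.9%,D:12.5%],F:0.5%,M:2.1%,n:1658", 1438),
    "moth, --high-het": ("C:95.1%[S:84.1%,D:11.0%],F:0.7%,M:4.2%,n:1658", 1805),
    "beetle, default": ("C:99.8%[S:30.7%,D:69.1%],F:0.2%,M:0.0%,n:1658", 9887),
    "beetle, --high-het": ("C:99.9%[S:32.7%,D:67.2%],F:0.1%,M:-0.0%,n:1658", 20624),
}

for label, (summary, scaffolds) in runs.items():
    b = parse_busco(summary)
    # Share of complete BUSCOs that are duplicated, plus fragmentation proxy
    print(f"{label}: D={b['D']}% of n={b['n']}, "
          f"D/C={b['D'] / b['C']:.2f}, scaffolds={scaffolds}")
```

Printing D next to the scaffold count makes the trade-off discussed above easy to see for each species.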
Thanks a lot. The results look interesting. May I ask what the N50 of the CLR assemblies and of the HiFi assemblies is? And what's the het rate of these two samples?
I'll need a week or two to do something a bit more comprehensive and get back to you, so if I'm quiet that will be why. But I will post when I've run the het rate and worked through a few assemblies. Things normally become clearer once I've progressed further.
Early data: k-mer size 31 for the beetle puts the het rate at 1.9, using the HiFi data, although I am also trying a lower k-mer size and adjusting settings. Changing settings made little difference. Adding a bit more data from a 3rd SMRT cell at half loading gave me a better assembly at 6565 scaffolds, so I think it's partially a coverage issue and partially a matter of finding the settings and tools that give the best results. I have all the data I can get from one individual, so I just have to make the best of it now. Coverage is likely around 20x. I also now have a 16 Mbp contig versus the previous longest at 3.5 Mbp, so a big improvement from some more coverage. BUSCO still has high duplication: C:99.9%[S:27.8%,D:72.1%],F:0.0%,M:0.1%,n:1658
I've had a bit more of a look at my best-coverage data for one species and I can see one or two issues. Something I observed with MaSuRCA before was large parts (500 kbp to 3 Mbp) assembled in duplicate on different contigs, usually showing one copy within a sequence and the other tagged on the end of another. This is not a haplotype but an assembly error. I usually cut the duplicate sequence off the contig that has it at the end. This explains at least part of my high BUSCO duplication. Also, I have 100+ small duplicated contigs representing the mitochondrion, just as different copies and from different start positions. I think there is an issue with repeats that needs looking at. I don't see this typically with Flye and Falcon but common to masurca too.
Thanks a lot. May I ask how you determine "showing one copy within a sequence and the other tagged on the end of another"?
Just to make sure: the sample comes from one beetle. Is that right? At this point, you may consider running the standalone purge_dups on p_ctg+a_ctg. Purge_dups can be more aggressive, which might be what you need here. I guess part of the problem is due to the fragmented assembly (the shortest N50 we have seen is 5.4Mb for redwood, but yours is ~300kb). Some hifiasm heuristics might have failed.
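Since N50 keeps coming up in this thread, here is a short Python sketch of how it is usually computed from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative only, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy example: made-up contig lengths in bp, not from the beetle assembly
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

In practice the lengths would come from the FASTA of the primary contigs, one value per contig.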
> I don't see this typically with Flye and Falcon but common to masurca too.
Flye and Falcon are designed for noisy reads. They don't intend to separate haplotypes and rely less on purge_dups. Nonetheless, HiCanu and hifiasm tend to produce better assembly overall for hifi data.
I assume you have ~15x coverage, which might be a little low. In some cases, one haplotype has enough coverage while the other doesn't. We observed similar examples on a milkfish dataset. HiFi-specific assemblers separate the two haplotypes, so the haplotype without enough coverage is broken. Non-HiFi-specific assemblers like Falcon or Flye (I don't know how current Flye works for HiFi) collapse the two haplotypes, so if only one haplotype lacks coverage, these assemblers can still go through it. However, this is not what we want. This is only my guess and I may be wrong. You can wait for more data to see the new result.
I have 27x coverage or thereabouts, but I'll check that again. I do lastz alignments in Geneious, which gives a nice visualisation of the sequence; it works the same on the command line, but there I would then look at a tab-delimited file. It's one beetle. I only have this amount of coverage from one insect, as the sequencer could only get 500 ng and 1x SMRT cell. Genome is around 450-500 Mbp in size.
I do get high BUSCO duplication of 72% using the same dataset with Flye, so maybe it's a HiFi problem rather than a software one. I assume it's the usual: for an easy life you need 60x coverage or higher. I'm used to CLR and don't see high BUSCO duplication there.
I don't know either. Thanks for pointing out problems. If you can release the data in the future, we will have a look. Overcoming these challenges is important whether they are caused by the HiFi data itself or by hifiasm.
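The coverage argument above comes down to simple arithmetic, sketched here in Python. The assumptions are mine: mean depth is total read bases divided by haploid genome size, and in a diploid the reads split roughly evenly between the two haplotypes wherever they are assembled separately. The function names and the read-base total are hypothetical; only the ~15x figure is taken from the comment above.

```python
def mean_depth(total_read_bases, haploid_genome_size):
    """Average sequencing depth over the haploid genome."""
    return total_read_bases / haploid_genome_size

def per_haplotype_depth(depth, ploidy=2):
    """Rough depth per haplotype, assuming reads split evenly across them."""
    return depth / ploidy

# Hypothetical numbers chosen only to reproduce the ~15x case discussed above
total_bases = 7_500_000_000   # assumed total bases in the HiFi reads
genome_size = 500_000_000     # assumed haploid genome size in bp

depth = mean_depth(total_bases, genome_size)
print(f"mean depth: {depth:.1f}x")
print(f"approx. depth per haplotype: {per_haplotype_depth(depth):.1f}x")
```

Under these assumptions each separated haplotype sees only about half of the nominal coverage, which is why a total that looks adequate can still leave one haplotype too thin to assemble through.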
Hi,
Very nice tool: easy to run and quick, which can otherwise be a major headache. I posted below and just realised I put in both the FASTA and FASTQ for both SMRT cells, so I duplicated the data. I'll try again and hopefully this helps.
Results look significantly better than HiCanu. I have 10,000 contigs. My problem is that the longest contig is 3.5 Mbp and BUSCO shows high duplication, which corresponds to the genome size being 80% too large. I think I only have 15x coverage; I have requested more data and might be able to push it to 20x, but will have to wait. Is there a parameter I could tweak to get this duplication down, rather than using an additional tool? It is early days and I need to play around and read more, but I'm asking in case I am missing anything obvious. C:99.9%[S:19.7%,D:80.2%],F:0.1%,M:-0.0%,n:1658