esolares / HapSolo

Reduction of Althaps and Duplicate Contigs for Improved Hi-C Scaffolding
GNU General Public License v2.0

Many contigs without BUSCOs - BUSCO bias #5

Closed mickey-spongebob closed 3 years ago

mickey-spongebob commented 3 years ago

Hi @esolares

Thank you for the nice tool! I was wondering if it is still advisable to use this tool if many contigs in the draft assembly do not have BUSCO genes. If the purging algorithm only applies to contigs that have a BUSCO gene, then while the final result may look nice, there may well still be many contigs that are duplicates but are ignored because they don't have a BUSCO gene. Did I understand the tool correctly?

For example, my genome assembly is currently 2,586 contigs with 99% complete genes but 38% duplicates. The BUSCO dataset I intend to use for the HapSolo analysis has ~300 genes. Therefore many contigs will not have a BUSCO gene. Will these contigs be ignored in the downstream purging algorithm?

Thank you so much and wishing you well!

Cheers

esolares commented 3 years ago

Hi,

Thank you for your kind words. HapSolo uses the contigs with BUSCOs to identify good threshold values for three alignment measures: percent identity, the percentage of the query aligned, and the ratio of the percentage of the query aligned to the percentage of the reference aligned. It searches for the lowest score in order to train the purging filter. Once it has found an "optimal" solution for those parameters, it classifies which contigs are althaps and which are primary contigs. Of course the program isn't perfect, but what is important is that you have a high number of duplicate contigs. If you want to run HapSolo quickly, I recommend using minimap2.

In your case you have 38% duplicates, so I would recommend you run HapSolo. You will have to run BUSCO on each individual contig though. Running it on the whole assembly for some reason identifies fewer BUSCOs.
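To make the per-contig step concrete, here is a minimal sketch of splitting an assembly into one FASTA file per contig, so that BUSCO can then be run on each file separately. It is not HapSolo's own preprocessing code; the function name `split_fasta`, the output file naming, and the assumption that contig names are unique are all illustrative.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths.

    Hypothetical helper: assumes contig names (first word of each header)
    are unique, otherwise later records overwrite earlier files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new header closes the previous contig's file.
                if handle is not None:
                    handle.close()
                fields = line[1:].split()
                name = fields[0] if fields else f"contig_{len(written) + 1}"
                # Replace characters that are awkward in file names.
                safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
                path = out_dir / f"{safe}.fasta"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


# Example call (paths are placeholders):
# contig_files = split_fasta("assembly.fasta", "contigs/")
```

Each returned path can then be passed to a separate BUSCO run.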

So in short: yes, contigs without BUSCOs are ignored when optimizing the parameters, and no, they are not ignored during purging. They are purged if they fall within the thresholds of the "optimized" parameters.
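A minimal sketch of that split between training and purging, under stated assumptions: the alignment record layout, the function names, the `missing + duplicated` over `single` score, and the plain random search are all stand-ins chosen for illustration and are not taken from HapSolo's code. Only the three thresholds (percent identity, percent of query aligned, query-to-reference ratio) come from the description above.

```python
import random
from collections import Counter

# Hypothetical alignment record:
# (query, reference, pct_identity, pct_query_aligned, pct_ref_aligned)


def purge_set(alignments, min_pid, min_q, min_ratio):
    """Return every query contig whose alignment meets all three thresholds.

    Applies to all contigs, whether or not they carry a BUSCO.
    """
    purged = set()
    for query, ref, pid, q_pct, r_pct in alignments:
        if ref in purged or r_pct <= 0:
            continue  # do not purge against a reference that is itself purged
        if pid >= min_pid and q_pct >= min_q and q_pct / r_pct >= min_ratio:
            purged.add(query)
    return purged


def busco_score(busco_hits, purged, total_buscos):
    """Stand-in score (lower is better), computed only from BUSCO-bearing contigs."""
    counts = Counter()
    for contig, ids in busco_hits.items():
        if contig not in purged:
            counts.update(ids)
    single = sum(1 for n in counts.values() if n == 1)
    duplicated = sum(1 for n in counts.values() if n > 1)
    missing = total_buscos - len(counts)
    return (missing + duplicated) / max(single, 1)


def random_search(alignments, busco_hits, total_buscos, iterations=200, seed=0):
    """Try random threshold triples and keep the one with the lowest score."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        params = (rng.uniform(50, 100), rng.uniform(0, 100), rng.uniform(0, 2))
        score = busco_score(busco_hits, purge_set(alignments, *params), total_buscos)
        if best is None or score < best[0]:
            best = (score, params)
    return best


# Toy data: hapB has no BUSCO, so it never influences the score.
alignments = [
    ("hapA", "primA", 97.0, 90.0, 40.0),  # carries a duplicated BUSCO
    ("hapB", "primB", 98.0, 95.0, 40.0),  # no BUSCO at all
    ("uniq", "primA", 70.0, 10.0, 5.0),   # weak, partial alignment
]
busco_hits = {"primA": ["b1"], "hapA": ["b1"], "primB": ["b2"], "uniq": ["b3"]}

score, params = random_search(alignments, busco_hits, total_buscos=3)
purged = purge_set(alignments, *params)
print(score, sorted(purged))
```

In this toy data, `hapB` contributes nothing to the score because it has no BUSCO, yet it is still purged: its alignment clears any thresholds that remove `hapA`, while `uniq` is kept.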

Thank you and you are welcome. Hope you are also doing well.

Thank you,

Edwin Solares, M.S. UC President's Fellow PhD Candidate in Comparative Genomics and Evolutionary Biology Department of Ecology and Evolutionary Biology Gaut Lab 5438 McGaugh Hall University of California, Irvine Irvine, CA 92697 USA

mickey-spongebob commented 3 years ago

Super cool!! Thank you so much :-)

PS: I'm currently running it with BLAT and will soon run BUSCO on each individual preprocessed contig, but thank you for the advice.

Also, I have already run purge_dups and purge_haplotigs to reduce the duplication rate from ~80% to 38%, so it's nice that HapSolo uses a different approach. I'm hoping it works well :-)

Best wishes

esolares commented 3 years ago

Sounds great. I haven't tried running the programs together yet. It might be good to run HapSolo once on its own and once together with the other tools to see how the results differ.

I'll soon be releasing HapSolo with all dependencies on Singularity, hopefully in the next few days.

I'm currently testing it.

Thank you,

Edwin
