gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

DNAapler circularization not included in final output #93

Open pabloati opened 1 week ago

pabloati commented 1 week ago

Hello George, such a great tool you have in hybracter!

I have been using it to assemble some ONT-only genomes, without using medaka, and I have seen that even though dnaapler is run on the chromosome, the reoriented chromosome is then not included in the final output. When hybracter merges the chromosomes and plasmids, it only takes the non-circularized chromosome.

I saw that you use the same script as with medaka, and in the selection step it just takes the genome from pre_polish rather than from reoriented. Also, would it make sense to include the plasmids in the recircularization? As of now, only the chromosomes are recircularized.

Additionally, I believe that flye can perform better assemblies if all the reads are given to it and the subsampling is specified in its own settings, rather than with Filtlong. However, I am still testing this situation with some genomes that I have, to take into account not only the structure of the assembly, but also the final quality.

Best, Pablo

gbouras13 commented 1 week ago

Hi @pabloati ,

Thanks for your kind words mate. To answer your questions/comments:

1) Can you provide me with an example? Dnaapler will be run if the 'chromosome' is circular. If the 'chromosome' (a contig above the -c value) is not circular, it won't be run. This either means your bacterium has a linear chromosome, or the assembly probably didn't fully circularise. I did it this way in case you have a bacterium with linear chromosomes, but it is likely you and most people don't, so please be mindful of that. If dnaapler is run, then the reoriented chromosome should be included - would you be able to provide me with an example of one where it isn't run?

With question 2, I am not sure what you mean by the pre_polish; could you point me to where you saw it? And with plasmids and reorientation, plassembler handles it, so I don't add that to dnaapler (so as not to do it twice).
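As a rough illustration of the rule described above (dnaapler only runs on a contig that is both above the -c value and circular), here is a minimal sketch. The `Contig` class, the `should_reorient` function and the example lengths are hypothetical and not taken from hybracter's code; whether the length comparison is strict or inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int      # contig length in bases
    circular: bool   # whether the assembler marked the contig as circular


def should_reorient(contig: Contig, min_chrom_length: int) -> bool:
    """Return True only for contigs that count as a chromosome AND are circular.

    min_chrom_length stands in for the -c value; the inclusive comparison
    is an assumption made for this sketch.
    """
    is_chromosome = contig.length >= min_chrom_length
    return is_chromosome and contig.circular


# Made-up contigs: a circular chromosome, a linear (or incompletely
# circularised) chromosome, and a small circular plasmid.
contigs = [
    Contig("contig_1", 4_800_000, True),
    Contig("contig_2", 4_700_000, False),
    Contig("contig_3", 60_000, True),
]

reoriented = [c.name for c in contigs if should_reorient(c, 2_500_000)]
print(reoriented)
```

Under these assumptions only `contig_1` would be passed to dnaapler: `contig_2` is long enough but not circular, and `contig_3` is circular but below the chromosome cutoff (plasmids being left to plassembler).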

On your last point, more depth could well be favourable - I'd recommend increasing --subsample_depth from the default of 100 to something crazy high like 1000000 to achieve your goals (in practice it will mean filtlong does no subsampling).

George
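To see why a very large --subsample_depth effectively switches subsampling off, here is a small arithmetic sketch. It assumes the subsampling target is computed as chromosome size multiplied by depth and handed to filtlong as a total base budget (for example via its `--target_bases` option); that wiring has not been checked against hybracter's source, and the function names and example numbers are hypothetical.

```python
def target_bases(chromosome_size: int, subsample_depth: int) -> int:
    """Base budget for subsampling: estimated chromosome size times depth."""
    return chromosome_size * subsample_depth


def subsampling_is_noop(total_read_bases: int,
                        chromosome_size: int,
                        subsample_depth: int) -> bool:
    """True when the budget is at least as large as the read set,
    so no reads would need to be dropped to meet it."""
    return target_bases(chromosome_size, subsample_depth) >= total_read_bases


# Made-up example: a 5 Mbp chromosome sequenced to roughly 300x (1.5 Gbp of reads).
chrom_size = 5_000_000
read_bases = 1_500_000_000

default_noop = subsampling_is_noop(read_bases, chrom_size, 100)
huge_noop = subsampling_is_noop(read_bases, chrom_size, 1_000_000)
print(default_noop, huge_noop)
```

With the default depth of 100 the budget is 500 Mbp, so about two thirds of the bases would be discarded; with a depth of 1000000 the budget far exceeds the data and everything is kept.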

pabloati commented 23 hours ago

Hi George, thanks for your answer,

Regarding dnaapler, it was my own mistake. I thought that we had Hybracter updated to the latest version on our cluster and it wasn't. With 0.8.0 that issue is fixed.

Regarding the subsampling, in my experience flye also tends to underperform if there is a lot of depth (usually over 200x). Perhaps it is due to having too many "short" reads, and I feel that read subsetting works better for flye if it is done based on size, but as the subset you make in hybracter also feeds into the medaka polishing, I get the tradeoff. I guess I will run hybracter normally and try to assemble the incomplete genomes manually.

One last question: the threads option does not seem to limit the number of threads that hybracter uses on my system (this time on version 0.8.0 ;) ). I set it to 20 and the CPU usage went well over that threshold. I looked at your script and saw that you use the option --jobs, which is similar to --cores, but --cores is meant for running locally rather than on a cluster. I changed it to --cores and now the usage is within the expected range. Maybe you could add an option for the user to choose local or cluster, and use --cores or --jobs respectively?

Thanks for everything and best, Pablo
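The suggestion above could look something like the following sketch, which picks the Snakemake parallelism flag from a user-chosen mode. `--cores` and `--jobs` are real Snakemake options, but the function name, the `mode` parameter and its two values are hypothetical and not part of hybracter.

```python
from typing import List


def snakemake_parallelism_args(threads: int, mode: str = "local") -> List[str]:
    """Build the parallelism arguments for a Snakemake call.

    mode="local"   -> cap total CPU cores on one machine with --cores
    mode="cluster" -> cap the number of submitted jobs with --jobs
    Both the function and the mode switch are illustrative assumptions.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    if mode == "local":
        return ["--cores", str(threads)]
    if mode == "cluster":
        return ["--jobs", str(threads)]
    raise ValueError(f"unknown mode: {mode!r} (expected 'local' or 'cluster')")


local_args = snakemake_parallelism_args(20, mode="local")
cluster_args = snakemake_parallelism_args(20, mode="cluster")
print(local_args, cluster_args)
```

The returned list would be appended to the rest of the Snakemake command line, so a request for 20 threads on a workstation becomes `--cores 20` while the same request on a cluster becomes `--jobs 20`.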