Closed richardstoeckl closed 2 months ago
Hi @richardstoeckl ,
This is a very reasonable request.
Given you have done 90% of the implementation here, I certainly can't say no! Hopefully this week I'll find some time to push an update with this and a couple of other things, including the archaea reorientation :)
George
@richardstoeckl ,
I had a look at this this arvo. Are you getting reasonable 21-mer counts for your data? They are a massive overcount for me (even with pretty new Nanopore data). With R9 SUP I get 9.5M for S. aureus (should be 2.8M), and with R10.4.1 I get 4.5M for E. faecalis (should be 3M). Therefore, I am reluctant to include this (especially for old Nanopore data!)
Disregard ^^ I wasn't using -ci10 - works fine after I do that :)
George
@richardstoeckl - your code is amazing. Seems to be working out of the box. I'll run it through the CI test suite and merge it - feel free to make a cosmetic PR if you want to be recognised on the repo as a contributor, I'll gladly merge it - thanks again!
George
That would be an honor! Glad to be able to help improve this amazing pipeline :)
Hi George,
I wanted to discuss the option of adding an 'auto' mode to the -c parameter.
In my opinion, the assembly of novel prokaryotes, or at least the assembly of genomes that will be taxonomically classified AFTER they have been assembled, is a common use case of assemblers in general. Therefore, it doesn't surprise me that both shovill and dragonflye contain a genome size estimation step.
Both shovill and dragonflye use KMC for genome size estimation based on k-mer counting (see Line 136 in shovill and Line 214 in dragonflye). Both use the basic but well-known formula to estimate the genome size:
For a given sequence of length L and a k-mer size of k, the total number of possible k-mers is given by (L - k) + 1 (see this paper or, for example, this tutorial).
A way to implement this in snakemake would be something like this:
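To illustrate the idea only, here is a minimal Python sketch of k-mer based genome size estimation. It is not the Snakemake rule and it does not call KMC. The function names `count_kmers` and `estimate_genome_size`, the `min_count` parameter (meant to mirror the effect of the -ci10 cutoff mentioned in this thread) and the toy data are all hypothetical. The assumption is that the genome size is roughly the number of distinct canonical k-mers seen at least `min_count` times, so that rare k-mers created by sequencing errors are ignored.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_kmers(reads, k=21):
    """Count canonical k-mers (the smaller of a k-mer and its reverse complement)."""
    counts = Counter()
    for read in reads:
        # A read of length L contains (L - k) + 1 k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            revcomp = kmer.translate(COMPLEMENT)[::-1]
            counts[min(kmer, revcomp)] += 1
    return counts


def estimate_genome_size(counts, min_count=10):
    """Hypothetical estimator: distinct k-mers observed at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)


# Toy data (assumption): a random 2,000 bp "genome" sequenced 12 times without errors.
k = 21
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
reads = [genome] * 12

# Plus 50 shorter reads that each carry one substitution error.
for i in range(50):
    start = i * 30
    read = list(genome[start:start + 200])
    read[100] = "ACGT"[("ACGT".index(read[100]) + 1) % 4]  # swap in a different base
    reads.append("".join(read))

counts = count_kmers(reads, k)
est_unfiltered = estimate_genome_size(counts, min_count=1)
est_filtered = estimate_genome_size(counts, min_count=10)

print("k-mers in the true genome, (L - k) + 1:", len(genome) - k + 1)
print("estimate without a cutoff:", est_unfiltered)
print("estimate with min_count=10:", est_filtered)
```

Without a cutoff, every error-containing k-mer is counted as if it were part of the genome, which inflates the estimate. With the cutoff, only k-mers supported by enough reads contribute. A suitable cutoff depends on coverage, so the value of 10 here is just an example.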
I know and appreciate that you want to keep hybracter as simple and stable as possible, and adding an additional dependency like KMC would add complexity. However, I think that this could add a lot of value in terms of being able to automate the assembly process. Also, KMC is actively developed, so any bugs that could cause problems will also be fixed.
Best wishes, Richard