Closed mmokrejs closed 3 years ago
Hello,
The -C
parameter is not actually related to your read length at all, but it instead controls the length of pre-computed k-mers from the BWT. If you set it to 10, it will precompute the locations of all 10-mers. This is approximately 6^10 different k-mers and generally takes about 1GB of memory, it also does this relatively quickly in terms of run-time. Setting it lower will use less memory but have longer downstream query times, setting it higher will use more memory but save query time downstream.
For example, setting it to 11-mers, would use about 6GB and probably ~1 minute of pre-compute, 12 is 36GB and pre-compute in 6 minutes, etc. It will continue to grow exponentially and is not really practical beyond a certain point (probably in the 15-20 range, but even then you need lots of memory and the pre-compute becomes quite expensive). In practice, 8 is a good default because memory and run-time are negligible for basically every use case. We typically use 10 because 1GB of memory is relatively negligible for us, and the relatively small compute time is worth it for downstream acceleration.
In most use cases, I would recommend you use C<=10. Does this answer your question? Is there some other parameter you're trying to tune?
I thought to understand from the README.md
that larger is better, so I asked for 400GB of RAM. But it seems there is no advantage to derive the larger k-mers from the index? Would you mind improving the README.md
to make it clearer why I should not even go for -C 100
? Why not? I don't mind longer build time if the results are better. If tehy won't be any better then I am missing something. OK, this is a cache but still, ...
Anyway, I think the app should not crash.
When you pass -C 100
, you're asking it to pre-compute 6^100 k-mers which is a truly astronomical number and several orders of magnitude more than any existing dataset. (For example, 6^100 is approximately 10^77, this is close to the estimation of the number of atoms in the universe). The reason the app is crashing is because the computer cannot allocate that much memory, and then fails.
For your use case, I just recommend you start with C=10. Then, if you want, you can increase it to see at what point you reach an optimal practical run-time for your datasets.
Hi, seems the
-C
parameter is related to read length. But reads were 125nt in length but even with-C 100
it crashes.-C 10
is fine. But why 10-mers? Is there any use for them? I use current git master checkout.