Fix typo and approximate number of markers

AustinHartman commented 3 years ago

The change to the approximate number of markers lines up better with what I see in the logs, e.g.:

Selected 104646 10-mers as markers out of 1048576 total.
Requested inclusion probability: 0.1.
Actual fraction of marker k-mers: 0.0997982.

paoloczi commented 3 years ago

Early in its development, Shasta used raw sequence, not run-length encoded (RLE) sequence as it does now. The message

Selected 104646 10-mers as markers out of 1048576 total.

reports the total number of k-mers selected as markers, counting all k-mers, without regard to whether they are RLE k-mers or not. In reality, only a subset of those, the ones that have no repeated bases, are actually used as markers, because the ones with repeated bases never appear in the RLE sequence of the reads, by construction.

The total number of k-mers of length k is 4^k. The number of RLE k-mers of the same length is much lower, 4×3^(k-1), because at every position other than the first you have only 3 choices, not 4: each base must differ from the one before it.

For k=10, the total number of k-mers is 1048576, but the number of RLE k-mers is 78732. If we choose 10% of all k-mers, we end up with about 104000 k-mers, of which about 7900 are RLE k-mers and are actually used in the assembly. Therefore the documentation is correct, but the message is incorrect/misleading.
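
The arithmetic above can be checked with a short Python sketch. It is not Shasta code: the function names are made up for illustration, and it assumes an RLE k-mer is simply a k-mer over A, C, G, T in which no two adjacent bases are equal. The brute-force count is there only to confirm the closed-form formula for a small k.

```python
from itertools import product


def total_kmers(k: int) -> int:
    """Number of all k-mers over a 4-letter alphabet: 4^k."""
    return 4 ** k


def rle_kmers(k: int) -> int:
    """Number of k-mers with no two adjacent bases equal: 4 * 3^(k-1)."""
    return 4 * 3 ** (k - 1)


def rle_kmers_brute_force(k: int) -> int:
    """Count the same k-mers by enumeration (only practical for small k)."""
    return sum(
        1
        for kmer in product("ACGT", repeat=k)
        if all(a != b for a, b in zip(kmer, kmer[1:]))
    )


k = 10
probability = 0.1  # requested inclusion probability from the log message

print(total_kmers(k))                       # 1048576
print(rle_kmers(k))                         # 78732
print(round(probability * total_kmers(k)))  # 104858, expected count over all k-mers
print(round(probability * rle_kmers(k)))    # 7873, expected count over RLE k-mers
```

The actual count in the log (104646) differs slightly from the expected 104858, as it would if each k-mer were selected at random with the requested probability.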

I apologize for the confusion. I should improve that message. As a partial saving grace, if you look in the log output a couple of lines after that message you see the following:

The above statistics include all k-mers, not just those present in run-length encoded sequence.

AustinHartman commented 3 years ago

Thank you for the detailed response. I added a couple of sentences to clarify the number given in the docs.

paoloczi commented 3 years ago

Thank you!

chanzuckerberg / shasta

Fix typo and approximate number of markers #238