dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

sketch size (-S, with capital S?) #82

Closed gaboentropy closed 2 years ago

gaboentropy commented 2 years ago

Hi Dan,

I've been testing dashing, and I thought I was using a sketch size of 14, as per the article. The article lists -s (lowercase s) as the option for the sketch size, but the usage instructions show -S (capital S) for sketch size and -s for spacer, so I've used -S so far. However, today I tried presketching, which names its output files automatically:

The command: dashing sketch -k 21 --sketch-size 14 FNA/*fna*

The output file names end in: .w.21.spacing.14.hll

So it looks as if -S is for spacing, not for sketch size. I tried the long form, --sketch-size, but it gave me the same ending.

I therefore tried -s (lowercase), and it gave me a segmentation fault.

So, how do I control sketch size?

:(

P.S. The "w" also worries me, since I use -k to set the k-mer length, but w seems to mean window size ...

(running dashing version v1.draft-3-g90f0 on macOS with an Intel processor)

dnbaker commented 2 years ago

Hi,

You're right - there's an error in the usage. The spacing argument is "-s", not "-S", while "-S" sets the sketch size. When the file name contains just the word "spacing", no spacing is being applied; otherwise, there would be a string specifying the number of spaces between the characters used in the spaced seeds. You can safely ignore that portion of the name. It is included so that Dashing can tell apart sketches made with different spacing schemes, letting you run analyses with different spacings without accidentally using the wrong one.

k determines how many characters are in a seed. If you want contiguous seeds (default and typical behavior), then it just uses contiguous k-mers from the reads.

-w determines the window size over which to collect k-mers. Each window of size w > k yields w - k + 1 k-mers, and Dashing selects the k-mer with the minimum hash value from that window. Basically, you only want to use windows if you want to reduce the sequences to minimizers and process only one k-mer per window. You probably won't want that unless you're trying to speed up parsing of large read sets.
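As a rough illustration of the window selection described above, here is a minimal Python sketch. It is not Dashing's code: the function names are made up for this example, and the hash is a stand-in (BLAKE2b from the standard library), since Dashing uses its own hash function. Collapsing consecutive repeats of the same pick is also a simplification chosen here.

```python
import hashlib


def kmer_hash(kmer):
    """Stand-in 64-bit hash (assumption: not the hash Dashing uses)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_kmers(window, k):
    """All contiguous k-mers in one window; a window of size w yields w - k + 1."""
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def minimizers(seq, k, w):
    """Pick the minimum-hash k-mer from each window of size w along seq."""
    if w < k:
        raise ValueError("window size w must be at least k")
    picked = []
    for start in range(len(seq) - w + 1):
        best = min(window_kmers(seq[start:start + w], k), key=kmer_hash)
        # Adjacent windows often share their minimum; keep one copy per run.
        if not picked or picked[-1] != best:
            picked.append(best)
    return picked


if __name__ == "__main__":
    seq = "ACGTTGCATGCCGATAGCTA"
    print(len(window_kmers(seq[:10], 4)))  # k-mers in one window of size 10
    print(minimizers(seq, k=4, w=10))
```

With w equal to k, every window holds exactly one k-mer, so this reduces to plain contiguous k-mers, which matches the default behavior described above.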

And the "spacing" argument means that you can specify a "spaced seed" pattern listing which characters to ignore. After each character you use, you list how many characters to skip, so you make a list that is k - 1 entries long to account for k characters, since there are only k - 1 spacings between them. For instance,

0,1,0,1,0,1,0,1,0,1, which has a total of 10 spacings, corresponds to 11 characters in the seed. KK$KK$KK$KK$KK$K describes the k-mer pattern this corresponds to: positions marked $ are ignored and positions marked K are kept.
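To make the mapping from a spacing list to a seed pattern concrete, here is a small Python sketch that expands the example above. The helper names are hypothetical and this only illustrates the notation; it says nothing about how Dashing parses the argument internally.

```python
def spacing_to_pattern(spacing):
    """Expand a comma-separated spacing list into a K/$ seed pattern.

    Each entry is the number of ignored positions ($) that follow a kept
    character (K), so k - 1 entries describe a seed with k kept characters.
    """
    gaps = [int(field) for field in spacing.split(",") if field.strip()]
    pattern = "K"
    for gap in gaps:
        pattern += "$" * gap + "K"
    return pattern


def apply_pattern(sequence, pattern):
    """Keep only the characters of sequence that sit under a K."""
    return "".join(base for base, mark in zip(sequence, pattern) if mark == "K")


if __name__ == "__main__":
    pattern = spacing_to_pattern("0,1,0,1,0,1,0,1,0,1")
    print(pattern)              # KK$KK$KK$KK$KK$K
    print(pattern.count("K"))   # 11
    print(apply_pattern("ACGTACGTACGTACGT", pattern))
```

The pattern spans 16 positions but keeps only 11 of them, which is why 10 spacings correspond to 11 characters in the seed.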

But for your case, just use upper-case S to specify the sketch size (e.g., -S 14 to use 2^14 = 16384 registers). You probably don't want to do anything with the window size -w unless you're working with sequencing datasets, and you can safely ignore the "spacing" in the cached file path - it isn't being spaced.
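For a feel of what the sketch size controls, the following Python snippet converts the -S value into a register count. The error figure uses the textbook 1.04 / sqrt(m) approximation for the classic HyperLogLog estimator; treating it as a guide for Dashing is an assumption, since Dashing's own estimators may behave differently. The function names are made up for this example.

```python
import math


def registers(log2_size):
    """Number of registers for a sketch size given as a power of two."""
    return 1 << log2_size


def classic_hll_relative_error(log2_size):
    """Textbook relative standard error of classic HyperLogLog (assumption:
    only a rough guide here, not a statement about Dashing's estimators)."""
    return 1.04 / math.sqrt(registers(log2_size))


if __name__ == "__main__":
    for size in (10, 12, 14, 16):
        print(size, registers(size), f"{classic_hll_relative_error(size):.4f}")
```

Each increment of the sketch size doubles the register count, and under this approximation the expected error shrinks by a factor of about 1.41.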

Thanks for asking, and let me know if you need any further help!

Best,

Daniel

gaboentropy commented 2 years ago

Thanks Daniel!