dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

Add direct setsketch generation options #38

Closed dnbaker closed 2 years ago

dnbaker commented 2 years ago

The preferred way to generate sketches is still to build full Continuous SetSketch registers and then tune the a and b parameters for compression from the range of the data. We now also support methods that generate the SetSketch sketches directly.

This is currently only supported in file-level processing (i.e., it is not supported in --parse-by-seq mode).

Most generally, this can be done by:

  1. Setting --fastcmp V / --regbytes V, where V is the number of bytes per register. (Only 1, 2, and 4 are supported.)
  2. Setting --setsketch-ab A,B, where A is the offset and B is the log base.
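
To illustrate what a and b control, here is a minimal Python sketch of register truncation in the form given in the SetSketch paper, where a continuous value x is mapped to floor(1 - log_b(x / a)) and clamped to the register width. This is not Dashing 2's code: the function name `quantize_register` is hypothetical, and the exact formula and clamping Dashing 2 applies may differ.

```python
import math

def quantize_register(x, a, b, regbytes):
    """Map a positive continuous register value x to an integer register.

    Assumption: follows the SetSketch paper's form floor(1 - log_b(x / a)),
    clamped to [0, 2**(8 * regbytes) - 1]. Dashing 2 may differ in detail.
    """
    if regbytes not in (1, 2, 4):
        raise ValueError("only 1, 2, and 4 bytes per register are supported")
    if x <= 0 or a <= 0 or b <= 1:
        raise ValueError("need x > 0, a > 0, and b > 1")
    max_value = (1 << (8 * regbytes)) - 1          # largest storable register
    k = math.floor(1.0 - math.log(x / a) / math.log(b))
    return max(0, min(max_value, k))               # clamp into the register

# Smaller continuous values map to larger integer registers.
print(quantize_register(1.0, a=20.0, b=1.2, regbytes=1))
print(quantize_register(0.5, a=20.0, b=1.2, regbytes=1))
```

Under this formulation, a shifts which magnitudes land inside the representable window and b sets how finely that window is divided, which is why both are best matched to the range of the data.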

However, it's probably easier to use one of our preset methods, where you can choose between bytes, shorts, and words.

  - --fastcmp-bytes uses a = 20, b = 1.2, and regbytes = 1.
  - --fastcmp-shorts uses a = 0.06, b = 1.0005, and regbytes = 2.
  - --fastcmp-words uses a = 19.77, b = 1 + 1.097235e-08, and regbytes = 4.
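
As a rough sanity check on the presets, the Python sketch below tabulates them and computes two derived quantities: the natural log of the ratio between the largest and smallest values the register levels can distinguish, and the storage for a sketch of a given size. The preset values are copied from this description; the dictionary, the helper names, and the register count of 1024 are illustrative assumptions, not part of Dashing 2.

```python
import math

# Preset values as listed in this description; the dict itself is illustrative.
PRESETS = {
    "--fastcmp-bytes":  {"a": 20.0,  "b": 1.2,               "regbytes": 1},
    "--fastcmp-shorts": {"a": 0.06,  "b": 1.0005,            "regbytes": 2},
    "--fastcmp-words":  {"a": 19.77, "b": 1 + 1.097235e-08,  "regbytes": 4},
}

def log_dynamic_range(b, regbytes):
    """ln(largest / smallest) magnitude spanned by all register levels."""
    levels = 1 << (8 * regbytes)        # number of distinct register values
    return (levels - 1) * math.log(b)   # each level is a factor of b apart

def sketch_bytes(num_registers, regbytes):
    """Storage for the registers of one sketch, in bytes."""
    return num_registers * regbytes

for flag, p in PRESETS.items():
    rng = log_dynamic_range(p["b"], p["regbytes"])
    size = sketch_bytes(1024, p["regbytes"])   # 1024 registers: hypothetical
    print(f"{flag}: ln(range) = {rng:.1f}, bytes per sketch = {size}")
```

Each preset spans many orders of magnitude, with the wider registers trading extra memory for a finer base b.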

These won't be maximally tuned for your collections, but they should be accurate, and, importantly, the peak memory required by your program should be substantially reduced.

lgtm-com[bot] commented 2 years ago

This pull request fixes 1 alert when merging 75816c8306d108c3b301636669865ad6aebd4e37 into 42b123d93410edd883fca84741e647a9aa1a7f61 - view on LGTM.com

fixed alerts: