The preferred method of generation is to use full Continuous SetSketch registers and then tune the a and b parameters for compression from the range of the data. We also now support methods that generate the SetSketch sketches directly.
This is currently supported only in file-level processing (i.e., it is not supported in --parse-by-seq mode).
Most generally, this can be done by:
Setting --fastcmp V (or --regbytes V), where V is the number of bytes per register. (Only 1, 2, and 4 are supported.)
Setting --setsketch-ab A,B, where A is the offset and B is the log base.
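To make the roles of the two parameters concrete, here is a minimal sketch of how a continuous register value might be quantized into a fixed-width integer register. It assumes a mapping of the form floor(1 - log_b(x / a)), clamped to the range that fits in the register, in the style of the SetSketch formulation. The function name `quantize_register` is hypothetical, and the exact formula used internally may differ.

```python
import math

def quantize_register(x, a, b, regbytes):
    """Map a positive continuous register value x to an integer register.

    Assumption: the mapping is floor(1 - log_b(x / a)), clamped to
    [0, 2**(8 * regbytes) - 1]. This is an illustration of "offset a,
    log base b", not a copy of the tool's internal code.
    """
    if regbytes not in (1, 2, 4):
        raise ValueError("only 1, 2, and 4 bytes per register are supported")
    if x <= 0 or a <= 0 or b <= 1:
        raise ValueError("requires x > 0, a > 0, and b > 1")
    max_reg = 2 ** (8 * regbytes) - 1
    k = math.floor(1 - math.log(x / a) / math.log(b))
    # Values outside the representable range saturate at the ends.
    return min(max(k, 0), max_reg)

# Smaller continuous values map to larger register values.
for x in (1e3, 20.0, 1.0, 1e-3):
    print(x, quantize_register(x, a=20.0, b=1.2, regbytes=1))
```

Under this assumed mapping, a shifts which continuous values land in the representable window, and b sets how finely that window is divided. This is why the two are best tuned to the range of the data.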
However, it's probably easier to use one of our preset methods, where you can choose between bytes, shorts, and words.
--fastcmp-bytes uses a = 20, b = 1.2, and regbytes = 1.
--fastcmp-shorts uses a = 0.06, b = 1.0005, and regbytes = 2.
--fastcmp-words uses a = 19.77, b = 1 + 1.097235e-08, and regbytes = 4.
These won't be maximally tuned for your collections, but they should be accurate, and, importantly, the peak memory required by your program should be substantially reduced.
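As a rough illustration of the trade-off between the presets, the following sketch computes the storage per sketch and the dynamic range each preset can represent. The a, b, and regbytes values are the ones listed above. The sketch size of 1024 registers and the 8-byte width for a continuous register are assumptions made for this example only, and the range estimate assumes each register step multiplies the represented value by b.

```python
import math

# Preset values as stated above: (a, b, regbytes).
PRESETS = {
    "--fastcmp-bytes": (20.0, 1.2, 1),
    "--fastcmp-shorts": (0.06, 1.0005, 2),
    "--fastcmp-words": (19.77, 1 + 1.097235e-08, 4),
}

NUM_REGISTERS = 1024      # hypothetical sketch size, chosen for illustration
CONTINUOUS_REG_BYTES = 8  # assumption: one 64-bit float per continuous register

def sketch_bytes(regbytes, num_registers=NUM_REGISTERS):
    """Storage for one sketch's registers, in bytes."""
    return regbytes * num_registers

def decades_covered(b, regbytes):
    """log10 of the ratio between the largest and smallest representable
    values, assuming each register step scales the value by b."""
    steps = 2 ** (8 * regbytes) - 1
    return steps * math.log10(b)

for flag, (a, b, regbytes) in PRESETS.items():
    size = sketch_bytes(regbytes)
    saving = sketch_bytes(CONTINUOUS_REG_BYTES) / size
    print(f"{flag}: {size} bytes per sketch ({saving:.0f}x smaller), "
          f"about {decades_covered(b, regbytes):.1f} decades of range")
```

Under these assumptions, narrower registers save more memory, while wider registers allow a base b closer to 1 and therefore finer resolution over a comparable range.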