Dashing2 is the second version of the Dashing sequence sketching and comparison system.
There have been several major changes, but you can still can get a quick start to compare a group of sequence collections here. For instructions on parsing, see Parsing below. Installation instructions can be found below.
tl;dr --
Input Formats -- See inputs below for details.
--protein
)--long-kmers
)--window-size
to be > k--spacing
option, which provides the number of ignored characters in between used characters for seeds.--countdict
)--bagminhash
or --multiset
), and ProbMinHash for discrete probability distributions (--prob
)--seq
, a sequence of minimizer values are emitted as a string.--leafcutter
--bigwig
) [Weighted Interval Set]--bed
)There's also a two-step process; we can use the LSH index to generate probable candidates, but switch to exact distance comparisons after refinement.
This can be enabled with the --refine-exact
flag when using filtered exact-set (--set
) or exact-multiset (--countdict
).
For sketching algorithms that are not OrderMinHash (whether the default SetSketch, BagMinHash or ProbMinHash), this means that candidates are generated with fast, compressed sketches, but final results are produced using the full hash values.
If --edit-distance
has been enabled, --refine-exact
and --compute-edit-distance
cause candidates to be generated by LSH table querying, but whose final distances are computed by exact edit ditsance.
This allows one to prune the edit distances with LSH but still get exact results out.
Outputs
(N (N - 1)) / 2
.--qfile/-Q
against --ffile/-F
) by providing query/reference sets--similarity-threshold
) and top-k results (--topk <k>
)
--asymmetric-all-pairs
), which performs c(A, B) for all-pairs, resulting in an NxN matrix.All of these can be emitted in binary format --binary-output
to avoid parsing/print formatting costs.
dashing2 cmp --help
for more details.
Dashing2 sketches input datasets, and then compares resulting sketches.
The formats supported are:
• FastA and FastQ
By default, we do set sketching (One-Permutation MinHash or SetSketch Minhash), but we can do multiset (BagMinHash) or discrete prob set (ProbMinHash) sketching. BagMinhash is probably most appropriate for genome assemblies (--multiset), and ProbMinhash (--prob) is most appropriate for splicing and expression datasets, due to normalization.
We also have untested support of edit distance LSH. It seems to work, but hasn't been carefully vetted.
Counting is necessary for both BagMinHash and ProbMinHash. Default behavior, which works with sets and discards quantities, does not require it.
For counting, we either use a (single-row) CountSketch to approximate weighted set comparisons or we compute exact counts using a hash table. Counting is exact by default.
To enable count-sketch approximated counting, use the flag --countsketch-size [size]. The larger this parameter, the closer to exact the weighted sketching will be.
In addition to creating sketches for m-mer sets, Dashing2 can perform full m-mer sets and m-mer count dictionaries. This can be enabled with --set or --countdict, respectively. This will be slower, but exact.
Default normalization treats each base pair in an interval as equally weighted. For SPACE_SET, this doesn't matter, but for weighted sketching, it does.
That is, an interval from 0:100-200 will have twice the weight as 0:100-150. If, instead, you want to treat each interval as weighing the same as another interval, enable --normalize-intervals.
Similar to BED, except that the normalize flag causes the splicing event weight to be the fraction of reads supporting the junction rather than the absolute number of reads.
IE, 3/5 would have weight 0.6 if --normalize-intervals is enabled and weight 3 otherwise.
Sketches can be: SetSketch (One-permutation minhash or SetSketch) BagMinHash ProbMinHash OrderMinHash
However, OrderMinHash is only available for sequences.
When handling multiplicities, you can use exact counting, which may be slower, or you can approximate the count vectors with a single-row count-sketch by setting opts.cssize_ > 0.
ProbMinHash is usually significantly (2-20+x) faster than BagMinHash, although multiset jaccard may be more appropriate for some problems. For instance, expression and chromatic accessibility might be better considered discrete probability distributions and therefore fit ProbMinHash, whereas genomic sequences better match the multiset concept and benefit from BagMinHash, which is a MinHash algorithm for weighted sets.
We expect most usage to involve the sketch
subcommand; to sketch and perform distances, add --cmpout <file>
, where
Use 1 -- Sketch + All-pairs Compare Sequence Collections
dashing2 sketch [options] --cmpout <outfile> genome1.fa genome2.fa <...>
# Alternate, if filenames are in F.txt
# dashing2 sketch [options] --cmpout <outfile> -F F.txt
Full usage is found via dashing2 --help
and dashing2 <subcommand> --help
, where
dashing2 sketch
performs sketching/summarization of a set of input files, or sequence-by-sequence processing of one or more sequence files.
It also optionally performs comparisons and emits results to --cmpout
.
Adding --cache
causes Dashing2 to cache sketches to disk adjacent to the input files;
this location can be changed with --outprefix
.
We support a variety of alphabets -- DNA, Protein, and reduced amino acid alphabets for long-range homology (--protein14, --protein8, --protein6).
Alternate file-types supported include BigWigs (--bigwig
), BED (--bed
), and LeafCutter outputs (--leafcutter
).
Clustering: scipy.cluster.hierarchy and fastcluster.hierarchy yield fast, concise clusterings, if distances are emitted.
To perform query-set vs reference-set comparison, see -Q/--qfile
usage -- this yields a full rectangular matrix.
This is particularly useful for asymmetric similarities, such as containment.
Use 2 -- Sketch + Top-k NN graphs
For many applications, the quadratic space and time complexity of pairwise comparisons is prohibitive; we use locality-sensitive hashing (LSH) table to only perform comparisons between candidates highly probable to be nearest neighbors.
To generate a K-NN graph between a collection of sequences, with k = 250:
dashing2 sketch <comparison options...> --cmpout <outfile> --topk 250 -F F.txt
The output file will be a table with the names and similarities for the nearest neighbors so generated; the corresponding binary output is Compressed Sparse Row-notation results.
Clustering: This can be used for spectral clustering community detection algorithms such as Louvain and Leiden.
Use 3 -- Sketch + Jaccard-thresholded similarity graphs
Alternative to --topk [k]
, once can select a Jaccard similarity threshold below which the algorithm can ignore.
dashing2 sketch <comparison options...> --cmpout <outfile> --similarity-threshold 0.7 -F F.txt
The output file will be a table with the names and similarities for the nearest neighbors so generated; the corresponding binary output is Compressed Sparse Row-notation results.
Clustering: This can be used for spectral clustering community detection algorithms such as Louvain and Leiden.
Use 4 -- Protein sequence similarity search
The feature --parse-by-seq
allows us to sketch and compute similarities between collections of sequences in a single file;
in particular, this is useful for sequence files of protein sequences.
The default similarity/distance is k-mer set comparisons in sketch space; however, edit distance is perhaps more useful for protein sequences. Also, sketch sizes for proteins should be substantially smaller than for a full genome sequence.
dashing2 sketch -S256 --cmpout prot.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa
If --edit-distance
is enabled, OrderMinHash (Guillaume Marcais, et al.) is used to build an LSH table, and comparisons are generated between OrderMinHash sketches;
this is only allowed in parse-by-seq mode, as it is undefined on a collection of sequences as opposed to a single sequence.
dashing2 sketch --topk 25 --edit-distance --compute-edit-distance -S256 --cmpout prot.top25.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa
If either --refine-exact
and --compute-edit-distance
is enabled, then the final distances will be computed via edit distance.
This is particularly useful if the result is Jaccard or top-k thresholded, allowing linear-time top-k nearest neighbor lists.
Use 5: weighted sketching of feature sets
There's also dashing2 wsketch
, which can be used for hashing weighted sets for comparisons.
This is for the case where there are a set of integral identifiers for sketching.
See dashing2 wsketch --help
for usage and examples.
Use 6: Grouping sequences by edit distance
To use OrderMinHash, parsing must by by sequence. Further, --edit-distance tells Dashing2 to use edit distance LSH. --compute-edit-distance instructs dashing2 to compute edit distance between candidate neighbors instead of only comparing the sketches (the LSH register values themselves).
Use this to generate K-Nearest Neighbor graphs for edit distance.
dashing2 sketch -p8 --cmpout knn.edit-distance.tbl -k7 --parse-by-seq --edit-distance --compute-edit-distance input.fasta
Use 7: Generate KNN graph using exact K-mer distances but using LSH pre-filtering.
You can do this for exact k-mer sets:
dashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31 -S1024 --set --topk 25 -o input_sequence_set.topk.tsv
Or for exact weighted k-mer sets:
dashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31 -S1024 --countdict --topk 25 -o input_sequence_set.topk.tsv
For these approaches, Dashing2 will use a bottom-k LSH index to generate candidates.
You can instead generate a list of neighbors filtered by similarity threshold rather than choosing its top-k nearest neighbors.
Simply replace --topk <int>
with --similarity-threshold <float>
.
The easiest way to get started is to download a statically-linked binary in the dashing2-binaries repo. There are different folders for osx and linux, so select the latest version from the folder corresponding to your operating system. Dashing2 has not been tested on Windows, and we do not support it.
Alternatively, you can build from source with git clone --recursive https://github.com/dnbaker/dashing2 && cd dashing2 && make -j4
.
Dashing2 now requires C++20, and therefore needs a relatively recent compiler, but the binary will be smaller than the statically-linked options provided and the code may be more directly tailored to your architecture.
dashing2
uses 32-bit hash identifiers in LSH tables for speed and memory efficiency.
To use more than 4.3 billion, use dashing2-64
, which switched to 64-bit identifiers and hashes.The default version of Dashing2 is dashing2, which uses 32-bit LSH keys and ID types in its NN tables; this is faster and more memory-efficient, but less specific;
D2_CACHE_SIZE
environment variable.# set cache size to 64 MB
export D2_CACHE_SIZE=67108864
As an aside, these sketches are stored contiguously to reduce fragmentation compared to Dashing1.
For parsing binary output (--binary-output/--emit-binary), we provide Python code for parsing into NumPy/SciPy matrices. This can save on processing time, and it avoids differences due to formatting/parsing loss.