dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

Add fasta-format de-duplication in --parse-by-seq mode #37

Closed. dnbaker closed this issue 2 years ago

dnbaker commented 2 years ago

To convert a large sequence file into a de-duplicated FASTA file, you can now append F to the value passed to --greedy.

For example, --parse-by-seq --greedy 0.8F de-duplicates the sequences in the input file at an 80% similarity threshold and emits a FASTA file as output. Omitting the F (--greedy 0.8) emits only the names of the retained sequences.
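
To make the two output modes concrete, here is a minimal Python sketch of greedy de-duplication. It is an illustration only, not Dashing 2's implementation: it compares exact k-mer sets with Jaccard similarity rather than sketches, and the function names (parse_greedy_arg, greedy_dedup, format_output), the k=4 default, and the rule that a sequence is dropped when its similarity to a kept sequence is at or above the threshold are all assumptions made for the example.

```python
def parse_greedy_arg(arg):
    """Split a value such as '0.8F' into (threshold, emit_fasta).

    Hypothetical helper: mimics the trailing-F convention described above.
    """
    emit_fasta = arg.endswith("F")
    threshold = float(arg[:-1] if emit_fasta else arg)
    return threshold, emit_fasta


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_dedup(records, threshold, k=4):
    """Keep a sequence only if it is below the threshold against every kept one.

    Assumption: similarity >= threshold marks a duplicate. Exact k-mer sets
    stand in for the sketches a real tool would compare.
    """
    kept = []
    for name, seq in records:
        ks = kmers(seq, k)
        if all(jaccard(ks, rep_ks) < threshold for _, _, rep_ks in kept):
            kept.append((name, seq, ks))
    return [(name, seq) for name, seq, _ in kept]


def format_output(kept, emit_fasta):
    """Emit FASTA records when emit_fasta is set, otherwise names only."""
    if emit_fasta:
        return "".join(f">{name}\n{seq}\n" for name, seq in kept)
    return "".join(f"{name}\n" for name, _ in kept)


example = ">seqA\nACGTACGTAC\n>seqB\nACGTACGTAC\n>seqC\nTTTTGGGGCC\n"
threshold, emit_fasta = parse_greedy_arg("0.8F")
kept = greedy_dedup(read_fasta(example), threshold)
print(format_output(kept, emit_fasta), end="")
```

In this toy input, seqB is identical to seqA and is dropped, while seqC shares no 4-mers with seqA and is kept. Passing "0.8" instead of "0.8F" to parse_greedy_arg switches the output to names only.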