dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Output tsv from `setdist` (`setdist -T` isn't working) #18

Closed. olgabot closed this issue 5 years ago.

olgabot commented 5 years ago

Hello!

Thanks again for adding the output TSV feature for `dist`. The same command doesn't seem to be working for me for `setdist`. Is it possible to add the same feature?

Thank you!

Warmest,
Olga

`dist -T` works fine

(kmer-hashing)
 ✘  Wed 13 Feb - 08:08  ~/rcfiles   origin ☊ master ✔ 
  time /home/olga/code/dashing/dashing dist \
    -T -b -O \
    /home/olga/code/kmer-hashing/data/100_test_dashing/dashing_dist_k21_sketch10.tsv \
    -k 21 \
    /mnt/pureScratch/olga/dashing-test/catted_reads/*.fastq.gz
#Path   Size (est.)
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA000577-3_8_M-1-1.fastq.gz        45941141.617559
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA100039-3_11_M-1-1.fastq.gz        28812026.843665
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-B000971-3_39_F-1-1.fastq.gz 17248371.587632
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-D041914-3_8_M-1-1.fastq.gz  19983749.392062
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-B001717-3_38_F-1-1.fastq.gz 16453228.911596
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA000487-3_10_M-1-1.fastq.gz        26391118.812770
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-B002775-3_39_F-1-1.fastq.gz 13471740.831427
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-D041914-3_8_M-1-1.fastq.gz  18346446.763078
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-D041914-3_8_M-1-1.fastq.gz  17065167.466045
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA100041-3_9_M-1-1.fastq.gz        17565493.832883
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA100140-3_57_F-1-1.fastq.gz       8458176.796920
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-MAA000559-3_8_M-1-1.fastq.gz        13580440.404920
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-D042253-3_9_M-1-1.fastq.gz  18532120.609711
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA000559-3_8_M-1-1.fastq.gz        11405142.776925
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-MAA000559-3_8_M-1-1.fastq.gz        12071181.284648
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-B002764-3_38_F-1-1.fastq.gz  4868938.990295
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-D042253-3_9_M-1-1.fastq.gz   12064152.714943
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA000779-3_11_M-1-1.fastq.gz        6605897.783324
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-MAA000508-3_9_M-1-1.fastq.gz        6819225.271006
/home/olga/code/dashing/dashing dist -T -b -O  -k 21   106.14s user 1.10s system 99% cpu 1:47.60 total

`setdist -T` errors out

 Tue 12 Feb - 18:53  ~/rcfiles   origin ☊ master ✔ 
  time /home/olga/code/dashing/dashing setdist \
    -T -b -O \
    /home/olga/code/kmer-hashing/data/100_test_dashing/dashing_setdist_k21_sketch10.tsv \
    -k 21 \
    /mnt/pureScratch/olga/dashing-test/catted_reads/*.fastq.gz
setdist: invalid option -- 'T'
Usage: setdist <opts> [genomes if not provided from a file with -F]
Flags:
-h/-?   Usage
-k      Set kmer size [31]
-W      Cache sketches/use cached sketches
-p      Set number of threads [1]
-b      Emit distances in binary (default: human-readable, upper-triangular)
-U      Emit distances in PHYLIP upper triangular format(default: human-readable, upper-triangular)
-s      add a spacer of the format <int>x<int>,<int>x<int>,..., where the first integer corresponds to the space between bases repeated the second integer number of times
-w      Set window size [max(size of spaced kmer, [parameter])]
-S      Set sketch size [10, for 2**10 bytes each]
-H      Treat provided paths as pre-made sketches.
-C      Do not canonicalize. [Default: canonicalize]
-P      Set prefix for sketch file locations [empty]
-x      Set suffix in sketch file names [empty]
-o      Output for genome size estimates [stdout]
-I      Use Ertl's Improved Estimator
-E      Use Ertl's Original Estimator
-J      Use Ertl's JMLE Estimator [default      Uses Ertl-MLE]
-O      Output for genome distance matrix [stdout]
-L      Clamp estimates below expected variance to 0. [Default: do not clamp]
-e      Emit in scientific notation
-f      Report results as float. (Only important for binary format.) This halves the memory footprint at the cost of precision loss.
-F      Get paths to genomes from file rather than positional arguments
-M      Emit Mash distance (default: jaccard index)
-T      postprocess binary format to human-readable TSV (not upper triangular)
-Z      Emit genome sizes (default: jaccard index)
-N      Autodetect fastq or fasta data by filename (.fq or .fastq within filename).
-y      Filter all input data by count-min sketch.
-q      Set count-min number of hashes. Default: [4]
-c      Set minimum count for kmers to pass count-min filtering.
-t      Set count-min sketch size (log2). Default: ceil(log2(max_filesize)) + 2
-R      Set seed for seeds for count-min sketches
/home/olga/code/dashing/dashing setdist -T -b -O  -k 21   0.00s user 0.00s system 0% cpu 0.004 total

dnbaker commented 5 years ago

Hi Olga,

Thanks for bringing this to my attention!

I've provided support for the -T option for setdist as of https://github.com/dnbaker/dashing/commit/b6503e786091e4a1e344356c21aa5b5df5a99c40.

I've been lazy about updating setdist's options, as I expected it to be primarily used by me for experiments because of the computational cost, but I really need to bring it to feature parity with the dist subcommand.

Thanks!
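
For readers following along, the help text above describes `-T` as postprocessing the output into a human-readable TSV that is "not upper triangular". A minimal Python sketch of that kind of expansion is below. It is not dashing's implementation and does not read dashing's binary format: the function name `upper_to_square_tsv`, the `name` header label, the row-major flat input layout, and the diagonal value of 1.0 (a Jaccard index of a sample with itself) are all assumptions made for illustration.

```python
import io


def upper_to_square_tsv(names, upper, diagonal=1.0):
    """Expand a strict upper triangle into a full symmetric TSV string.

    `upper` is assumed to be a flat, row-major list of the n*(n-1)/2
    pairwise values: (0,1), (0,2), ..., (0,n-1), (1,2), ...
    This layout is an assumption for the sketch, not dashing's format.
    """
    n = len(names)
    expected = n * (n - 1) // 2
    if len(upper) != expected:
        raise ValueError(f"expected {expected} values, got {len(upper)}")

    # Start with the diagonal value everywhere, then overwrite off-diagonals.
    matrix = [[diagonal] * n for _ in range(n)]
    idx = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Mirror each value across the diagonal.
            matrix[i][j] = matrix[j][i] = upper[idx]
            idx += 1

    out = io.StringIO()
    # Header row: a hypothetical "name" label followed by the sample names.
    out.write("\t".join(["name"] + list(names)) + "\n")
    for name, row in zip(names, matrix):
        out.write("\t".join([name] + [f"{v:.6f}" for v in row]) + "\n")
    return out.getvalue()


if __name__ == "__main__":
    print(upper_to_square_tsv(["a", "b", "c"], [0.5, 0.25, 0.125]), end="")
```

A full square matrix is redundant, but it loads directly into tools that expect one row and one column per sample.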

olgabot commented 5 years ago

You're welcome! I'm doing my own benchmarking with sourmash, so having the true distance is super helpful for me.
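
As an illustration of what a "true distance" baseline can look like, here is a small Python sketch that computes an exact Jaccard index over canonical k-mer sets, with no sketching involved. It reads "true distance" as the exact set-based Jaccard index, which is an assumption. The helper names `canonical_kmers` and `jaccard` are hypothetical, and the canonicalization used here (the lexicographic minimum of a k-mer and its reverse complement) may not match what dashing or sourmash do internally.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Return the set of canonical k-mers in `seq`.

    A k-mer is canonicalized as min(kmer, reverse_complement(kmer)).
    K-mers containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip ambiguous bases such as N
        rc = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, rc))
    return kmers


def jaccard(a, b):
    """Exact Jaccard index |A & B| / |A | B| of two sets.

    Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    x = canonical_kmers("ACGTACGTTGCA", k=4)
    y = canonical_kmers("ACGTACGTTGGA", k=4)
    print(f"{jaccard(x, y):.4f}")
```

Holding every k-mer in memory scales with the input size, which is the cost that sketching methods avoid.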

olgabot commented 5 years ago

Always glad to be the "canary in the coal mine" :)

dnbaker commented 5 years ago

A new feature I added in that commit seems to have introduced a bug, so I rolled it back in the latest commit. And I'm happy to help.

dnbaker commented 5 years ago

It's all been fixed. Phew.