dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Querying with presketches: 'std::bad_alloc' #51

Open mihkelvaher opened 3 years ago

mihkelvaher commented 3 years ago

Hi!

I'm using the same references (-F) multiple times and thought it would be faster to sketch them once and use only the sketches for future queries.

Sketching: dashing sketch -F references.fasta_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash

Querying: dashing dist -F references.sketch_paths.txt -k 32 -p 2 --sizes --sketch-size 20 --use-bb-minhash -T -Q testdata_path.txt --presketched

Query error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Removing -Q testdata_path.txt --presketched outputs a square matrix without errors. Dashing version: v0.5-3-g03c10

Minor comments/questions:

  1. Is the missing "#Names" line intentional in the query output? I ask because it is present in the square matrix, and getting the order from the "sizes section/file" seems a bit odd.
  2. Is there any way to specify the sketch output dir/name? -o seems to put them all into a single file.
  3. dashing -v outputs the version twice, probably once as on every invocation and once in response to the flag.

dnbaker commented 3 years ago

Hi! Thanks for making the issue.

  1. For what you're trying to do, you want to use the --cache-sketches/-W option; this will cache a sketch adjacent to the input filename (e.g., something like input1.fq.s10.hll for input1.fq, input2.fq.s10.hll for input2.fq...). The --presketched-only option treats the filenames as binary files containing sketches. Enabling the option there meant that dashing was trying to load a binary HLL sketch from the input fastx files, which meant it didn't work.

  2. Sketching without -o puts the sketches adjacent to the fastx files (as in 1). You can also pass -P/--prefix, which prepends a prefix to the path where sketches are written; I've used this to put sketches into a specific folder.

tl;dr: If you want to use --presketched-only, sketch the files, create a file consisting of paths to the sketches, and then run your second command using that file.

If you just want to avoid re-sketching each time you run, use -W/--cache-sketches.
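The tl;dr workflow above can be sketched roughly as follows. This is only an illustration: make_sketch_list is a hypothetical helper, not part of dashing, and the sketch suffix is an assumption. Check which file names dashing sketch actually wrote next to your inputs and pass that suffix in. The dashing commands are shown as comments and reuse the flags from the original report.

```shell
# Hypothetical helper: turn a list of FASTA paths into a list of sketch paths
# by appending a suffix to every non-empty line.
#   $1 = file with one FASTA path per line
#   $2 = sketch suffix (ASSUMPTION: look up the real suffix on disk first)
make_sketch_list() {
    while IFS= read -r path || [ -n "$path" ]; do
        if [ -n "$path" ]; then
            printf '%s%s\n' "$path" "$2"
        fi
    done < "$1"
}

# Intended use (not executed here; flags copied from the thread):
#
#   1. Sketch once; sketches land next to the inputs:
#        dashing sketch -F references.fasta_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash
#   2. Build the list of sketch paths:
#        make_sketch_list references.fasta_paths.txt "$SUFFIX" > references.sketch_paths.txt
#   3. Query with sketches only; every path in the list must be a sketch file:
#        dashing dist -F references.sketch_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash --presketched
```

If the goal is only to avoid re-sketching, the --cache-sketches route needs none of this, since the original FASTA path list is passed unchanged.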

Minor comments:

  1. Names are emitted in both the -o output and the -O output. In the example run below, the -o file holds the names + cardinalities of the input sequences and the -O file holds the distance table, so I think you may just want to be setting the -O parameter. See below for an example run.
  2. See above RE: --prefix
  3. This is correct; having added the version to all invocations, the -v option does this twice. We'll remove this in the next release.

Feel free to ask if you have any more questions, and thanks!

Daniel

$ ./dashing dist -F fnames.txt -Q fnames.txt -o table -O sizes
Dashing version: v0.5-3-g8b24
$ cat table
#Path   Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2368528
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2368528
$ cat sizes
##Names bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  1.000000    0.000000    0.000000    0.000000
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   0.000000    1.000000    0.000000    0.000000
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  0.000000    0.000000    1.000000    0.550403
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   0.000000    0.000000    0.550403    1.000000
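For reading the column order out of a matrix like the one above, a small awk wrapper may help. This is a minimal sketch, assuming the layout shown in the example run (a "##Names" header line followed by one row per input, whitespace-separated, with no spaces inside paths). matrix_names and matrix_value are hypothetical helper names, not dashing commands.

```shell
# Print the column names of a distance matrix, one per line.
#   $1 = matrix file (the one starting with a "##Names" line)
matrix_names() {
    awk '/^##Names/ { for (i = 2; i <= NF; i++) print $i; exit }' "$1"
}

# Print the matrix entry for a given row name and column name.
#   $1 = matrix file, $2 = row name, $3 = column name
matrix_value() {
    awk -v row="$2" -v col="$3" '
        # Header: remember which field holds the requested column.
        /^##Names/ { for (i = 2; i <= NF; i++) if ($i == col) c = i; next }
        # Data row: field 1 is the row name, so the header index lines up.
        $1 == row && c { print $c; exit }
    ' "$1"
}
```

In the run above the matrix was written to the file named sizes, so matrix_names sizes would list the four genome paths in column order.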

mihkelvaher commented 3 years ago

Hi!

Presketching: Looking at the code and the help, the flag is --presketched and not --presketched-only (the name of the variable in the code)? Did I understand correctly that --presketched is meant to be used on the single file (also mentioned above) that is output with -o when running dashing sketch ... -o single_file_containing_many_sketches?

Caching: This seems to be the easiest solution at the moment. Trying it out, caching once and then removing the fastas works too. Also: the short -W doesn't seem to do anything, whereas the long --cache-sketches deposits the sketches into the fasta dir as expected.

Missing "#Names": They were missing because I was using -T while querying, which doesn't make sense, come to think of it. Removed it and all good.

All the best, Mihkel

dnbaker commented 3 years ago

Hi Mihkel,

Sorry for making you wait.

You're correct, it is --presketched, not --presketched-only. Presketched means that each input file contains exactly one sketch. The dist_by_seq command is for performing distance calculations from a single file containing a number of sketches, which could have been created by sketching with the -o parameter, or by sketching each sequence in a file separately with sketch_by_seq.

Unfortunately, what to do with sequence names and metadata for both approaches isn't intuitively obvious to me.

Thanks for the report, and good luck!

Daniel

mihkelvaher commented 3 years ago

Hi!

Maybe it's a WIP, but I noticed a commit (or two) mentioning -W and --cache-sketches. While the longer version works, the shorter one outputs:

Dashing version: v0.5-5-g5210
terminate called after throwing an instance of 'std::bad_alloc'
terminate called recursively
  what():  std::bad_alloc
Aborted

Mihkel

dnbaker commented 3 years ago

Thank you! I was trying to address the cache-sketches issue, but modified the wrong variable. It should be fixed now in both master and dev.

mihkelvaher commented 3 years ago

I can confirm that -W works now. Is the ##Names row intentionally removed in v0.5-5-g5210 when querying with -Q?