dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
161 stars 11 forks source link

Output file name when empty sketch error #53

Open mihkelvaher opened 3 years ago

mihkelvaher commented 3 years ago

Hi!

A small request to help with debugging.

When a file does not exist when using -F/-Q Dashing outputs the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not densify empty sketch

It would be very useful to know which file caused the error. I'm currently facing a situation of finding the bad file among thousands of files.

Edit: Turned out for this case, assuming that the whole dir consists of fastas and using find "$FASTA_DIR_PATH" -type f > $FASTA_PATHS, was not a good idea while using --cache-sketches because both fastas and sketches got written into the -F input file and Dashing was trying to create a sketch from a sketch file which produced an empty sketch.

All the best, Mihkel

dnbaker commented 3 years ago

Hi Mihkel,

Thanks for the request! I think that's absolutely useful. I've pushed a patch to this branch (https://github.com/dnbaker/dashing/tree/dependency), which should emit the error message. However, the error is being caused by the fact that my sketch library throws an error when 'densifying' an empty k-partition minhash sketch, which I have also updated here (https://github.com/dnbaker/sketch/pull/19). Instead, it will fill the sketch with sentinel values, which will cause empty sketches to compare equal and non-empty sketches to compare fully dissimilar to empty ones. I think that's a more graceful way to handle the edge case.

Thanks again,

Daniel

mihkelvaher commented 3 years ago

Hi!

After trying out the latest version and mulling it over in the light of the issue in the beginning, I have to disagree that this is the best approach in terms of UX.

I could come up with two cases where empty sketches could appear: 1) All of the sequences in the fastx file are < k 2) Provided sequences are not valid fastx files.

For the first case, empty sketches are fine because there simply isn't a way to get any k-mers. For the second, creating empty sketches while the input file doesn't even contain sequences leads to headaches downstream with files like README.md, assembly_summary.txt etc.

I don't know whether it would be better to exit with an error or simply ignore the invalid files. The former makes the work process slower but more accurate, the latter is more useful when you've been days in the HPC queue and it doesn't break immediately. Currently the breaking occurs when the file does not exist.

Some possibly confusing results I noticed while testing with 4 files and --sizes. Two genomes, one previously cached sketch and "nonsense" containing just asd:

#Path   Size (est.)
E33L.fna    5736414
ATCC.fna    5326886
ATCC.fna.w.32.spacing.19.bmh    524288
nonsense    524288
##Names E33L.fna    ATCC.fna    ATCC.fna.w.32.spacing.19.bmh    nonsense
E33L.fna    -   1.35523e+06 0   0
ATCC.fna    -   -   0   0
ATCC.fna.w.32.spacing.19.bmh    -   -   -   524288
nonsense    -   -   -   -

The sketch and nonsense both have the same size and intersection shows they're identical. In terms of couldn't-get-any-k-mers they're identical but how is the size calculated?

Again, this is a UX question while everything works so things could be left as they are.

Some other comments: 1) It would be nice to have the file name in the warning Could not densify empty sketch; setting all minimizers to empty value. It will compare completely equal to all empty sketches and full dissimilar to all others. Currently it is printed for each empty sketch. 2) Using --cache-sketches the warnings are not displayed when using previously already made cached sketched, but this is to be expected.

Sorry for the long post but I hope it's worth something.

Edit: https://github.com/dnbaker/dashing/tree/dependency is a 404.

All the best, Mihkel

dnbaker commented 3 years ago

https://github.com/dnbaker/dashing/commit/5210548e6ba96deb21963e7e2b3df735a1542e1a

I merged that branch into master, sorry. Give that a shot?

And yes, the cardinality will be garbage due to what the default (un-filled) value is. I think I should add some documentation about the behavior, but I do think that's better than throwing an error.

dnbaker commented 3 years ago

Following up on this, I've changed the behavior to compute a cardinality estimate of 0 for an empty sketch, and incorporated this into the dev branch.

Does this help solve your issue? I'll work on preparing a release with this raft of changes.