dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

--outprefix option and cmp subcommand are invalid #68

Open XiaomingXu1995 opened 1 year ago

XiaomingXu1995 commented 1 year ago

Hi Daniel, I ran dashing2 with "./dashing2_savx2 sketch -F bacteria.list -S 1024 --threads 48 -o bacteria.sketch" to generate the sketches, and bacteria.sketch and bacteria.sketch.names.txt were produced. The cached sketch files are saved next to the input files. I tried to redirect them to another directory with the "--outprefix" or "--prefix" option, but neither works, so the directory holding the original input genomes becomes cluttered. Even without the "--cache" option, the cached files still end up in the input genome directory. How can I disable the cached files?

In addition, I want to use the cmp or dist subcommand to compute all-vs-all pairwise distances from bacteria.sketch, but "./dashing2_savx2 dist --help" does not print any help for the subcommand, so I do not know how to use it.

Best, Xiaoming

dnbaker commented 1 year ago

Hi Xiaoming -

Thank you for this issue! I'm looking into it. I really appreciate the feedback and I'll let you know when it's fixed up.

Best wishes,

Daniel

dnbaker commented 1 year ago

Checking back in on this - I seem to have fixed this issue on my machine and have updated the main branch accordingly. (See the linked PR.)

You could build from source now; I'm also working on updating binaries, but that won't be done until tomorrow. Please let me know how this works for you.

Thanks again!

Daniel

XiaomingXu1995 commented 1 year ago

Thanks for your update.

I built the latest source and found that the "--cache" and "--prefix" options now work. The cached sketch files are no longer placed next to the genome files. When I run ./dashing2 sketch -F bacteria.list -S 1024 -p 48 -o bacteria.sketch --prefix bacteria_sketch --cache, the output sketch files are bacteria.sketch and bacteria.sketch.names.txt, and the cached files are stored in the bacteria_sketch directory.

However, there are new problems when computing distances. I tried to compute the all-vs-all pairwise distances of these genomes from the pre-sketched file (bacteria.sketch) with ./dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 48, but it failed with this error log:

 Don't have permission to map.
: Invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid argument
Aborted (core dumped)

In addition, bacteria.sketch was overwritten, so I had to regenerate the sketch file. When regenerating it, I did not know how to reuse the cached sketch files in the bacteria_sketch directory.
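
Until a fixed build is installed, one defensive workaround is to copy the sketch file before handing it to a command that might clobber it, and to restore it if it changed size. The following is a minimal sketch under that assumption; run_with_backup and the .bak suffix are hypothetical names, and the demo substitutes a small Python one-liner for the real dashing2 call.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_with_backup(sketch_path, cmd):
    """Run cmd; restore sketch_path from a copy if it vanished or changed size.

    Returns True if a restore was needed. The size comparison is a crude
    check and would miss same-size corruption.
    """
    backup = sketch_path + ".bak"  # hypothetical backup naming
    shutil.copy2(sketch_path, backup)
    original_size = os.path.getsize(backup)
    subprocess.run(cmd, check=False)  # the command may fail; we still inspect the file
    damaged = (not os.path.exists(sketch_path)
               or os.path.getsize(sketch_path) != original_size)
    if damaged:
        shutil.copy2(backup, sketch_path)
    return damaged


# Demo: a stand-in command that truncates the file it is given.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sketch")
    with open(demo_path, "wb") as fh:
        fh.write(b"sketch-bytes")
    clobber = [sys.executable, "-c",
               "import sys; open(sys.argv[1], 'wb').close()", demo_path]
    was_restored = run_with_backup(demo_path, clobber)
    with open(demo_path, "rb") as fh:
        restored_bytes = fh.read()
```

With dashing2 itself, cmd would be the argument list for the command above, for example ["./dashing2", "cmp", "--cmpout", "bacteria.dist", "--presketched", "bacteria.sketch", "-p", "48"].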

So this raises two new questions:

Thank you very much! Best, Xiaoming

dnbaker commented 1 year ago

Hi Xiaoming,

You found another bug. Thank you! There's a large feature-set (lots of ways to run computation), and we were erasing the sketched data because I was opening the file in the wrong mode to load the data for analysis. The code path I've primarily used with this concatenated sketch file was calling cmp directly with -o enabled, in order to mmap the large set of sketches for large runs. I should have made sure this was tested.
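
As an illustration of this class of bug (not the dashing2 code itself, which is C++), the Python sketch below shows how opening an existing file in a truncating write mode discards its contents before anything is read, while a read-only mmap leaves the file intact. The file name demo.sketch and both helper functions are hypothetical.

```python
import mmap
import os
import tempfile


def load_readonly(path):
    """Map an existing file read-only and return a copy of its bytes."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[:]


def load_with_wrong_mode(path):
    """Buggy variant: 'wb+' truncates the file to zero bytes on open."""
    with open(path, "wb+") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.sketch")
    payload = b"\x01\x02\x03\x04" * 4
    with open(path, "wb") as fh:
        fh.write(payload)

    good = load_readonly(path)                # file is untouched
    size_after_good = os.path.getsize(path)

    lost = load_with_wrong_mode(path)         # file is now empty
    size_after_bug = os.path.getsize(path)
```

The read-only path returns the original 16 bytes and leaves the file size unchanged; the write-mode path returns nothing and leaves an empty file behind, which matches the symptom of a sketch file that has to be regenerated.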

Your command is correct! I'm able to run that command exactly after fixing this bug.

dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 4

So:

For point (2), we don't have a command for it, but I just added a python function which does this. In python/parse.py, there's a new function convert_sketches_to_packed_sketch.

You call it like:

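# Import the helper added to python/parse.py (assumes the repository root is on the import path).
from python.parse import convert_sketches_to_packed_sketch
import glob
# Collect the cached sketch files from the --prefix directory.
paths = list(glob.iglob("bacteria_sketch/*ss"))
# Convert them into a single packed sketch file.
individual_sketches = convert_sketches_to_packed_sketch(paths, "packed_bacteria.sketch")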

Now, it might be faster to just re-sketch, but that's an option.

Here's the PR: https://github.com/dnbaker/dashing2/pull/70.

Thanks again! I really appreciate it. As before, I've updated the main branch, but new binaries won't be available until tonight or tomorrow.

Best,

Daniel

XiaomingXu1995 commented 1 year ago

Hi Daniel, thanks for the timely update! I ran dashing2 cmp --cmpout bacteria.dist --presketched bacteria.sketch -p 4, but the output distance file bacteria.dist is in a binary format, not human-readable. Do you see the same problem?

Best, Xiaoming

dnbaker commented 1 year ago

Hi Xiaoming,

Thanks again! I can reproduce it, but only with some builds, which is confusing to me.

The version of dashing2 I've built for my laptop works, but I only got normal text output after removing -flto from the build command.
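
A quick way to tell which kind of output a given build produced is to probe the first bytes of the distance file. The heuristic below is only a sketch under stated assumptions (looks_like_text is a hypothetical helper, and the NUL-byte test is a rule of thumb), not part of dashing2.

```python
import codecs
import os
import struct
import tempfile


def looks_like_text(path, probe_bytes=4096):
    """Heuristic: True if the probed prefix has no NUL byte and is valid UTF-8."""
    with open(path, "rb") as fh:
        chunk = fh.read(probe_bytes)
    if b"\x00" in chunk:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the probe boundary
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text.dist")
    with open(text_path, "w", encoding="utf-8") as fh:
        fh.write("a\tb\n0.1\t0.2\n")

    binary_path = os.path.join(tmp, "binary.dist")
    with open(binary_path, "wb") as fh:
        fh.write(struct.pack("<4f", 0.0, 0.5, 0.25, 1.0))  # raw float32 values

    text_result = looks_like_text(text_path)
    binary_result = looks_like_text(binary_path)
```

Packed floating-point values almost always contain NUL bytes, so the check separates the two cases in practice, but it cannot say anything about whether the distances themselves are correct.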

I've made changes in this PR:

https://github.com/dnbaker/dashing2/pull/72

I'm adding some new binaries; could you please give it another shot? Here are the OSX binaries, and I'll get the Linux ones up later today. (https://github.com/dnbaker/dashing2-binaries/tree/main/osx/v2.1.14) I think I'll wait until you confirm that it's fixed before merging, in case there are more issues.

Thanks!

Daniel

XiaomingXu1995 commented 1 year ago

Hi Daniel, Thanks for your update!

I can get human-readable results by compiling without the -flto option. Computing distances directly from genome files and from pre-generated sketches both work well.

I have also tested the v2.1.14 binaries: dashing2_savx2 on an AMD workstation and dashing2_s512bw on an Intel workstation. Both work well.

Thank you for your work again! Best, Xiaoming