dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars, 11 forks

Produce single sketch from multiple input files #26

Status: Closed (bede closed 4 years ago)

bede commented 5 years ago

I'd like to produce a single sketch from multiple input files – forward and reverse reads in this case. Is this possible with Dashing?

Mash lets me do it like this:

mash sketch -m 5 -I ec_r12 -o ec_r12_sketch ec_1.fq.gz ec_2.fq.gz

Else by piping to stdin:

cat ec_1.fq.gz ec_2.fq.gz | mash sketch -m 5 -I ec_r12 -o ec_r12_sketch -

dnbaker commented 5 years ago

Currently, this is only supported through the union subcommand, but we are working to support multiple files per sketch out of the box. (One would need to run ‘dashing sketch’ on both FASTQ files and then ‘dashing union in1.fq.hll in2.fq.hll -o in.hll’.)
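
For readers following along, the two-step workaround described here might look like the sketch below. It is an illustration only: the input names are taken from the original question, the output name `ec_r12.hll` is hypothetical, and the sketch file names simply append `.hll` in the style of the `in1.fq.hll` example, which may not match what dashing actually writes. The script prints the commands by default and executes them only when `RUN=1` is set.

```shell
#!/bin/sh
# Two-step workaround: sketch each read file, then merge the sketches.
# ASSUMPTION: sketch file names follow the in1.fq.hll pattern from the
# comment above; check what dashing actually writes before running.
r1=ec_1.fq.gz
r2=ec_2.fq.gz
out=ec_r12.hll   # hypothetical output name

# Print each command; execute it only when RUN=1 is set.
run() {
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# Step 1: sketch both FASTQ files (one sketch per input).
run dashing sketch "$r1" "$r2"

# Step 2: merge the two sketches into a single one.
run dashing union "$r1.hll" "$r2.hll" -o "$out"
```

Running it as-is only echoes the two commands, which makes it easy to correct the sketch file names before running anything for real.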

Thanks for the request! I’ll prioritize this more, as it’s not the first time it’s been mentioned. I’m traveling currently, but hope to have time in the next week.

bede commented 5 years ago

Ah, I'd overlooked the union subcommand, thanks : )

But yes, sketching from stdin would be a very useful addition.

dnbaker commented 5 years ago

Hi!

I know this took a while. However, I believe I have this supported in the pairs branch which, after some large-scale testing, I expect to merge into master relatively soon.

To indicate that multiple files are sources for one sketch, list any number of filenames on the same line of the file passed to the -F flag.

Specifying destination files will take a little more work, but that is also in the works.

dnbaker commented 4 years ago

Currently, this is supported by using the -F/-Q parameters to pass in a file of paths and listing multiple filenames per line, separated by spaces. I'm closing for now, but feel free to reopen if you have any further issues.
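
As a rough illustration of the approach described in this last comment, the sketch below builds a list file in which the forward and reverse reads share one line, then passes it to `dashing sketch -F`. The list file name `reads.txt` is hypothetical and the read file names come from the original question. The dashing call is guarded so it only runs when dashing is installed and the inputs exist.

```shell
#!/bin/sh
# Build a file list in which each line names the inputs for ONE sketch.
list=reads.txt   # hypothetical file name

# Forward and reverse reads on the same line, separated by a space,
# so both files are sources for a single sketch.
printf '%s %s\n' ec_1.fq.gz ec_2.fq.gz > "$list"

# A second sample would go on its own line, e.g.:
# printf '%s %s\n' other_1.fq.gz other_2.fq.gz >> "$list"

cat "$list"

# Pass the list with -F (the flag named in the thread). Guarded so the
# script is harmless when dashing or the read files are absent.
if command -v dashing >/dev/null 2>&1 && [ -f ec_1.fq.gz ] && [ -f ec_2.fq.gz ]; then
    dashing sketch -F "$list"
fi
```

Each additional line in the list file would describe another sketch, so paired-end samples can be batched in a single invocation.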