dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Processing each sequence in a file separately #17

Closed VGalata closed 4 years ago

VGalata commented 5 years ago

First of all, thank you for the great tool!

Being used to Mash, I currently miss the option to tell the tool to process the sequences in a FASTA file separately rather than belonging to one genome. I checked the available flags multiple times but could not find anything related to this functionality.

Thank you in advance!

dnbaker commented 5 years ago

We don't currently support processing each sequence within a file separately (outside of an undocumented feature that we plan to replace with a new subcommand).

Are you more interested in comparing a query against each chromosome of a reference (i.e., sketching each chromosome separately), or in comparing each chromosome against a set of references? Our current plan was to support the first kind of query, but we could work toward supporting either.

Thanks for the suggestion!

olgabot commented 5 years ago

Yes please! This would be very helpful. Currently to compare all cells from a 10x run or otherwise, it's best if I create separate fasta files for each cell, which can be 1000s of cells. Also, this is helpful when comparing metagenomes/transcriptomes where each "species" is its own record in a fasta. It'd be much easier to use if dashing could parse a fasta/fastq as multiple records per file.

VGalata commented 5 years ago

Dear @dnbaker,

I think both scenarios are interesting for the user. Currently, I am rather interested in comparing a sample against a set of other samples/references. As @olgabot says, depending on the analysis, one might have a FASTA with multiple genomic sequences and splitting it into separate files might be inconvenient.
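For anyone who needs the file-splitting workaround mentioned above in the meantime, here is a minimal sketch in Python. It is not part of dashing; the function name `split_fasta` and the output naming scheme are hypothetical, and it assumes an uncompressed FASTA whose record headers start with `>`.

```python
import re
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-record FASTA to its own file.

    Returns the list of per-record file paths, in input order.
    Hypothetical helper for illustration; not part of dashing.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    # The first whitespace-delimited token of the header is the record ID.
                    tokens = line[1:].split()
                    record_id = tokens[0] if tokens else "record"
                    # Replace characters that are unsafe in file names.
                    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
                    # A running index keeps file names unique even if IDs repeat.
                    path = out_dir / f"{len(paths):06d}_{safe_id}.fasta"
                    handle = open(path, "w")
                    paths.append(path)
                # Lines before the first header are skipped.
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

The resulting files can then be passed to dashing as separate inputs, one per record. With thousands of records this creates thousands of small files, which is exactly the inconvenience described in this thread.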

andreaswallberg commented 5 years ago

One more vote for this feature!

I am working on a de novo assembly project, which currently has many fragmented contigs. I'd like to use this feature to compute contig vs contig distances to help identify haplotigs, among many other things.

andreaswallberg commented 5 years ago

We are comparing a large and fragmented genome assembly against itself, for which we know that the assembly is larger than the genome size, which is likely due to high heterozygosity and an excess of haplotigs.

However, we are getting very reasonable genome size estimates out of dashing with the "-o" command, and it would be great if dashing could report contig-vs-contig values such that we could possibly identify contigs that might be redundant.

dnbaker commented 5 years ago

We're interested in supporting this feature, and I'm working on it in a separate branch. I don't expect it to be ready just yet, but it is in the works.

dnbaker commented 5 years ago

This is available in draft form now, in the by_seq branch. One sketches with:

dashing sketch_by_seq <options> -o output.bin input.fastq.gz

at which point output.bin contains the sketches and output.bin.names contains the sequence names.

One then runs:

dashing dist_by_seq -o output.tsv -n output.bin.names output.bin

to perform the distance calculation. It's a bit clunky and will be in flux as I implement the rest of our functionality, but it does work.

I've merged this into master, tentatively. It hasn't been exhaustively tested, but should support primary use cases.
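The two steps described in this comment can be chained from a script. The following is a minimal sketch in Python, assuming the `sketch_by_seq` and `dist_by_seq` subcommands and the `-o`/`-n` options behave exactly as described here and that a `dashing` binary is on the PATH. The helper names `by_seq_commands` and `run_by_seq` are hypothetical, and since the interface is described as being in flux, the arguments may need adjusting for other versions.

```python
import shlex
import subprocess


def by_seq_commands(reads, sketch_out="output.bin", dist_out="output.tsv",
                    extra_options=()):
    """Build the two command lines described in this thread.

    extra_options stands in for the <options> placeholder of the sketch step;
    no specific option is assumed here.
    """
    sketch_cmd = ["dashing", "sketch_by_seq", *extra_options,
                  "-o", sketch_out, reads]
    # The sketch step is described as writing sequence names to <sketch_out>.names.
    dist_cmd = ["dashing", "dist_by_seq", "-o", dist_out,
                "-n", sketch_out + ".names", sketch_out]
    return sketch_cmd, dist_cmd


def run_by_seq(reads, **kwargs):
    """Run the sketch step and then the distance step, stopping on the first failure."""
    for cmd in by_seq_commands(reads, **kwargs):
        print(shlex.join(cmd))  # echo the command before running it
        subprocess.run(cmd, check=True)
```

For example, `run_by_seq("input.fastq.gz")` would run the two commands with the file names used in this comment.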

dnbaker commented 4 years ago

I'm closing this for now, but feel free to reopen it if you have any further issues.