Closed olgabot closed 5 years ago
Thanks for the find. The extra tabs were a bug recently introduced while refactoring to support the PHYLIP upper triangular format. I've fixed this, along with the extra space in ##Names (though there was a tab as well; it shouldn't have broken field accessibility) as of https://github.com/dnbaker/dashing/commit/2ade3c4468c97ef2f5a4bcdc8586530bfd1dd927.
If you want the full matrix (without the -
characters), another option is running dashing with -b
enabled to emit in binary format, then running dashing printmat <output.bin> > full_matrix.tsv
.
Wow, this is timely. I've just lost a couple of hours to this. I'm also getting some decoding errors in the first line of output including some 0x8f
chars. Looking into 2ade3c4.
I'm still having problems with 2ade3c4 unfortunately on a collection of 473. In addition to the extra column, there are some unexpected characters prepended to each float on line 2 of the file.
Thanks for the workaround @olgabot :-)
Thanks for the check!
I'm a little surprised... I just tried with a set of 1000 synthetic sets, and checked the output manually, as well as with pandas.
In [5]: zomg = pandas.read_csv("dists.txt", sep='\t')
In [6]: zomg.head()
Out[6]:
##Names fake/703.fa fake/690.fa fake/691.fa fake/692.fa fake/693.fa fake/694.fa ... fake/401.fa fake/402.fa fake/403.fa fake/404.fa fake/405.fa fake/406.fa fake/407.fa
0 fake/703.fa - 0.000000 0.000000 0.000000 0.000000 0.023746 ... 0.000000 0.004951 0.008357 0.006508 0.000000 0.015152 0.000000
1 fake/690.fa - - 0.000000 0.011458 0.000000 0.000000 ... 0.013563 0.000000 0.002094 0.001062 0.000000 0.000000 0.000000
2 fake/691.fa - - - 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 fake/692.fa - - - - 0.031205 0.000000 ... 0.020808 0.000000 0.012381 0.000000 0.000000 0.013682 0.000000
4 fake/693.fa - - - - - 0.000000 ... 0.003485 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Looking for extra characters:
In [11]: ''.join(sorted(set(open("dists.txt").read())))
Out[11]: '\t\n#-./0123456789Naefkms'
("fake/fake[0-9]+.fa" and "##Names" are the source of the non-whitespace/dash/numeric characters, so this is as expected.)
Can you perhaps try building with https://github.com/dnbaker/dashing/commit/a09b133b2bf4dcdd4c4b9d373653f332016912a5 ? It shouldn't fix the problem, but maybe something was weird with git. In the meantime, I'll try to reproduce this on a different machine.
Regarding a full TSV, I've added a -T
flag to dashing dist
in the fulltsv branch. You'll need to specify -O{table_path}
(for the matrix), and the labels will be in a separate file called
{table_path}.labels`, but the matrix will be fully filled-in.
dashing dist -TOouttable.tsv *.fa
will then produce a matrix at outtable.tsv
consisting of all pairwise jaccards (including both upper and lower triangular portions) and a set of labels at outtable.tsv.labels
. This might be more convenient.
Hah, glad to hear my pain helped others.
@bede - you can use pd.read_table
directly, which alleviates the need to specify pd.read_csv(..., sep='\t')
@dnbaker thank you for the fix on the fulltsv branch! That is MUCH more convenient than saving to an unknown binary format and then reformatting. Would it be possible to create a "tidy" version as well? I routinely compare 10k+ samples, and reading those large square matrices with pandas often runs out of memory, but that "tidy"/database-style format would greatly reduce the friction in analysis.
Can you explain what you mean by “tidy”? A set of k/v pairs? I’ve found mostly philosophical commentary rather than a standard for a format. I’d honestly wrap distmat in pybind11; the whole point is a memory-efficient (non-redundant) way to provide access to all entries without parsing. Would a quick wrapper for that purpose satisfy your needs instead? Plus, a float-formatted output would cut the size in half.
Tidy data is a concept where each row in a table is a single observation, e.g. here where each row is a single similarity between cell_a
and cell_b
. It's very useful for doing a database-style join with metadata and aggregating statistics of similarities within- and across-celltype, which helps me evaluate how "correct" the similarity is.
Here's an example of reformatting a "wide" data matrix into a "tidy" version:
Thanks @dnbaker for persisting with this. I'll see if I can reproduce at the weekend. I noticed that ~100 samples worked fine but 470 did not. Will do some more digging with the latest branch and report back.
a pairwise 'tidy data' table makes sense for a large matrix although I've yet to need one. If dealing with huge matrices then perhaps vanilla Pandas isn't the most appropriate tool anyway, @olgabot.
Which flags should be set in order to use -T
in the fulltsv branch? I received an unexpected argument error without -b
, tried it and receive this failure at the writing step. Ubuntu 16, GCC 5.4
$ dashing dist -k 31 -p 5 -b -T an/dashing/kp_distances_k31.tsv -o an/dashing/kp_sizes_k31.txt data/kp_fastq/*.fq.trim.gz
terminate called after throwing an instance of 'std::runtime_error'
what(): [./bonsai/bonsai/include/encoder.h:void bns::Encoder<ScoreType>::for_each(const Functor&, const char*, kseq_t*) [with Functor = bns::dist_main(int, char**)::<lambda(bns::u64)>; ScoreType = bns::score::Lex]419] Could not open file at an/dashing/kp_distances_k31.tsv. Abort!
Does the folder an/dashing exist? It’s not able to open the file for writing.
Yes. The process succesfully created an empty kp_sizes_k31.txt
in the same directory.
Does the folder an/dashing exist? It’s not able to open the file for writing.
Please ignore now deleted comment from a few minutes ago. Can reproduce above problem with different data. There are also sketches present in this dir. I'm unsure how dashing handles fastqs with corresponding sketches by default, so thought I would mention it. Let me know if you'd like a tgz to help reproduce. Same Ubuntu 16 & GCC 5.4 system.
bede@ndm.local@mmm-analysis1:~/sinking$ res/dashing_fulltsv/dashing dist -k 31 -p 4 -b -T an/dashing/ec_distances_k31.tsv -o an/dashing/ec_sizes_k31.txt data/ec_fastq/1*.fq.trim.gz
terminate called after throwing an instance of 'std::runtime_error'
what(): [./bonsai/bonsai/include/encoder.h:void bns::Encoder<ScoreType>::for_each(const Functor&, const char*, kseq_t*) [with Functor = bns::dist_main(int, char**)::<lambda(bns::u64)>; ScoreType = bns::score::Lex]419] Could not open file at an/dashing/ec_distances_k31.tsv. Abort!
Aborted (core dumped)
bede@ndm.local@mmm-analysis1:~/sinking$ res/dashing_fulltsv/dashing dist -k 31 -p 4 -b -T an/dashing/ec_distances_k31.tsv -o an/dashing/ec_sizes_k31.txt data/ec_fastq/2*.fq.trim.gz
terminate called after throwing an instance of 'std::runtime_error'
what(): [./bonsai/bonsai/include/encoder.h:void bns::Encoder<ScoreType>::for_each(const Functor&, const char*, kseq_t*) [with Functor = bns::dist_main(int, char**)::<lambda(bns::u64)>; ScoreType = bns::score::Lex]419] Could not open file at an/dashing/ec_distances_k31.tsv. Abort!
Aborted (core dumped)
bede@ndm.local@mmm-analysis1:~/sinking$
Edit: I realized what the problem is. Dashing is trying to sketch from an/dashing/ec_distances_k31.tsv
as if it's a fastq file. -T
and -b
don't take any arguments. You need to specify -O an/dashing/ec_distances_k31.tsv
to write the table to that file; that's why it's the encoder, not the writer, that's throwing the error.
To answer your question on fastqs, dashing uses all kmers, including errors. These can be filtered for events that occur above a threshold in dashing sketch
-n{min_count}
with -N
(to detect by whether or not {.fq,.fastq}
is in the filename) or -B
to treat all files as fastq. Apparently these options weren't propagated to dist; you'd currently run the same dist command after sketching, but specify -W
to tell it to use the cached sketches from your sketching.
Thanks. I thought it might be down to misuse of arguments. Great to hear that there is count thresholding – one less preprocessing step.
Is this TSV stuff likely to make it into master, or shall I stick with fulltsv?
First I wanted to make sure it wasn’t broken! I’m glad to hear it’s working! Let me add gz support and labels, and I’ll merge it in.
Full TSV support, with labels and compressed output support as of https://github.com/dnbaker/dashing/commit/2a49601ed57d80278d2ce715c6dac4abe749f909. (Use -z flag [does not consume an argument] to specify writing to a compressed file.) It writes zstd-compressed if you ask for it.
@olgabot, mind if I close this?
I'm closing for now, but feel free to reopen it if there are further issues.
late to the party! yes thank you for fixing
Thank you for making a valuable request! I think the final tsv output was a good change, and I appreciate you finding the extra column, which unacceptably broke downstream parsing.
Thanks from me also : )
Hello, When comparing 4 samples, the string output by
dashing dist
has 5 columns:##Names
(with a space!! in a tab-formatted file!! arhghghghgh) and the 4 sample names, which allows for easy parsing with pandas + StringIO:However, when comparing 19 samples, somehow an extra column gets added to the output so
##Names
is a column with all empty distances (-
) and can be readily dropped, but the other columns are informative.For these comparisons, I use a symmetric matrix and while it is redundant, it helps to clean up downstream calculations. To fully clean the matrix into a usable format for myself, this is my code:
And here's the reformatted dataframe:
To (hopefully) alleviate formatting issues, could it be possible to output a readily-formatted csv of the comparisons?
Since the pairwise distance matrices can get enormous, it could also be an option to output a "tidy" formatted table of the just the (n-1)^2/2 comparisons, instead of the whole n^2 matrix. For example, using
dashing_similarities
from above, one can reformat to a tidy data table ofcell_a
,cell_b
and the similarity this way:(This particular dataframe still includes the (n-1)^2 comparisons, but doesn't have to)
What do you think? Warmest, Olga