Closed: mortonjt closed this issue 3 years ago
Why not just do:
for part, tab in table.partition(bin_f):
counts = pd.Series(tab.sum('sample'), index=tab.ids('sample'))
or maybe I'm misunderstanding the need?
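For concreteness, here is a minimal, self-contained sketch of that partition-based approach; the feature ids, sample ids, and the obs_to_genome mapping below are invented for illustration:

import numpy as np
import pandas as pd
from biom import Table

# toy 3-feature x 3-sample table standing in for the real deblur table
table = Table(np.array([[1, 0, 3], [2, 4, 0], [0, 1, 5]]),
              ['featA', 'featB', 'featC'], ['s1', 's2', 's3'])
obs_to_genome = {'featA': 'G1', 'featB': 'G1', 'featC': 'G2'}

def bin_f(id_, md):
    # partition observations (features) by the genome they map to
    return obs_to_genome[id_]

for genome, sub in table.partition(bin_f, axis='observation'):
    # per-sample totals over all features assigned to this genome
    counts = pd.Series(sub.sum('sample'), index=sub.ids('sample'))
    print(genome, counts.to_dict())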
What size table is this and is collapse still slow even without the sort?
Sorry, I'm a little confused about the partition function ... The groupby_sum function is closer to what is done with taxonomy collapsing.
The biom table I'm working with is 172505 x 10491 (i.e. 170k deblurred features and 10k samples).
Of those 170k reads, 48k mapped to 1863 reference genomes.
I want to map all of the deblurred genes to reference genomes, and collapse the deblurred counts for all reads that map to the same genome.
Okay thanks. Is it the sort that's slow? That table isn't particularly large. What version of biom is being used? We do these collapses of a similar scale regularly for microsetta, so something is acting unexpectedly here.
This is biom version 2.1.10. I'll get timings for the sort in a bit, but note that the sort is across 1863 genome ids, so I doubt that would take >4 hours ...
Sure, but the collapse function uses native scipy.sparse bulk matrix methods so it also shouldn't be slow :) Does the table have metadata? And what happens if you use a dict lookup rather than pandas .loc?
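To illustrate the .loc-versus-dict suggestion (the toy table and mapping names here are made up), the two collapse calls below differ only in how the genome label is looked up inside the callback:

import numpy as np
import pandas as pd
from biom import Table

table = Table(np.array([[1, 0, 3], [2, 4, 0], [0, 1, 5]]),
              ['featA', 'featB', 'featC'], ['s1', 's2', 's3'])
mapping_df = pd.DataFrame({'genome': ['G1', 'G1', 'G2']},
                          index=['featA', 'featB', 'featC'])

# slow pattern: a pandas .loc lookup for every observation id
slow = table.collapse(lambda id_, md: mapping_df.loc[id_, 'genome'],
                      axis='observation', norm=False)

# suggested alternative: convert to a plain dict once, then do O(1) lookups
genome_of = mapping_df['genome'].to_dict()
fast = table.collapse(lambda id_, md: genome_of[id_],
                      axis='observation', norm=False)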
bam! That plus removing the sort function seems to have resolved the problem -- it now runs in seconds. Thanks for the feedback!
Interesting. Would it be possible to check whether it was the sort or the dict lookup? If it was the former, then that is a no-no and would be good to address.
I think the sort function actually could be problematic, since removing it (while keeping the .loc lookup) also completed in seconds.
Any chance you could edit your local library and add self._data = self._data.tocsr() right above this line (https://github.com/biocore/biom-format/blob/master/biom/table.py#L2182)? And then try it again with self._data = self._data.tocsc()?
What I wonder is whether the fancy indexing is implicitly triggering a structure change within scipy -- basically a switch between row order and column order.
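As a small standalone scipy illustration of that concern (the sizes here are arbitrary, and this is not biom code): fancy row indexing is cheap when the matrix is already CSR, but noticeably more expensive on a column-ordered CSC layout:

import numpy as np
from scipy import sparse

m = sparse.random(50_000, 5_000, density=0.001, format='csr', random_state=42)
rows = np.random.default_rng(0).permutation(m.shape[0])

csr_rows = m[rows]          # row fancy indexing on CSR: row-major layout, cheap
csc_rows = m.tocsc()[rows]  # the same indexing on CSC works, but the
                            # column-major layout makes it much more expensive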
I've been performing quite a number of collapse operations, and am noticing that it can be extremely slow. Namely, I'm trying to map deblurred reads to reference genomes and that process takes several hours (in fact it's still running); see the code below for an example.
Ultimately, all I want to do is sum up all of the reads that hit the same genome.
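(Roughly, the pattern was something like the following sketch; the file names, column name, and exact collapse arguments are guesses reconstructed from the rest of the thread rather than the original code.)

import pandas as pd
from biom import load_table

table = load_table('deblur.biom')                      # hypothetical path
mapping = pd.read_csv('read_to_genome.tsv', sep='\t',  # hypothetical path
                      index_col=0)                     # feature id -> genome

genome_table = table.collapse(
    lambda id_, md: mapping.loc[id_, 'genome'],        # pandas .loc per feature id
    axis='observation', norm=False)
genome_table = genome_table.sort(axis='observation')   # sort by genome id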
Because this collapse function was taking too long, I've rigged together a groupby_sum function that speeds this up via sparse matrix multiplication. This function collapses to genomes within seconds, so it is at least hundreds of times faster than the current collapse function. Would there be interest in incorporating this into biom? Of course, this is a bit more specialized than collapse, so it may warrant its own function.
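For reference, a minimal sketch of the indicator-matrix idea behind such a groupby_sum (this is an illustration rather than the exact implementation from the issue; the function and variable names are placeholders):

import numpy as np
from scipy import sparse
from biom import Table

def groupby_sum(table, feature_to_genome):
    """Collapse a (features x samples) biom Table by summing all features
    that map to the same genome; feature_to_genome is a plain dict."""
    feats = [f for f in table.ids('observation') if f in feature_to_genome]
    sub = table.filter(feats, axis='observation', inplace=False)

    genomes = sorted({feature_to_genome[f] for f in sub.ids('observation')})
    genome_idx = {g: i for i, g in enumerate(genomes)}

    # binary indicator matrix: rows = genomes, columns = retained features
    rows = [genome_idx[feature_to_genome[f]] for f in sub.ids('observation')]
    cols = np.arange(len(rows))
    ind = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)),
                            shape=(len(genomes), len(rows))).tocsr()

    # a single sparse matrix product does all of the per-genome summing
    collapsed = ind @ sub.matrix_data          # (genomes x samples)
    return Table(collapsed, genomes, list(sub.ids('sample')))

The single sparse product replaces the per-group Python loop, which is where the speedup comes from.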