biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

Faster groupby operations #862

Closed mortonjt closed 3 years ago

mortonjt commented 3 years ago

I've been performing quite a number of collapse operations and am noticing that they can be extremely slow. Specifically, I'm trying to map deblurred reads to reference genomes, and that process takes several hours (in fact, it's still running); see the code below for an example.

# seq_to_genome is a pd.Series that maps reads to genomes, obtained from a SAM file.
# For now, each read uniquely maps to 1 genome.
# t is the relevant biom table.
bin_f = lambda i, x: seq_to_genome.loc[i]
t = t.collapse(bin_f, norm=False, min_group_size=1,
               axis='observation').sort(axis='observation')

Ultimately, all I want to do is sum up all of the reads that hit the same genome.

Because this collapse function was taking too long, I've rigged together a groupby_sum function that speeds this up via sparse matrix multiplication.

import numpy as np
import pandas as pd
import biom
from scipy.sparse import coo_matrix


def groupby_sum(table, category):
    """Perform a groupby-sum operation on a biom table.

    Parameters
    ----------
    table : biom.Table
        Biom table to perform the groupby on.
    category : pd.Series
        Deblurred-read to genome mapping.
    """
    X = table.matrix_data  # d x n for d observations and n samples
    c = category.loc[table.ids(axis='observation')]  # order reads as in the table

    to = c.value_counts()
    to = pd.Series(np.arange(len(to)), index=to.index)  # mapping from genome ids to numbers
    fr = pd.Series(np.arange(len(c)), index=c.index)    # mapping from deblurred read ids to numbers
    i = c.apply(lambda x: to.loc[x]).values  # numerical ids for genomes
    j = fr.values                            # numerical ids for deblurred reads
    z = np.ones(len(i))  # one entry per read -> genome assignment
    A = coo_matrix((z, (i, j)))  # k x d assignment matrix for k genomes and d reads
    Y = A @ X  # groupby summing via sparse matrix multiplication
    t = biom.Table(Y, list(to.index), table.ids(axis='sample'))
    return t

This function collapses to genomes within seconds, so it is at least hundreds of times faster than the current collapse function. Would there be interest in incorporating this into biom? Of course, this is a bit more specialized than collapse, so it may warrant its own function.
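For illustration, a hypothetical call using the names from the first snippet (this usage example is my sketch, not tested code from the original post):

# t: the biom.Table from the first snippet
# seq_to_genome: pd.Series indexed by deblurred read ids, values are genome ids
genome_table = groupby_sum(t, seq_to_genome)
print(genome_table.shape)  # (number of genomes, number of samples)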

wasade commented 3 years ago

Why not just do:

for part, tab in table.partition(bin_f, axis='observation'):
    counts = pd.Series(tab.sum('sample'), index=tab.ids('sample'))

or maybe I'm misunderstanding the need?
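To make that concrete, here is a sketch of how I read the intended usage (my assumption, reusing bin_f from the first snippet), accumulating the per-partition sums rather than overwriting counts on each pass:

import pandas as pd

sums = {}
for part, tab in table.partition(bin_f, axis='observation'):
    # per-sample totals for this genome partition
    sums[part] = pd.Series(tab.sum('sample'), index=tab.ids('sample'))

# rows are genomes, columns are samples
summed = pd.DataFrame(sums).T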

What size table is this and is collapse still slow even without the sort?

mortonjt commented 3 years ago

Sorry, I'm a little confused about the partition function ... The groupby_sum function is closer to what is done with taxonomy collapsing (see the sketch below).

The biom table I'm working with is 172505 x 10491 (i.e. 170k deblurred features and 10k samples). Of those 170k reads, 48k reads mapped to 1863 reference genomes.

I want to map all of the deblurred reads to reference genomes, and collapse the deblurred counts for all reads that map to the same genome.
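For reference, taxonomy-style collapsing with biom's collapse usually looks like this (a minimal sketch; it assumes each observation carries a 'taxonomy' metadata list, which is not stated for the table in this thread):

# collapse observations to the phylum level; 'taxonomy' metadata is assumed
phylum_f = lambda id_, md: '; '.join(md['taxonomy'][:2])
phylum_table = table.collapse(phylum_f, norm=False, axis='observation')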

wasade commented 3 years ago

Okay thanks. Is it the sort that's slow? That table isn't particularly large. What version of biom is being used? We do these collapses of a similar scale regularly for microsetta, so something is acting unexpectedly here.

mortonjt commented 3 years ago

This is biom version 2.1.10. I'll get timings for the sort in a bit, but note that the sort is across 1863 genome ids, so I doubt that would take >4 hours ...

wasade commented 3 years ago

Sure, but the collapse function uses native scipy.sparse bulk matrix methods so it also shouldn't be slow :) Does the table have metadata? And what happens if you use a dict lookup rather than pandas .loc?
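A minimal sketch of that suggestion, reusing the names from the original snippet (the dict conversion is my assumption of what was meant, not tested code from this thread):

# convert the pandas Series to a plain dict so each bin lookup avoids .loc
seq_to_genome_d = seq_to_genome.to_dict()
bin_f = lambda i, x: seq_to_genome_d[i]

# collapse without the trailing sort, to isolate its cost
t = t.collapse(bin_f, norm=False, min_group_size=1, axis='observation')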

mortonjt commented 3 years ago

Bam! That, plus removing the sort function, seems to have resolved the problem -- it now runs in seconds. Thanks for the feedback!

wasade commented 3 years ago

Interesting. Would it be possible to check whether it was the sort or the dict lookup? If it was the former, that is a no-no and would be good to address.

mortonjt commented 3 years ago

I think the sort function actually could be problematic, since removing that also completed in seconds (with the .loc lookup).

wasade commented 3 years ago

Any chance you could edit your local library and add self._data = self._data.tocsr() right above this line (https://github.com/biocore/biom-format/blob/master/biom/table.py#L2182)? And then try it again with self._data = self._data.tocsc()?

What I wonder is whether the fancy indexing is implicitly triggering a structure change in scipy; basically, a switch between row order and column order.
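To illustrate the concern (a standalone sketch, not code from this thread): row-oriented fancy indexing is cheap on CSR but costs more on CSC, because CSC stores the matrix column-major:

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
M = sparse.random(170000, 10000, density=1e-4, format='csr', random_state=rng)
rows = rng.choice(170000, size=1800, replace=False)

fast = M[rows, :]           # CSR is row-major, so row selection is cheap
slow = M.tocsc()[rows, :]   # CSC is column-major; row selection is slower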