Closed ElDeveloper closed 9 years ago
Any chance you could do a profile? (`%prun` I believe)
nm, I'll run it
Damn, need to profile within cython to get at it
1752274 function calls (1573488 primitive calls) in 18.642 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 17.508 17.508 17.595 17.595 {biom._subsample._subsample}
164408/2 0.308 0.000 0.781 0.390 copy.py:145(deepcopy)
27854 0.157 0.000 0.157 0.000 {method 'reduce' of 'numpy.ufunc' objects}
164408 0.095 0.000 0.124 0.000 copy.py:267(_keep_alive)
14382/2 0.062 0.000 0.781 0.390 copy.py:234(_deepcopy_tuple)
13474 0.060 0.000 0.346 0.000 copy.py:226(_deepcopy_list)
14380 0.059 0.000 0.661 0.000 copy.py:306(_reconstruct)
343206 0.056 0.000 0.056 0.000 {method 'get' of 'dict' objects}
2 0.051 0.026 0.184 0.092 {biom._filter._filter}
2 0.051 0.025 0.060 0.030 table.py:398(cast_metadata)
301864 0.033 0.000 0.033 0.000 {method 'append' of 'list' objects}
371054 0.028 0.000 0.028 0.000 {id}
13474 0.025 0.000 0.119 0.000 table.py:2297(<lambda>)
27854 0.024 0.000 0.181 0.000 _methods.py:23(_sum)
14380 0.023 0.000 0.023 0.000 {method '__reduce_ex__' of 'object' objects}
14382 0.016 0.000 0.016 0.000 {range}
28762 0.015 0.000 0.015 0.000 {getattr}
122172 0.013 0.000 0.013 0.000 copy.py:198(_deepcopy_atomic)
1 0.009 0.009 18.637 18.637 table.py:2212(subsample)
43150 0.009 0.000 0.009 0.000 {isinstance}
6 0.006 0.001 0.006 0.001 util.py:201(index_list)
14380 0.006 0.000 0.100 0.000 {method 'sum' of 'numpy.ndarray' objects}
1 0.005 0.005 18.642 18.642 <string>:1(<module>)
@ElDeveloper, btw, your `%%timeit` includes the time to load the table
No, I don't think the `%%timeit` includes the time to load the table: with `%%timeit`, the first line is a setUp-like statement that gets executed with every loop but is not taken into account in the time measurements. Cool thing I learned today.
what? crazy, okay.
I can't get the profiler to actually show the Cython code, unfortunately. It seems like adding the directive and following the tutorial here is not working for whatever reason, so I can't make any suggestions about what is slow. I think it's the array expansion, and I've posted on SO about it.
Note, this is still like an order of magnitude faster than QIIME's rarefaction code at least. I had it even faster in pure C, but agree, it would be nice if this were sped up.
Yeah, super handy; found it in this IPython issue: https://github.com/ipython/ipython/issues/5767#issuecomment-41920887
In that case, this is not really an issue then?
Okay, I think I found one way to optimize (using Python/NumPy, not Cython, below). I kicked the inner expansion loop around and got this:
>>> def expand(a):
... b = np.empty(a.sum(), dtype=np.int32)
... b_idx = 0
... for a_idx, i in enumerate(a):
... for j in range(i):
... b[b_idx] = a_idx
... b_idx += 1
... return b
>>> a = randint(0, 1000, 1000)
>>> %timeit expand(a)
10 loops, best of 3: 83.9 ms per loop
>>> r = arange(1000)
>>> %timeit repeat(r, a)
100 loops, best of 3: 2 ms per loop
Turns out we can use `repeat`, so I think this will be a quick fix. Will issue a PR momentarily.
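For completeness, a minimal sketch (plain NumPy, using the names from the snippet above) confirming that `np.repeat` produces the same expansion as the loop:

```python
import numpy as np

def expand(a):
    # Loop version from above: write index a_idx into the output a[a_idx] times
    b = np.empty(a.sum(), dtype=np.int32)
    b_idx = 0
    for a_idx, i in enumerate(a):
        for j in range(i):
            b[b_idx] = a_idx
            b_idx += 1
    return b

a = np.random.RandomState(0).randint(0, 1000, 1000)
r = np.arange(len(a))

# np.repeat(r, a) repeats r[i] exactly a[i] times, in order --
# the same result as the loop, but vectorized in C
assert np.array_equal(expand(a), np.repeat(r, a))
```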
It's not possible to "see" inside a function using `cProfile`, so maybe that's what you were seeing when trying to profile `_subsample`, Daniel?
Thanks for the `%%timeit` trick @ElDeveloper!
No problem!
Yes, but they're just the line where the function definition starts. E.g., I get
ncalls tottime percall cumtime percall filename:lineno(function)
1 10.788 10.788 11.106 11.106 _subsample.pyx:18(_subsample)
because `def _subsample(...` is on line 18 of `_subsample.pyx` for me (added a few imports etc. on top).
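For anyone trying to reproduce this, the mechanism Cython documents for letting `cProfile`/`%prun` attribute time to compiled functions is the `profile` directive at the top of the `.pyx` file, followed by a rebuild. A sketch of what the top of the file might look like (the filename is from this thread; the rest of the module is elided):

```cython
# cython: profile=True
# _subsample.pyx -- the directive above makes Cython emit profiling hooks,
# so cProfile can see functions defined in this module (as in the output
# above). For line-by-line data, Cython also supports linetrace=True plus
# the CYTHON_TRACE C macro, at a larger runtime cost.
```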
Ah, okay, I see. My bad. Agreed, the next target is `permutation`... any ideas?
Fooled around with this further, in particular with doing smarter things to aggregate counts (see here); however, I was not able to net any additional gains. Also attempted replacing `permutation(...)[:n]` with `choice(..., n, replace=False)`, but that change did not result in a difference either. Last, I tweaked types and disabled `boundscheck` and `wraparound`, and also did not notice a significant difference. So, I'm closing this issue for now, as there isn't a clear path forward here for serial code. I think the next approach would be to thread this.
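For reference, the `boundscheck`/`wraparound` tweaks mentioned above are typically applied as `cython` decorators (or module-level directives); a hypothetical sketch only, since the actual function signature and body are not shown in this thread:

```cython
cimport cython

@cython.boundscheck(False)  # skip index range checks in hot loops
@cython.wraparound(False)   # disallow negative indexing
def _subsample(data, n):
    ...
```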
To add, I looked at the NumPy source. `permutation` and `choice` were comparable, as `choice` is doing a permutation behind the scenes. On more focused investigation, it is `permutation` that is the slow step. The NumPy implementation uses Fisher-Yates and is concise, and I believe it is effectively bound by the performance of the PRNG. The next step to improve performance here, I believe, is to target a multi-threaded implementation. Cython has built-in support for OpenMP, but that'll take more investigation than is feasible for the pending release.
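To summarize the scheme discussed in this thread, here is a rough pure-NumPy sketch of subsampling without replacement (expand with `repeat`, shuffle with `permutation`, keep the first `n`, re-aggregate). The function name and signature are illustrative, not the actual biom `_subsample` implementation:

```python
import numpy as np

def subsample_counts(counts, n, rng):
    # Expand counts into one entry per observation: index i appears counts[i] times
    expanded = np.repeat(np.arange(counts.size), counts)
    # permutation is the slow step noted above; choice(..., replace=False)
    # permutes behind the scenes, so the two perform comparably
    picked = rng.permutation(expanded)[:n]
    # Re-aggregate the n sampled observations back into a counts vector
    return np.bincount(picked, minlength=counts.size)

counts = np.array([10, 0, 5, 85])
out = subsample_counts(counts, 20, np.random.RandomState(42))
assert out.sum() == 20          # exactly n observations kept
assert np.all(out <= counts)    # without replacement: can't exceed input
```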
Subsample seems to be under-performing; this is the table that brought this to @adamrp's and my attention: