NKI-CCB / DISCOVER

DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data

How much runtime memory is required? #4

Closed solo7773 closed 6 years ago

solo7773 commented 6 years ago

On my cluster I allocated 16 GB of RAM, but the following call failed to run:

events = discover.DiscoverMatrix(mutBinary)

where mutBinary is a binary mutation matrix of shape (26276, 955).


On my laptop, configured with 8 GB of RAM, the following also failed to run:

discover.pairwise_discover_test(events[subset], alternative='greater')

where subset contains 478 True values, i.e. 478 genes, so 478 × 477 / 2 = 114003 gene pairs need to be tested.


Could you provide a formula to estimate the memory discover needs?

scanisius commented 6 years ago

I performed some back-of-the-envelope calculations for the memory needs of the two functions you encountered issues with. In the following, M refers to the number of genes, and N to the number of tumours in your mutation matrix.

For DiscoverMatrix, I end up with 56 (M + N) + 16 M N bytes. In your case, that would amount to less than 400 MB. There may be some additional memory overhead in the communication between Python and Fortran, but it seems to me that 16 GB should be more than enough. What error message do you get when running this function on your data?
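
As a quick sanity check, the snippet below simply evaluates that estimate for your matrix dimensions (the constants come from the formula above, not from measuring the library itself):

M, N = 26276, 955  # genes and tumours in the mutation matrix

# DiscoverMatrix estimate: 56 (M + N) + 16 M N bytes
discover_matrix_bytes = 56 * (M + N) + 16 * M * N
print("DiscoverMatrix: %.0f MiB" % (discover_matrix_bytes / 2.0 ** 20))  # about 384 MiB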

For pairwise_discover_test, the largest part of the memory consumption should be summarised by 12 M N + 22 M². For your example, assuming the number of tumours is still 955, this would mean more than 14 GB, indeed more than is available on your laptop.
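
The same kind of check for the second estimate; note that the quoted figure of more than 14 GB corresponds to plugging the full gene count into this formula:

M, N = 26276, 955  # full gene count and number of tumours

# pairwise_discover_test estimate: 12 M N + 22 M^2 bytes
pairwise_bytes = 12 * M * N + 22 * M ** 2
print("pairwise_discover_test: %.1f GiB" % (pairwise_bytes / 2.0 ** 30))  # about 14.4 GiB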

Most of the memory used by pairwise_discover_test is required only for the multiple testing correction, which uses a procedure specifically meant for discrete test statistics (such as DISCOVER's). The downside is that this procedure is more memory-hungry than a normal Benjamini-Hochberg FDR calculation. One of the planned features for a future release is the ability to choose the normal BH procedure over the adapted one.

In the meantime, if you do not have access to a machine with more RAM, you can use the function pairwise_discover_test_lowmem defined below. It can be used as a drop-in replacement for the normal pairwise_discover_test function. However, since it uses the standard Benjamini-Hochberg procedure for multiple testing correction, you will most likely have fewer significant hits.

import _discover
import discover
import numpy
import pandas

def pairwise_discover_test_lowmem(x, alternative="less"):
    assert alternative in ["less", "greater"]

    events = x.events
    bg = x.bg

    # p-values for all gene pairs, based on the estimated background probabilities
    pvalues = _discover.fdr.computep(events, bg, events, bg, alternative == "less")
    # test each pair once: blank the lower triangle and diagonal, then correct for multiple testing
    pvalues[numpy.tril_indices_from(pvalues)] = numpy.nan
    qvalues = fdr(pvalues.ravel()).reshape(pvalues.shape)

    return discover.pairwise.PairwiseDiscoverResult(
        pandas.DataFrame(pvalues, index=x.rownames, columns=x.rownames),
        pandas.DataFrame(qvalues, index=x.rownames, columns=x.rownames),
        1.0, alternative)

def fdr(p, pi0=1.0):
    # Benjamini-Hochberg FDR estimation (optionally scaled by pi0), ignoring NaN entries
    if not 0 <= pi0 <= 1:
        raise ValueError("Invalid value for pi0: %s. Legal values are between 0 and 1" % pi0)

    nna = ~numpy.isnan(p)
    q = numpy.repeat(numpy.nan, len(p))
    p = p[nna]

    i = numpy.arange(len(p), 0, -1)
    o = numpy.argsort(p)[::-1]
    ro = numpy.argsort(o)

    q[nna] = numpy.minimum(1, numpy.minimum.accumulate(float(pi0) * len(p) / i * p[o])[ro])
    return q
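
With these two functions defined, the call from the original question can be swapped in one-for-one (events and subset refer to the objects from that question):

# drop-in replacement for the pairwise_discover_test call that exceeded 8 GB
result = pairwise_discover_test_lowmem(events[subset], alternative="greater")
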
scanisius commented 6 years ago

This issue has been inactive for a few months, so I am closing it. Please open a new issue if you have further questions.