davidvi / pypanda

Python implementation of PANDA (Passing Attributes between Networks for Data Assimilation)

Division issue #2

Open sinclaircooper opened 6 years ago

sinclaircooper commented 6 years ago

Hi, I'm getting some division errors when trying to run PANDA.

/path/to/.local/lib/python2.7/site-packages/numpy/lib/function_base.py:3167: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/path/to/.local/lib/python2.7/site-packages/numpy/lib/function_base.py:3168: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]

This appears to be related to np.corrcoef(self.expression_matrix), i.e. there is something in my input counts matrix that means numpy cannot generate a proper correlation matrix. I'm supplying a matrix of tissue aware normalised counts (using YARN).

Does PANDA expect normalised counts, TPMs, log2 counts?

Cheers

mararie commented 6 years ago

Hi! PANDA starts by generating a gene co-expression (correlation) matrix from the expression data. This can be done on many different data types. We prefer to use normalized counts, but TPMs and log2 counts will work too.

The issue you're having can happen if your input data includes genes that do not show any variation in expression. In principle, YARN should filter out genes that are not expressed across a certain percentage of samples (depending on the thresholds you're using), so that is not likely to happen. (It is still possible that a specific gene has the same non-zero count in all samples, but this is rather unlikely.) However, it may be that you're making your network on a subset of all samples, in which one or more genes are just not expressed.
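To illustrate the cause described above, here is a minimal sketch that reproduces the warning on a made-up matrix. The toy values and variable names are assumptions for illustration only and are not taken from PyPanda. A gene with identical values in every sample has a standard deviation of zero, so np.corrcoef ends up computing 0/0 for that gene's row and column.

```python
import warnings
import numpy as np

# Hypothetical toy matrix: rows are genes, columns are samples.
expression = np.array([
    [1.0, 2.0, 3.0, 4.0],  # gene A varies across samples
    [2.0, 1.0, 4.0, 3.0],  # gene B varies across samples
    [5.0, 5.0, 5.0, 5.0],  # gene C is constant, so its stddev is 0
])

# Silence the RuntimeWarning here so the example runs quietly;
# without this block, numpy emits the "invalid value" warning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    corr = np.corrcoef(expression)

# Rows of the correlation matrix that are entirely NaN
# correspond to genes with no variation.
nan_rows = np.isnan(corr).all(axis=1)

# The same genes can be found directly in the input,
# before any correlation is computed.
constant_genes = expression.max(axis=1) == expression.min(axis=1)

print(nan_rows)
print(constant_genes)
```

Checking the input for constant rows on the exact subset of samples used for the network should show which genes are responsible.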

The easiest option is to filter out these genes before running PANDA. Another workaround is to modify the PyPanda code so that correlations that come back as NaN are set to 0 (this is what we did for the MATLAB code we used to run networks on GTEx data).
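Both workarounds can be sketched in a few lines of numpy. The helper names drop_constant_genes and corrcoef_nan_to_zero are hypothetical and are not part of PyPanda; this is only a rough illustration of the idea, assuming genes are rows and samples are columns.

```python
import numpy as np

def drop_constant_genes(expression, gene_names):
    """Option 1 (hypothetical helper): remove genes with no variation.

    Returns the filtered matrix and the names of the genes that were kept.
    """
    expression = np.asarray(expression, dtype=float)
    # A gene is constant when its maximum equals its minimum.
    keep = expression.max(axis=1) != expression.min(axis=1)
    kept_names = [name for name, k in zip(gene_names, keep) if k]
    return expression[keep], kept_names

def corrcoef_nan_to_zero(expression):
    """Option 2 (hypothetical helper): correlation with NaN replaced by 0."""
    # Suppress the floating-point warning raised by the 0/0 division.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(expression)
    corr[np.isnan(corr)] = 0.0
    return corr

# Made-up example data.
names = ["geneA", "geneB", "geneC"]
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],  # constant gene
    [2.0, 1.0, 4.0, 3.0],
])

filtered, kept = drop_constant_genes(expr, names)
corr = corrcoef_nan_to_zero(expr)
```

Note that with the second option the self-correlation of a constant gene also becomes 0 rather than 1, and with the first option the gene list changes, so any other inputs keyed on gene names may need the same filtering.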