KrishnaswamyLab / phateR

PHATE dimensionality reduction method implemented in R
GNU General Public License v2.0

Question about input data formatting #18

Closed cohnr closed 6 years ago

cohnr commented 6 years ago

I have a table of raw counts of single cells that I am inputting to the phateR pipeline. The column headings are the cell names and the row headings are the gene names. When running phate(counts_table) I'm seeing an error after "Calculating SVD..."

Error in py_call_impl(callable, dots$args, dots$keywords) : ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

There are no NaN values in the data table but I'm wondering if the counts_table should be formatted differently before running phate?

Thanks!

scottgigante commented 6 years ago

Hi @cohnr ,

Thanks for trying out PHATE, and thanks for the bug report!

Before running PHATE on single-cell RNAseq data, we normally library-size normalize (you can do this with phateR::library_size_normalize) and then either square-root or log transform the data.
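For illustration, here is a rough base-R sketch of what that preprocessing amounts to. It is not phateR's implementation: the toy matrix, the variable names, and the choice to rescale to the median library size are all assumptions made for the example, and it assumes cells are on rows and genes on columns.

```r
# Toy counts matrix: 3 cells (rows) x 4 genes (columns). Values are made up.
counts <- matrix(c(1, 0, 3, 6,
                   2, 2, 0, 0,
                   0, 5, 5, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:3), paste0("gene", 1:4)))

# Library size = total counts per cell (per row).
lib_size <- rowSums(counts)

# Divide each cell by its library size, then rescale to the median library
# size (assumption: the rescaling target is illustrative only).
# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so row i is divided by lib_size[i].
norm <- counts / lib_size * median(lib_size)

# Square-root transform; log1p(norm) would be the log alternative.
norm_sqrt <- sqrt(norm)
```

A cell with zero total counts would produce NaN in the division, so such cells would need to be removed beforehand.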

The issue you're having, however, should be unrelated to this. Can I ask you to check if your data has duplicates? You can do this with the following R code, where data is the matrix/data frame you input to PHATE.

sum(duplicated(data))
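A small self-contained sketch of that check, on a made-up matrix, might look like this. It assumes cells are on rows; duplicated() compares rows of a matrix or data frame, so a table with genes on rows would need t(data) to compare cells instead. The non-finite check is an extra step added here because the error message mentions NaN and infinity.

```r
# Toy matrix: cellC is an exact copy of cellA. Values are made up.
data <- rbind(cellA = c(1, 0, 2),
              cellB = c(0, 3, 1),
              cellC = c(1, 0, 2))

# Number of rows that repeat an earlier row.
n_dup <- sum(duplicated(data))

# Number of NaN, NA, or infinite entries (is.finite() is FALSE for all three).
n_nonfinite <- sum(!is.finite(data))
```

For a data frame, as.matrix(data) would be needed before calling is.finite().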

Thanks, Scott

cohnr commented 6 years ago

Hi Scott,

Thanks so much for your help with this. When I run sum(duplicated(data)) the output is "[1] 19"

Is there a way I could find the duplicated data in my data table? Thanks!

Rachel

scottgigante commented 6 years ago

Hi Rachel,

I will include a patch in the next version of PHATE to check for duplicated cells. In the meantime, you can check which rows are duplicated with

which(duplicated(data))

and filter your data with

data <- data[!duplicated(data),]
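As a sketch of how these two calls behave, here is a toy example with made-up values and hypothetical cell names. Note that duplicated() only flags the later copies; adding fromLast = TRUE and combining the two results also picks up the first occurrence, which helps when you want to see every member of a duplicate group.

```r
# Toy matrix: cellC repeats cellA. Values and names are made up.
data <- rbind(cellA = c(1, 0, 2),
              cellB = c(0, 3, 1),
              cellC = c(1, 0, 2),
              cellD = c(4, 4, 4))

# Row indices of the later copies only.
dup_idx <- which(duplicated(data))

# Flag every row that belongs to a duplicate group, first occurrence included.
all_copies <- as.vector(duplicated(data) | duplicated(data, fromLast = TRUE))
dup_names <- rownames(data)[all_copies]

# Keep the first occurrence of each row; drop = FALSE keeps the result a
# matrix even if only one row remains.
deduped <- data[!duplicated(data), , drop = FALSE]
```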

Let me know if that fixes your issue.

Thanks, Scott