K means clustering used. Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Ankita-1211 commented 11 months ago

I am importing data from deepTools computeMatrix. The import works fine, but I get an error when clustering.

My data does not seem to contain any NA values.

computeMatrix command:

computeMatrix reference-point --regionsFileName ./grc42_basic_gene_pc.bed --scoreFileName ./*.bw --referencePoint TSS --beforeRegionStartLength 2000 --afterRegionStartLength 2000 --binSize 100 --averageTypeBins mean --missingDataAsZero --numberOfProcessors 32 --outFileName matrix.gz

dat_mat <- import_deepToolsMat(con='/home/ubuntu/Downloads/matrix.gz')

dat_mat
class: profileplyr
dim: 16738 40
metadata(0):
assays(6): Sorted_Hindbrain_day_12_1_Small_bigWig Sorted_Hindbrain_day_12_2_Small_bigWig ... Sorted_Liver_day_12_1_Small_bigWig Sorted_Liver_day_12_2_Small_bigWig
rownames: NULL
rowData names(3): names score dpGroup
colnames: NULL
colData names(0):

clusterRanges(object, fun = rowMeans, kmeans_k = 3)
K means clustering used.
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

dougbarrows commented 11 months ago

Would you be able to share the matrix? I'd be happy to debug the issue and having the matrix in hand would make that a lot easier.

pwnunub commented 4 months ago

Is there any follow-up on this thread? I am having similar issues.

dougbarrows commented 4 months ago

Hi there, I never got my hands on the previous matrix that was causing issues. Would it be possible for you to share an example of a deepTools matrix (or the resulting profileplyr object) that is giving you the error?

pwnunub commented 4 months ago

Hi, I have attached an rds file containing the proplyrObject.

proplyrObject.zip

dougbarrows commented 4 months ago

It looks like the object has some rows that contain all zeroes (no signal). By default the clusterRanges function will scale the rows, so if a row has no variance, the row cannot be scaled: its standard deviation is zero, so scaling presumably yields NaN values, which the k-means step then rejects with the error above. We should include a more sensible error when this happens, or deal with it internally, so thanks for pointing this out.
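To make the failure concrete, here is a minimal base R sketch on a toy matrix. It assumes z-score style row scaling (done here with `scale()`), which may differ from what clusterRanges does internally, and the names `mat`, `scaled`, `nan_rows` and `result` are made up for illustration.

```r
# Toy matrix: row 2 is all zeroes, so its variance is zero.
mat <- rbind(
  c(1, 5, 9),
  c(0, 0, 0),
  c(2, 2, 8)
)

# scale() works column-wise, so transpose, scale, and transpose back
# to z-score each row.
scaled <- t(scale(t(mat)))

# The all-zero row becomes NaN (0 divided by a standard deviation of 0).
nan_rows <- apply(scaled, 1, function(x) any(is.nan(x)))
print(nan_rows)

# kmeans() does not accept non-finite input; capture the error message
# instead of stopping the script.
result <- tryCatch(
  kmeans(scaled, centers = 2),
  error = function(e) conditionMessage(e)
)
print(result)
```

The input matrix itself contains no NA values; the NaN values only appear after scaling, which would explain why checking the raw data for NA finds nothing.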

The clustering should work if you include the 'scaleRows = FALSE' argument in the clusterRanges function, but this will likely cluster the rows based on overall signal strength rather than on the patterns that exist in your data, so it's likely not what you are looking for.
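The difference between clustering on signal strength and clustering on pattern can be seen in a small base R sketch. This uses plain `kmeans()` on toy data rather than profileplyr, assumes z-score row scaling, and all row names and values are invented for illustration.

```r
set.seed(1)

# Two shapes (rising, falling), each at a low and a high signal level.
mat <- rbind(
  low_rising   = c(1, 2, 3),
  low_falling  = c(3, 2, 1),
  high_rising  = c(100, 210, 300),
  high_falling = c(300, 190, 100)
)

# Without scaling, distances are dominated by magnitude.
unscaled <- kmeans(mat, centers = 2, nstart = 10)

# With row scaling, only the shape of each row matters.
scaled_mat <- t(scale(t(mat)))
scaled <- kmeans(scaled_mat, centers = 2, nstart = 10)

print(unscaled$cluster)  # low rows group together, high rows group together
print(scaled$cluster)    # rising rows group together, falling rows group together
```

In this toy case the unscaled run separates low from high signal, while the scaled run separates rising from falling profiles.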

What you probably want to do is remove the rows from the object that have no changes across the samples (no variance, likely all zeroes), and then do the clustering. See the code below for an example of how to do this. Let me know if this doesn't work for you.

library(profileplyr)

object <- readRDS("~/Downloads/proplyrObject")

# summarize object with same function used for clustering (rowMeans)
mat_sum <- profileplyr::summarize(object, fun = rowMeans, output = "matrix")

# identify rows with no variance; this will almost certainly be because a row had all zeroes
# (rowVars() comes from the matrixStats package)
bad_rows <- matrixStats::rowVars(mat_sum) == 0

# drop the rows with no variance, then cluster
object2 <- object[!bad_rows, ]
clusterRanges(object2, fun = rowMeans, kmeans_k = 3)
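For anyone who wants to try the filtering idea without a profileplyr object, the following is a rough base R sketch on a toy matrix. It assumes z-score row scaling followed by plain `kmeans()`, which is only an approximation of what clusterRanges does, and the variable names are hypothetical.

```r
# Toy summary matrix: rows are ranges, columns are samples.
mat <- rbind(
  c(1, 2, 3),
  c(0, 0, 0),   # all zeroes: no variance
  c(2, 4, 7),
  c(3, 2, 1),
  c(9, 5, 2),
  c(4, 4, 4)    # constant but non-zero: also no variance
)

# Keep only rows whose variance across samples is greater than zero.
row_var <- apply(mat, 1, var)
keep <- row_var > 0
filtered <- mat[keep, , drop = FALSE]

# Row scaling now yields finite values only, so kmeans() can run.
scaled <- t(scale(t(filtered)))
set.seed(42)
fit <- kmeans(scaled, centers = 2, nstart = 10)

print(sum(!keep))    # number of rows dropped
print(fit$cluster)   # cluster assignment of the remaining rows
```

Note that a constant non-zero row is dropped as well, since it has zero variance just like an all-zero row.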

pwnunub commented 4 months ago

Thanks for the help!

pwnunub commented 4 months ago

I am relatively new to this type of analysis, and I was wondering if you could advise whether my plan is correct or doable:

Signal files: ATAC normalized signals for TH17-23 and TH17-B (downloaded from GEO as bigWig)
testRanges: promoters of canonical mm10 transcripts
Creating a ChIP profile of this
Clustering the signals at promoters between the two groups and determining the top differentially accessible regions
Doing a motif enrichment of the differentially accessible regions

Thanks, Kind regards, Caleb

RockefellerUniversity / profileplyr, issue #11