GIS-SP-Group / RCA

R package for robust clustering of single cell RNA sequencing data
MIT License

Error - zero dimension #2

Open asmagen opened 7 years ago

asmagen commented 7 years ago

Hello, I get the following error after following the manual for a single-cell dataset I'm working with.

data_obj = featureConstruct(normalized,method = "SelfProjection")
Error in cor(fpkm_for_clust0, method = "pearson") : 'x' has a zero dimension

Why does it happen and how can I solve this? Thanks, A

asmagen commented 7 years ago

Also I get this error: Error in cor(fpkm_temp, method = "pearson") : Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'. I didn't have any NA values in my dataset. Any idea what might cause that? Thank you.
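One way missing values can appear downstream even when the input table has none is sketched below. It uses only base R's `cor` on a hypothetical toy matrix (the names `expr`, `cc`, and `zero_var` are made up for illustration) and shows that a cell with constant expression has zero standard deviation, so its correlations come out as NA. Whether this is what happens inside RCA is not confirmed anywhere in this thread; it is only a check worth running.

```r
# Hypothetical toy matrix (genes x cells); cell "c3" has no expressed genes.
expr <- cbind(
  c1 = 1:10,
  c2 = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
  c3 = 0,
  c4 = (1:10)^2
)

# The input itself contains no missing values.
anyNA(expr)

# A constant column has zero standard deviation, so base R returns NA
# for every correlation involving it (with a warning, suppressed here).
cc <- suppressWarnings(cor(expr, method = "pearson"))
anyNA(cc)

# Flag the cells (columns) whose variance is exactly zero.
zero_var <- apply(expr, 2, var) == 0
names(which(zero_var))
```

Running the same zero-variance check on the transformed matrix would show whether any cells are constant before correlations are computed.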

GIS-SP-Group commented 7 years ago

Dear Asmagen,

Have you followed the gene name requirement as stated in the manual?

Input data: A data frame of expression values (FPKM, TPM, UMI counts ...), with rows representing genes and columns representing cells. Note the current version of RCA only accepts gene names in the following format: "GenomeLocation_HGNCGeneName_EnsembleID", from which the "HGNCGeneName" is extracted for RCA analysis. For input data with only HGNC names, the users need to attach two strings to the HGNC names to make them into the "XXXX_HGNCGeneNames_YYYY" format.
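A minimal sketch of the renaming described in the manual excerpt, using base R only. The symbols in `hgnc` are placeholders, and the check assumes the HGNC name is the second underscore-delimited field, which is how the excerpt reads but is not verified against the package source.

```r
# Hypothetical HGNC symbols standing in for the row names of an expression table.
hgnc <- c("BRCA1", "TP53", "GAPDH")

# Attach a placeholder string on each side: "XXXX_<symbol>_YYYY".
formatted <- paste("XXXX", hgnc, "YYYY", sep = "_")

# Sanity check: the middle field should be the original symbol.
# This assumes the symbols themselves contain no underscores.
middle <- vapply(strsplit(formatted, "_", fixed = TRUE),
                 function(parts) parts[2], character(1))
stopifnot(identical(middle, hgnc))

print(formatted)
```

For a real table the same `paste` call would be applied to `rownames()` of the expression data frame before it is passed to the package.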

asmagen commented 7 years ago

So for gene symbol ‘BRCA1’ I need to use ‘XXXX_BRCA1_YYYY’?

GIS-SP-Group commented 7 years ago

Correct. Sorry for the inconvenience and we will improve this in the next version.

Huipeng

asmagen commented 7 years ago

The same issue still occurs. It doesn't have anything to do with the gene names. What can be done about it?

GIS-SP-Group commented 7 years ago

Asmagen,

I wonder whether you followed the procedure in the vignettes.

Please paste your script here.

Huipeng

asmagen commented 7 years ago

library(RCA)

# construct data object

rownames(dataset$counts) = sapply(rownames(dataset$counts),function(v) paste('XXXX',v,'YYYY',sep='_'))
data_obj = dataConstruct(dataset$counts);

# filter out lowly expressed genes

data_obj = geneFilt(obj_in = data_obj);

# normalize gene expression data

data_obj = cellNormalize(data_obj,method='scQ');

# log transform the data

normalized = dataTransform(data_obj,"log10");

# project the expression data into Reference Component space

data_obj = featureConstruct(normalized,method = "SelfProjection")

# generate cell clusters

data_obj = cellClust(data_obj,method="hclust",deepSplit_wgcna=environment$cluster.param2,min_group_Size_wgcna=2)

cluster.association = data_obj$group_labels_color$groupLabel

GIS-SP-Group commented 7 years ago

Hi, Asmagen,

Could you provide the table of "normalized$fpkm_transformed" via email? It seems that the "featureConstruct" failed to select any features.

Huipeng

asmagen commented 7 years ago

It’s unpublished data so I can’t. It also doesn’t make much sense that the issue would be specific to my dataset.

GIS-SP-Group commented 7 years ago

Ok, since your script works well on our data set, this issue is likely specific to your data set.

Let me know if you are ok with sharing the following information, which might help us to figure out what's going on.

dim(normalized$fpkm_raw)
dim(normalized$fpkm)
sum(normalized$geneFilter)
dim(normalized$fpkm_transformed)
max(normalized$fpkm_transformed)
min(normalized$fpkm_transformed)

asmagen commented 7 years ago

Sure.

dim(normalized$fpkm_raw)
[1] 14919 1441
dim(normalized$fpkm)
[1] 13389 1441
sum(normalized$geneFilter)
[1] 13389
dim(normalized$fpkm_transformed)
[1] 7724 1441
max(normalized$fpkm_transformed)
[1] 2.045323
min(normalized$fpkm_transformed)
[1] 0

asmagen commented 7 years ago

Any news?

GIS-SP-Group commented 7 years ago

Dear Asmagen,

My guess is that the size of your matrix is not compatible with some hard-coded parameters in the package. We need to explore more for a solid answer though.

You could try to run the package with a randomly chosen subset (~500 cells) and see if the problem still exists.
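The subsetting suggested above could be sketched as follows in base R. The matrix `counts` is a hypothetical stand-in (20 made-up genes, with the 1441 cells reported earlier in the thread), and the seed is arbitrary; only the column sampling is the point.

```r
set.seed(42)  # arbitrary seed, only so the draw is reproducible

# Hypothetical genes x cells matrix standing in for the real expression table.
counts <- matrix(0, nrow = 20, ncol = 1441,
                 dimnames = list(paste0("XXXX_gene", 1:20, "_YYYY"),
                                 paste0("cell", 1:1441)))

# Draw about 500 cells (columns) without replacement.
n_keep <- min(500, ncol(counts))
keep <- sample(ncol(counts), n_keep)

# drop = FALSE keeps the result a matrix even if only one column is selected.
counts_sub <- counts[, keep, drop = FALSE]
dim(counts_sub)
```

Repeating the draw with a few different seeds would help tell a size-related failure from one caused by particular cells.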

H

asmagen commented 7 years ago

Hello, featureConstruct works when I randomly select 500 cells, which is a very small number compared to what recent scRNA-seq technologies produce. But the actual clustering fails:

Error in cor(fpkm_temp, method = "pearson") : Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'.

Does the code have hard-coded parameters that relate to the matrix size? How can it be resolved asap? Thanks, A

GIS-SP-Group commented 7 years ago

Hi, Asmagen,

We have tested our package on many data sets available on our side and it seems to work fine. We are indeed optimizing the package and will release the next version in the next couple of months.

But to have a quick solution for you, we really need something to mimic the difficulty you encountered. We don't need to see your full raw data set. But if you could generate a fake set that could be representative of the original one, that would be great.

Let me know what you think.

H

asmagen commented 7 years ago

Attached is a subset of the 3k PBMC dataset used as an example in the Seurat package. RCA didn't work on this public dataset either. Please let me know the status when you have news. example.data.RData.zip

asmagen commented 7 years ago

Hello, What's the status? Thanks, A

enhaofrank commented 7 years ago

Hi both,

Has this problem been solved? I get the same error. My data come from the 10x Genomics Cell Ranger single-cell pipeline. The expression table contains UMI counts, with rows representing genes and columns representing cells, and the gene names have been changed to the following format: "GenomeLocation_HGNCGeneName_EnsembleID". The error is:

data_obj = featureConstruct(normalized,method = "SelfProjection")
Error in cor(fpkm_for_clust0, method = "pearson") : 'x' has a zero dimension

Thank you very much! Frank

wiseflying commented 7 years ago

Dear all,

We have been testing the performance of RCA on multiple data sets on our side. Data sets from the Drop-seq protocol are usually sequenced shallowly, so some cells may have very few expressed genes (FPKM or UMI count > 0). This can cause problems for RCA.

So when running RCA on large data sets, please do a preliminary QC to filter out low-quality cells (those with sum(FPKM > 0) <= 1000 or sum(FPKM > 0) <= 500; the same applies to UMI count data).
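The QC step described above might look like the following in base R. This is a sketch on a hypothetical toy matrix: the names `counts`, `genes_detected`, and `min_genes` are illustrative, and the threshold is set to 3 only because the toy table has six genes; on real data the value from this comment (1000 or 500) would be used.

```r
# Hypothetical UMI count matrix (genes x cells).
counts <- cbind(
  good_cell = c(3, 1, 0, 2, 5, 1),
  poor_cell = c(0, 0, 1, 0, 0, 0),
  ok_cell   = c(1, 0, 2, 2, 0, 4)
)
rownames(counts) <- paste0("g", 1:6)

# Number of detected genes per cell, i.e. sum(count > 0) for each column.
genes_detected <- colSums(counts > 0)

# Toy threshold; 1000 or 500 is what the comment above suggests for real data.
min_genes <- 3

# Keep only cells with more than min_genes detected genes.
filtered <- counts[, genes_detected > min_genes, drop = FALSE]
colnames(filtered)
```

The filtered matrix would then be passed to the first step of the pipeline in place of the raw table.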

Please let me know if more stringent QC would solve the problem.

best Huipeng