GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

Unable to define clusters #191

Closed Sophiesze closed 4 years ago

Sophiesze commented 4 years ago

Hi GreenleafLab,

Apologies in advance: I don't think this belongs under the Documentation Request tag, but I don't know which tag fits my problem.

I have been analyzing my 10X data and found some clusters that I am not able to define. Some important clusters also seem to be missing. I wonder whether this is because my quality control is wrong or because of the data itself.

So I tried changing the cutoff values (TSS enrichment and log10nf) during QC, but there are still some clusters in the middle of the UMAP that I cannot define, and they contain a substantial number of cells. In the article that GreenleafLab published in Nature Biotechnology, the cutoffs are TSS > 8 and 3 < log10nf < 4.5. For my data, I chose TSS > 10 and 3.5 < log10nf < 4.2. The pre-QC plot is attached.
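For readers comparing the two sets of cutoffs, here is a minimal sketch of the filtering arithmetic. It is not ArchR code and does not use the ArchR API. It assumes that log10nf means log10 of the number of unique fragments per cell, and the function name `passes_qc` and the example cells are hypothetical. In raw counts, log10nf cutoffs of 3.5, 4.2 and 4.5 correspond to roughly 3,162, 15,849 and 31,623 fragments.

```python
import math

def passes_qc(tss, n_frags, min_tss=10.0, min_log10=3.5, max_log10=4.2):
    """Return True when a cell clears the TSS and log10(fragments) cutoffs.

    Defaults mirror the cutoffs chosen in the post above; pass
    max_log10=None to drop the upper bound on fragments.
    """
    log10nf = math.log10(n_frags)
    if tss <= min_tss or log10nf <= min_log10:
        return False
    return max_log10 is None or log10nf < max_log10

# Hypothetical cells: name -> (TSS enrichment, unique fragments)
cells = {
    "cell_a": (12.0, 5000),
    "cell_b": (9.0, 5000),
    "cell_c": (15.0, 20000),
    "cell_d": (11.0, 2000),
}

# Cutoffs from the post: TSS > 10, 3.5 < log10nf < 4.2
strict = [c for c, (tss, n) in cells.items() if passes_qc(tss, n)]

# Same lower bounds, no upper bound on fragments
no_upper = [c for c, (tss, n) in cells.items()
            if passes_qc(tss, n, max_log10=None)]

# Cutoffs quoted from the paper: TSS > 8, 3 < log10nf < 4.5
paper = [c for c, (tss, n) in cells.items()
         if passes_qc(tss, n, min_tss=8.0, min_log10=3.0, max_log10=4.5)]

print(strict, no_upper, paper)
```

In this toy example the stricter cutoffs keep only one of the four cells, while the paper's cutoffs keep all four, which shows how much a narrow log10nf window can shrink a dataset.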

I have been thinking about this problem for several days and hope you can help me with it.

Thank you so much!

Best, Sophie

pre-qc umap

rcorces commented 4 years ago

In this case, I think your cutoffs should look like this: image

We don't normally apply an upper bound on the number of fragments per cell, though if you wanted to apply one I would make it closer to 4.5 than 4.2.

Have you removed doublets?

Have you tried getting marker genes from gene scores for your unknown clusters?

Sophiesze commented 4 years ago

Thanks for your suggestions, I will try to change the arguments and run it again.

I have already removed some doublets with the default arguments using filterDoublets(). I also got the marker genes from the GeneScoreMatrix. It is hard to define the clusters that always gather in the middle, no matter which cutoff values I select in quality control.

I'm also not sure whether the two clusters on either side of the B cells (at the top of the panel) are pDCs and DCs. I have posted the marker UMAP below.

Thanks again for your reply!

5-Plot-UMAP-Marker-Genes-WO-Imputation.pdf

Sophiesze commented 4 years ago

I only removed doublets from the samples with R2 > 0.9 and skipped those with R2 < 0.9.

rcorces commented 4 years ago

Can you show a UMAP colored by (1) Sample and (2) doublet enrichment?

Sophiesze commented 4 years ago

Hi, I reran my data with the QC values you suggested, once with the upper bound on the number of fragments and once without, to see which is better. Both sets of plots are posted below. Honestly, I can't tell which one is better. With the upper bound there are 91352 cells; without it there are 93273. I can still see some batch effects after running Harmony. Sorry that I have to hide the sample names on GitHub.

Thanks for your help!

- With the upper bound of log10nf < 4.5:

UB-samples UB-clusters

04-UB-DoubletEnrichment-Umap-0603.pdf

5-withUB-Plot-UMAP-Marker-Genes-WO-Imputation.pdf

- Without the upper bound:

without-UB-sample withoutUB-clusters

04-withoutUB-DoubletEnrichment-Umap-0603.pdf

5-withoutUB-Plot-UMAP-Marker-Genes-WO-Imputation.pdf

rcorces commented 4 years ago

Unfortunately, I do not think this is an ArchR issue. This looks like a problem with your data. Since you had trouble with doublet identification, I'm concerned that this is contributing to your problems. I previously asked for a UMAP colored by doublet enrichment, which I would still encourage you to look at.

We can't really provide solutions for individual users' applications, and there's no way for us to know what the mystery clusters are. The advice for identifying clusters or manually defined cell groups is to use marker features or marker gene scores. I'm closing this issue for now. Sorry we could not be of more help.