caokai1073 / uniPort

a unified single-cell data integration framework by optimal transport
MIT License
30 stars 3 forks source link

Can't reproduce result of harmony #3

Open yuxiaokang-source opened 1 year ago

yuxiaokang-source commented 1 year ago

It a wonderful computational tool!! When I test the PBMC pair dataset(https://uniport.readthedocs.io/en/latest/examples/PBMC/pbmc_integration.html) with harmony, I can't reproduce result displayed in your paper.

I run code with

rm(list=ls())
library(Seurat)
rm(list=ls())
suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(ggplot2)
  library(Seurat)
  library(SeuratWrappers)
  library(patchwork)
  library(cowplot)
  library(harmony)
})

options(repr.plot.width = 12, repr.plot.height = 5)

data_seurat = readRDS("../../dataset/PBMC_pair/PBMC_pair_raw.rds")

print("===================Normalize SeuratObject=======")
data_seurat <- NormalizeData(data_seurat, verbose = FALSE)
print("===================Find HVG=====================")
data_seurat <- FindVariableFeatures(data_seurat, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
print("===================scale Data===================")
data_seurat <- ScaleData(data_seurat, verbose = FALSE)
print("===================Running PCA===================")
data_seurat <- RunPCA(data_seurat, npcs = 30, verbose = F)

data_seurat <- data_seurat %>% RunHarmony("domain_id", plot_convergence = F,max.iter.harmony=50)

print("========================Running UMAP==================")
data_seurat <- RunUMAP(data_seurat, reduction = "harmony", dims = 1:30, verbose = F)
print("========================Visulize UMAP=================")
p1=DimPlot(data_seurat, reduction = "umap", group.by = "domain_id", label.size = 10)+ggtitle("Integrated Batch")
p2=DimPlot(data_seurat, reduction = "umap", group.by = "cell_type",label.size = 10)+ggtitle("Integrated Celltype")
p= p1 + p2 
print(p)

I get the result

image

It seems that the integration of harmony is very bad, which is different with the result

image

I also run harmony with python, whose result is very similar to harmony's result in R,

import scanpy.external as sce
import scanpy as sc
import pandas as pd 
labels = pd.read_csv('../UniPort/uniPort-main/PBMC/meta.txt', sep='\t')
celltype = labels['cluster'].values
print(labels.shape)
print(celltype.shape)

adata_rna = sc.read('../UniPort/uniPort-main/PBMC/rna.h5ad')
adata_atac = sc.read('../UniPort/uniPort-main/PBMC/atac_meastro.h5ad')
print(adata_rna)
print(adata_atac)

adata_atac.obs['cell_type'] = celltype
adata_atac.obs['domain_id'] = 0
adata_atac.obs['domain_id'] = adata_atac.obs['domain_id'].astype('category')
adata_atac.obs['source'] = 'ATAC'

adata_rna.obs['cell_type'] = celltype
adata_rna.obs['domain_id'] = 1
adata_rna.obs['domain_id'] = adata_rna.obs['domain_id'].astype('category')
adata_rna.obs['source'] = 'RNA'

adata = adata_atac.concatenate(adata_rna, join='inner', batch_key='domain_id')
adata.obs["celltype"]=adata.obs["cell_type"].copy()
adata.obs["BATCH"] = adata.obs["domain_id"].copy()

sc.pp.normalize_total(adata,target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, inplace=False, subset=True)
sc.pp.scale(adata)
sc.tl.pca(adata)
# sc.pp.neighbors(adata)
# sc.tl.umap(adata)
# sc.pl.umap(adata,color=["BATCH","celltype"],ncols=1)
# print(adata)
sce.pp.harmony_integrate(adata, 'BATCH')
sc.pp.neighbors(adata,use_rep="X_pca_harmony")
#sc.tl.louvain(adata,resolution=3.0)
sc.tl.umap(adata)
sc.pl.umap(adata,color=['BATCH','celltype'],ncols=1)

result is below

image

I felt confused with this problem, could you give me some suggestions?

yupines commented 1 year ago

Hi, thanks for your interest. It seems that you didn't use the common var.genes before using Harmony. You can refer to the PBMC benchmarking codes in R process (https://github.com/caokai1073/uniPort/blob/main/R%20process/ATAC/PBMC_Signac_ATAC.R). Thank you!