caokai1073 / uniPort

a unified single-cell data integration framework by optimal transport
MIT License
30 stars 3 forks source link

Impute Merfish up to 2000 genes #9

Open mortunco opened 1 year ago

mortunco commented 1 year ago

Hello,

Thank you very much for being super responsive to all github issues.

I successfully followed the Impute genes for MERFISH tutorial. In the tutorial we have 155 genes and use 122 for training and impute 33 to test.

scRNAseq data comes with 21043 genes. I was wondering if uniPort could impute the remaining genes of MERFISH data up to 21043? Doesnt have to be 21043 but lets say 2000.

OR

Is there a way to match cells in the MERFISH data with scRNAseq so that those cells in scRNASeq will be assigned with a spatial coordinate information.

Basically, I would like to map each MERFISH cell to scRNAseq so that I over come the limited gene number problem of MERFISH OR spatial information to scRNAseq data.

Thank you very much for your time and patience,

Best,

T.

caokai1073 commented 1 year ago

Hi, thank you for your valuable question. Of course you can impute the remaining genes, just as follows: train:

  1. encode and align your common Merfish and RNA genes in a latent space
  2. decode the common genes and RNA-specific genes through the latent space test: encode your MERFISH data and decoder it through the RNA-specific decoder Screenshot 2023-07-03 at 11 44 59

You can take the code as reference:

adata = up.Run(adatas=[adata_merfish, adata_rna], adata_cm=adata_cm) merfish_predict = up.Run(adata_cm=adata_merfish, out='predict', pred_id=1)

where 'adata_merfish' contains merfish and RNA common genes, and 'adata_rna' contains both common and rna-specific genes (you are interested).

caokai1073 commented 1 year ago

For your second question: Is there a way to match cells in the MERFISH data with scRNAseq so that those cells in scRNASeq will be assigned with a spatial coordinate information?

If you have spatial coordinate information from Merfish data, I think you can find Merfish neighbors for RNA data in the aligned latent space. Then, you can assign coordinate for RNA through its Merfish neighbors.

mortunco commented 1 year ago

Dear @caokai1073 . Thank you very much for your reply.

For now lets forget about the second question.

I am little bit confused about the code you shared and the general impute gene tutorial. (forgive my ignorance if its super obvious).

Below is my code to untile the up.Run steps.

#Load SC and filter
sc_adata= load_10x_sc_to_anndata()
sc.pp.filter_cells(sc_adata,min_genes=200)
sc.pp.filter_genes(sc_adata,min_cells=50)
sc_adata.var['mt'] = sc_adata.var_names.str.startswith('MT-') 
sc.pp.calculate_qc_metrics(sc_adata, qc_vars=['mt'],percent_top=None, log1p=True, inplace=True)
sc_adata=sc_adata[(sc_adata.obs.pct_counts_mt < 40) & (sc_adata.obs.log1p_total_counts >8.5) & (sc_adata.obs.log1p_total_counts >7),:]

# Load Xenium/Merfish and Filter
xen_adata = load_10x_xenium_to_anndata()
sc.pp.calculate_qc_metrics(xen_adata, percent_top=None, log1p=True, inplace=True)
sc.pp.filter_cells(xen_adata,min_genes=5)
sc.pp.filter_genes(xen_adata,min_cells=1600)
xen_adata = xen_adata[(xen_adata.obs.log1p_total_counts > 4.5) & (xen_adata.obs.n_genes_by_counts > 3), :]

# Got xenium specific genes that are not in SC 
xen_adata=xen_adata[:,[i for i in xen_adata.var.index if i in sc_adata.var.index]] ### take only common genes with sc data

### Split aside 80% for train, remaining for test 
random.seed(31)
train_percent=80
train_genes=[i for i in random.sample(xen_adata.var.index.to_list(),round(len(xen_adata.var.index.to_list())*train_percent/100))]
test_genes=[]
for i in xen_adata.var.index.to_list():
    if (i not in train_genes) and (i in sc_adata.var.index.to_list()):
        test_genes.append(i)

xen_adata.obs['domain_id'] = "zero"
xen_adata.obs['source'] = 'XENIUM'

xen_adata_train=xen_adata[:,train_genes]
xen_adata_test=xen_adata[:,test_genes]

sc_adata.obs['domain_id'] = "one"
sc_adata.obs['source'] = 'RNA'

### Great concat Adata object. This includes RNA + Xenium genes. 
adata_cm = xen_adata_train.concatenate(sc_adata, join='outer', batch_key='domain_id')

### normalize and log1p
sc.pp.normalize_total(adata_cm)
sc.pp.log1p(adata_cm)
up.batch_scale(adata_cm)
print(adata_cm)

sc.pp.normalize_total(xen_adata_train)
sc.pp.log1p(xen_adata_train)
up.batch_scale(xen_adata_train)
print(xen_adata_train)

sc.pp.normalize_total(sc_adata)
sc.pp.log1p(sc_adata)
up.batch_scale(sc_adata)
print(sc_adata)

In tutorial, https://uniport.readthedocs.io/en/latest/examples/MERFISH/MERFISH_impute.html,

Integration is done with

adata = up.Run(adatas=adatas, adata_cm=adata_cm, lambda_kl=5.0, model_info=True)

And prediction is done with

adata_predict = up.Run(adata_cm=spatial_data_partial, out='predict', pred_id=1)
print(np.shape(adata_predict.obsm['predict']))

I am little bit confused as we are creating this variable adata. But then dont use it for prediction. I am asking this because in your previous (comment), we are creating the adata again and predicting merfish. Is PyTorch somehow creating the model and reading it without we are giving it as input or is there a typo in the tutorial.

adata = up.Run(adatas=[adata_merfish, adata_rna], adata_cm=adata_cm)
merfish_predict = up.Run(adata_cm=adata_merfish, out='predict', pred_id=1)

Thank you very much for your help and time.

Best regards,

Tunc.

caokai1073 commented 1 year ago

Hi, the goal of 'adata = up.Run(adatas=[adata_merfish, adata_rna], adata_cm=adata_cm)' is to train the model parameters from initial state. Here we store the latent representations in 'adata' for downstream analysis, which has no connection with the imputation task. When we use out='predict', the model will not only encode the input into a latent space, but also decode the latent representations with decoder of modality 1 (which is rna here because we set 'adatas=[adata_merfish, adata_rna]'. That means merfish is modality 0 and rna is modality 1).