calico / scBasset

Sequence-based Modeling of single-cell ATAC-seq using Convolutional Neural Networks.
Apache License 2.0
96 stars 13 forks source link

Question about Neighborhood Score #11

Closed fym0503 closed 2 years ago

fym0503 commented 2 years ago

image Hi, Could you please explain how the neigborhood socre is computed? I don't find the results in your code. Could you please point out how to compute these numbers ?

hy395 commented 2 years ago

Hi sorry I think I missed the reminder for this issue. Neighbor score is basically looking at the overlap b/w ATAC and RNA neighbors for each cell. The RNA neighbors is determined by a fixed scVI embedding. The ATAC neighbors is determined by different ATAC embedding methods. You can find more explanation in the manuscript as well.

You can find the code computing neighbor score: https://github.com/calico/scBasset/blob/main/scripts_for_review/multiome_mouse_brain/chromvar/eval.ipynb

fym0503 commented 2 years ago

Thanks for your answering ! Another question is that I try to replicate your result on Buen2018 dataset with your script in examples/batch_correction. But the result ARI is about 0.4, which has a gap with your result in the paper. Do you have some hyperparameter specification in the clustering algorithm?

hy395 commented 2 years ago

Hi!

Could you provide a bit more details in terms of how you reproducible the results?

The clustering performance in Fig.2a of the paper is without batch correction. The batch_correction script is for performing batch correction on the Buen2018 dataset. And as we show in Fig.3, once you perform batch correction, there is a trade-off between batch mixing and preserving biological variability, which means that all methods will suffer in clustering performance after batch correction.

For your question on computing ARI, we follow the same script as described in Chen et al. 2019, which you can find here: https://github.com/pinellolab/scATAC-benchmarking/.

Please let me know if you have additional questions!

fym0503 commented 2 years ago

Yes, Thanks for your patience and time :). I run the script here https://github.com/calico/scBasset/blob/main/examples/batch_correction/buen_batch_correction.ipynb.Training will stop at 1150 epochs because of early stopping. I also observe that after bc performance will degrade. My evaluation script is here, also borrowed from Chen et al. 2019.

def getNClusters(adata,n_cluster,range_min=0,range_max=3,max_steps=20):
    this_step = 0
    this_min = float(range_min)
    this_max = float(range_max)
    while this_step < max_steps:
        this_resolution = this_min + ((this_max-this_min)/2)
        sc.tl.louvain(adata,resolution=this_resolution)
        this_clusters = adata.obs['louvain'].nunique()

        if this_clusters > n_cluster:
            this_max = this_resolution
        elif this_clusters < n_cluster:
            this_min = this_resolution
        else:
            return(this_resolution, adata)
        this_step += 1

    print('Cannot find the number of clusters')
    print('Clustering solution from last iteration is used:' + str(this_clusters) + ' at resolution ' + str(this_resolution))
batch_file = "./examples/batch_correction/batch_m.csv"
import pandas as pd
batch_m = pd.read_csv(batch_file, index_col=0)

model = make_model_bc(32, adata.shape[0], batch_m,show_summary=False)
model.load_weights("./output_bc/best_model.h5")
proj = get_cell_embedding(model, True) # get_cell_embedding function
print(proj.shape)
ad = sc.AnnData(proj)

sc.pp.neighbors(ad, n_neighbors=15,use_rep='X')

getNClusters(ad,n_cluster=10)
louvain = []
for i in range(len(ad.obs['louvain'])):
    louvain.append(int(ad.obs['louvain'][i]))
adata.obs['louvain'] = pd.Series(louvain,index=adata.obs.index).astype('category')
kmeans = KMeans(n_clusters=10, random_state=2019).fit(proj)
adata.obs['kmeans'] = pd.Series(kmeans.labels_,index=adata.obs.index).astype('category')
ari_kmeans = adjusted_rand_score(adata.obs['label'], adata.obs['kmeans'])
admis = adjusted_mutual_info_score(adata.obs['label'], adata.obs['kmeans'],average_method='arithmetic')
print(ari_kmeans)
print(admis)

ari_kmeans = adjusted_rand_score(adata.obs['label'], adata.obs['louvain'])
admis = adjusted_mutual_info_score(adata.obs['label'], adata.obs['louvain'],average_method='arithmetic')
print(ari_kmeans)
print(admis)
fym0503 commented 2 years ago

One more question, as described in your paper "To benchmark our approach, we applied scBasset to three public datasets: scATAC of FACS-sorted hematopoietic differentiation (referred to as Buenrostro2018) with 2,000 cells15, 10x Multiome RNA+ATAC peripheral blood mononuclear cells (PBMCs) with 3,000 cells and 10x Multiome RNA+ATAC mouse brain with 5,000 cells. The first dataset provides ground-truth cell type labels from flow cytometry", how do you compute the ARI for other two datasets?

hy395 commented 2 years ago

Hi,

I looked more closely into the provided example in the tutorial, and I think I may have find out the reason for this. The main reason is that there are two versions of Buen2018 scATAC data from Chen et al. 2019 paper. And the two are different in how the peak atlas is created (one is created using bulk, on created in scATAC). In our analysis, we benchmarked on both versions. In the paper, we showed benchmark on the buen2018 data with peaks called on scATAC. In the tutorial, I was providing an older version of the data (with the bulk peak atlas). So if you are reproducing the buen2018 results with the anndata provided in the tutorial. you may notice that all methods are doing slightly worse.

I've just updated the tutorial, and now if you go to download.ipynb. You should be downloading the buen2018 anndata version that matches the one we used in the paper (which is called on scATAC rather than bulk). Can you run the batch correction code again on the updated anndata? You should see results matching the the performance in the paper (ARI~0.67 for scBasset-BC).

I've also updated the buen_batch_correction.ipynb, so you can find a downloadable trained scBasset-BC model, and the code I used for computing ARI.

As for your second question, for the two multiome datasets. We run scVI + leiden on the RNA to generate cell type clusters, and treat that as groundtruth when benchmarking ATAC.

Hope this helps! Please let me know if you have additional questions!

Best, Han

fym0503 commented 2 years ago

Thanks ! I don't have additional questions, thanks for your answers !