STOmics / SAW

GNU General Public License v3.0

UMAP cellbin report informations #143

Open maximelepetit opened 3 months ago

maximelepetit commented 3 months ago

Hello,

1) I'd like to have a little more clarity on the QCs and parameters used to generate the cellbin UMAP on the SAW report.

I know Stereopy is used, but I would like to know the parameter values used for QC (min/max counts, max percent.mt, min/max features), the number of principal components used for the neighborhood graph and the UMAP embedding, and the resolution used for clustering.

In the report, a resolution of 1 is used for square bin 200, but in the cellCluster step the resolution parameter is not passed to the program. Why not? Here is an illustration:

spatialCluster:

# Run SAW spatialCluster
binSize=200
resolution=1.0
/usr/bin/time -v singularity exec ${sif} spatialCluster \
    -i ${outDir}/04.tissuecut/${SN}.tissue.gef \
    -s ${binSize} \
    -r ${resolution} \
    -o ${outDir}/05.spatialcluster/${SN}.bin${binSize}_${resolution}.spatial.cluster.h5ad

cellCluster:

/usr/bin/time -v singularity exec ${sif} cellCluster \
    -i ${outDir}/041.cellcut/${SN}.adjusted.cellbin.gef \
    -o ${outDir}/051.cellcluster/${SN}.adjusted.cell.cluster.h5ad

/usr/bin/time -v singularity exec ${sif} cellCluster \
    -i ${outDir}/041.cellcut/${SN}.cellbin.gef \
    -o ${outDir}/051.cellcluster/${SN}.cell.cluster.h5ad

2) Another remark/question: inside the Singularity image, I only have access to the compiled file cell_cluster.pyc. How can I access the cell_cluster.py source file?

Thanks in advance!

Best,

Maxime

Clouate commented 3 months ago

Hi, the STOmics team is still developing and improving cellbin analysis, and cellbin results depend on the quality of the data and image, so the previous pipeline provided only one standard result. In our latest release, SAW v8, the cellCluster program already supports a resolution parameter as input.

1) The parameters you asked about are listed below. We also recommend referring to the parameter descriptions in Stereopy's documentation.

filter_cells: min_gene=1 other defaults
highly_variable_genes: min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=3000
pca: n_pcs=20
neighbors: n_pcs=30

2) We're sorry, but the code is distributed in .pyc format because our source code is confidential; we hope for your understanding. cell_cluster follows the cellbin tutorial in Stereopy's documentation up to the clustering step, and writes the result files with Stereopy's functions st.io.update_gef and st.io.stereo_to_anndata. In addition, because of differences in Stereopy version and random seed settings, the clustering results may differ very slightly.

maximelepetit commented 2 months ago

Hi, thanks for the reply.

Regarding your answer, I have some questions:

  1. In "filter_cells: min_gene=1 other defaults", what do you mean by "other defaults"?

    • Does it mean that min_counts, max_counts, max_genes and pct_counts_mt were set to None, as described in the Stereopy API? StPipeline.filter_cells(min_counts=None, max_counts=None, min_genes=None, max_genes=None, pct_counts_mt=None, cell_list=None, filter_raw=True, excluded=False, inplace=True, **kwargs)
    • Or does it mean that min_counts=200, max_genes=2500, pct_counts_mt=5, as described in the cellbin tutorial?
  2. In "pca: n_pcs=20, neighbors: n_pcs=30", shouldn't it rather be "pca: n_pcs=20, neighbors: n_pcs=20"? If you run PCA with 20 PCs, only 20 PCs are available for neighbors.

Maxime
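The two readings in question 1 can be contrasted with a small sketch. The helper `filter_cells_sketch` below is hypothetical and is not Stereopy's implementation: it assumes that a threshold left as None is simply not applied and that comparisons are inclusive, and it leaves out pct_counts_mt for brevity.

```python
import numpy as np

def filter_cells_sketch(counts, min_counts=None, max_counts=None,
                        min_genes=None, max_genes=None):
    """Hypothetical helper (not Stereopy code): boolean keep-mask over cells.

    Rows of `counts` are cells, columns are genes. A threshold that is None
    is skipped; inclusive comparisons are an assumption of this sketch.
    """
    total_counts = counts.sum(axis=1)      # counts per cell
    n_genes = (counts > 0).sum(axis=1)     # detected genes per cell
    keep = np.ones(counts.shape[0], dtype=bool)
    if min_counts is not None:
        keep &= total_counts >= min_counts
    if max_counts is not None:
        keep &= total_counts <= max_counts
    if min_genes is not None:
        keep &= n_genes >= min_genes
    if max_genes is not None:
        keep &= n_genes <= max_genes
    return keep

# Toy matrix: 3 cells x 4 genes
counts = np.array([
    [0, 0, 0, 0],      # empty cell
    [1, 0, 0, 0],      # 1 gene, 1 count
    [150, 60, 5, 0],   # 3 genes, 215 counts
])

# Reading 1: only min_genes=1 is set, everything else stays None
only_min_genes = filter_cells_sketch(counts, min_genes=1)

# Reading 2: tutorial-style thresholds on top of min_genes=1
tutorial_like = filter_cells_sketch(counts, min_counts=200, min_genes=1, max_genes=2500)

print(only_min_genes.tolist())
print(tutorial_like.tolist())
```

Under the first reading only the empty cell is dropped, whereas the tutorial-style thresholds also remove the low-count cell, so the two readings can lead to different inputs for PCA and UMAP.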

Clouate commented 2 months ago

Hi, thanks for your correction. To answer your questions: 1) "Defaults" means the default values set in the Stereopy API. 2) We're sorry, this is a bug in one of the previous versions. However, because of how NumPy arrays behave, selecting 30 PCs gives the same result as selecting 20 PCs for neighbors.
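The NumPy behavior referred to in point 2 can be checked with a toy array. This is only a sketch: it assumes the PCA result is a plain cells x PCs matrix and that the first n_pcs columns are taken with a slice; how Stereopy's neighbors step actually selects the components is not shown in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)
pca_scores = rng.normal(size=(5, 20))   # toy PCA result: 5 cells x 20 PCs

# Slicing past the end of an axis is clamped to the axis length,
# so asking for 30 columns returns the 20 that exist.
first_30 = pca_scores[:, :30]
first_20 = pca_scores[:, :20]

print(first_30.shape)                        # (5, 20)
print(np.array_equal(first_30, first_20))    # True
```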

maximelepetit commented 2 months ago

Thanks! One last question regarding gene filtering: do you filter genes based on the number of cells or counts? If so, what parameter values are used?

Clouate commented 2 months ago

Thanks! One last question regarding gene filtering: do you filter genes based on the number of cells or counts? If so, what parameter values are used?

Do you mean the function StPipeline.filter_genes? No, we don't run this step in the SAW pipeline.

maximelepetit commented 2 months ago

Thanks. Following your suggestions, I can't get the same UMAP.

Here is the UMAP from the SAW report:

[Image: umap_lung_saw_report]

Here is the code I used, followed by the UMAP I obtained:

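import stereo as st

# Load the adjusted cellbin GEF as cell bins
data_path = './041.cellcut/A02989D6.adjusted.cellbin.gef'
data = st.io.read_gef(file_path=data_path, bin_type='cell_bins')
# QC: keep cells with at least one detected gene (other thresholds left at their defaults)
data.tl.filter_cells(min_genes=1, inplace=True)
# Save the raw counts, then normalize and log-transform
data.tl.raw_checkpoint()
data.tl.normalize_total()
data.tl.log1p()
# Highly variable genes, scaling, and PCA with 20 components
data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=3000, res_key='highly_variable_genes')
data.tl.scale(max_value=10)
data.tl.pca(use_highly_genes=True, res_key='pca', n_pcs=20)
# Neighborhood graph (n_pcs=30, as reported for the SAW pipeline) and UMAP
data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors')
data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')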

[Image: umap_lung_try]

Did I miss something? My Stereopy version is 1.3.1.

Clouate commented 2 months ago

Did I miss something? My Stereopy version is 1.3.1.

I think the version of SAW you are using is 7.1? If so, the version of Stereopy in SAW v7.1.2 is 0.14.0b1 (for SAW v7.0, it's 0.12.1). Stereopy's version updates include changes to the UMAP functions, such as the addition of thread and seed settings in st.tl.umap, whose default has been changed to single-threaded execution, sacrificing computational efficiency to ensure reproducible results.
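Since the comparison hinges on which Stereopy version is installed, a quick check with the standard library may help before comparing UMAPs. This is a minimal sketch; `installed_version` is an illustrative helper name, and the distribution name "stereopy" is assumed to match the name the package was installed under.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Prints the local Stereopy version, or None if it is not installed
print("stereopy:", installed_version("stereopy"))
```

Running the same check inside the SAW container would show which version produced the report, assuming a Python interpreter is reachable there.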

maximelepetit commented 2 months ago

Yes, I used SAW version 7.1. I'll update it later! Thanks.