MAJOR BUG: reanalyze diffExp may discard significantly differentially expressed genes in the csv output

STOmics / SAW

GNU General Public License v3.0

145 stars 34 forks source link

MAJOR BUG: reanalyze diffExp may discard significantly differentially expressed genes in the csv output #164

Open melop opened 1 week ago

melop commented 1 week ago

Hello, after running saw reanalyze diffExp, I only saw 3000 genes in the output file *.find_marker_genes.csv. Is there a way to make it output all genes even if no differential expression is found? This would be important for some downstream analyses like GO enrichment etc. Thanks!

Clouate commented 1 week ago

Hi, we're sorry that there is no such parameter at present.. Because highly variable genes(3000 by default) will be selected before differential analysis, if you want to obtain information about all genes, you could read the h5ad file in the directory after diffExp analysis. For example, uns['DE_1_marker_genes']['mean_count'] in the h5ad stores the average expression of all genes in different regions.

melop commented 1 week ago

3000 appears to be somewhat arbitrary? If I have used the lasso tool to delineate 5 regions, how to obtain the gene counts for each region. In particular, is there a documentation on the data structure of the h5ad file? Thanks!

Clouate commented 1 week ago

Hi, we're sorry that there is no such parameter at present.. Because highly variable genes(3000 by default) will be selected before differential analysis, if you want to obtain information about all genes, you could read the h5ad file in the directory after diffExp analysis. For example, uns['DE_1_marker_genes']['mean_count'] in the h5ad stores the average expression of all genes in different regions.

Hi, Here is the code to get the gene count for each region

import scanpy as sc
data=sc.read_h5ad('area_diffexp/SN.bin100_1.0.h5ad')
data.uns['DE_1_marker_genes']['mean_count']

It will output:

Clouate commented 1 week ago

Hi, 3000 is just the default value in SAW. According to some spatial omics analysis software such as scanpy and actual test data, 3000 highly variable genes could be used for downstream feature analysis and reduce computing pressure. At the same time, if the user want to adjust the analysis parameters, we encourage user to use Stereopy for personalized analysis. https://stereopy.readthedocs.io/en/latest/index.html

If you want to obtain the gene expression levels in different regions, you could export lasso.geojson in StereoMap, and then use saw reanalyze lasso to obtain the expression matrix in each regions.

h5ad is the Anndata format, which records the gene expression, spatial coordinates, clustering results and other information of all spots. You could refer to the following URL https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html

melop commented 1 week ago

Ok thanks!

melop commented 1 week ago

A closer inspection reveals a more severe issue with the reanalyze diffExp function. Seurat analysis revealed a highly significant differential expression of a gene (FDR = 2.636392e-150) in the target tissue region. Exporting the mean counts from the h5ad file also clearly showed an upregulation of this gene:

AKA021889 0.659302325581397 3.7833836858005907 0.2049941927990706 0.4573110893032378 0.18079096045197815 0.5580693815987918 0.5719725818735707

Inspecting this gene in Stereomap also clearly showed a clear upregulation in the target region.

However, this gene is completely missing from the *.find_marker_genes.csv file.

This suggests that the SAW reanalyze diffExp pipeline may contain a serious bug, which would result in omitting significant results. I suggest submitting this bug for fixing.

Clouate commented 1 week ago

@melop Would you mind providing the tissue.gef and geojson files for us to troubleshoot?

melop commented 1 week ago

I sent these files to your email, please have a look, thanks!

Clouate commented 1 week ago

Hi, the SAW pipeline defaults to differential analysis based on 3,000 highly variable genes. These genes are not highly variable genes and therefore are not included in the differential expression gene analysis. This is a shortcoming of the SAW pipeline. We will discuss and correct it as soon as possible. Users could use Stereopy to adjust the parameters for DEG analysis. Thank you for your suggestion and sorry for the inconvenience caused to you.