STOmics / SAW

GNU General Public License v3.0
132 stars 34 forks source link

How to process GEM file #142

Open sli231 opened 1 month ago

sli231 commented 1 month ago

Hello, I followed your tutorial and processed mouse olfactory bulb data using SAW. Afterward, I read the GEM file (Mouse_olfa_S1.tissue.gem) from 04.tissuecut as a TSV file using pandas. I manually binned the data according to different resolutions based on the x and y coordinates. When visualizing several marker genes (e.g., Pcp4, Mobp), their spatial expression appears incorrect and even completely random. However, when I read the h5ad file (bin200) from 05.spatialcluster and visualize it, the results are correct. What might be the issue? Thank you!

Clouate commented 1 month ago

@sli231 Hi, since I haven't seen your complete code, based on the previous similar issues, I could provide the following aspects for your reference. 1) When reading a gem, the coordinates of x and y need to add the offset X/Y recorded in the gem header. 2) What is recorded in the gem is the original expression of the gene. It may be expressed at many spots, but in some areas with significant expression, so the pattern may become obvious after normalization and other processing, such as the processed expression recorded in h5ad. 3) When binned the data, were the MIDCounts of all spots of the gene in a bin added up?

sli231 commented 1 month ago

Thank you for your answer! When I tried the first and second points you mentioned, the gene pattern was still incorrect. It was completely random, so shifting coordinates or normalization didn't work. Regarding the third point you mentioned, I feel my code should be summing the MIDCounts within the same bin. I'm not sure if there's an error in my code. Do you know of a more professional approach to process this file?

Here is my previous code for your reference: import scanpy as sc import numpy as np import scipy.sparse as sp import anndata as ad import pandas as pd data = pd.read_csv("Mouse_olfa_S1.tissue.gem", sep="\t", comment='#')

bin data

bin_size=200 data['x'] = (data['x'] // bin_size) bin_size data['y'] = (data['y'] // bin_size) bin_size result = data.groupby(['geneID', 'x', 'y'], asindex=False).agg({'MIDCount': 'sum', 'ExonCount': 'sum'}) result['spotID'] = result.groupby(['x', 'y']).ngroup() + 1 result['spotID'] = 'spot' + result['spotID'].astype(str)

transform to anndata

pivot_table = result.pivot_table(index='spotID', columns='geneID', values='MIDCount', aggfunc=np.sum, fill_value=0) sparse_matrix = sp.csr_matrix(pivot_table.values) xy_data = result.drop_duplicates(subset=['spotID'])[['spotID', 'x', 'y']].set_index('spotID') adata = ad.AnnData(X=sparse_matrix, obs=xy_data, var=pivot_table.columns.to_frame()) adata.obsm["spatial"]=adata.obs[["x","y"]].to_numpy()

visualization

sc.set_figure_params(figsize=[3,4]) sc.pl.embedding(adata, basis="spatial", color=["Pcp4","Mobp"])

Clouate commented 1 month ago

@sli231 Hi, in your code, result.pivot_table automatically reorders the index, but result.drop_duplicates doesn't sort the index. Therefore, the order of spatial coordinates in adata.obsm["spatial"] is no longer consistent with the matrix index order. To ensure the order is consistent, you could try the code result.drop_duplicates...set_index('spotID').sort_index() before transforming to anndata. Alternatively, you could also consider using other approaches to ensure the index order remains consistent.

Besides, STOmics team provides a comprehensive tool called Stereopy, which is compatible with gem/gef files. You could refer to the following website for more details. https://stereopy.readthedocs.io/en/latest/Tutorials/SquareBin_Clustering.html https://stereopy.readthedocs.io/en/latest/content/stereo.io.read_gem.html

sli231 commented 1 month ago

Thank you very much for your guidance. I'll continue studying how to use Stereopy. Additionally, I have a few small questions. I noticed that the GEM file processed by SAW contains both MIDCount and ExonCount columns. Is MIDCount generally used for downstream analysis? Is ExonCount related to spliced mRNA? Does the result from SAW include unspliced count and spliced count?

Clouate commented 1 month ago

Thank you very much for your guidance. I'll continue studying how to use Stereopy. Additionally, I have a few small questions. I noticed that the GEM file processed by SAW contains both MIDCount and ExonCount columns. Is MIDCount generally used for downstream analysis? Is ExonCount related to spliced mRNA? Does the result from SAW include unspliced count and spliced count?

Yes, MIDCount is generally used for downstream analysis as the expression level of a gene. For a certain read, if 50% of the read length could be mapped to a gene's exon region, it would be included in the ExonCount and the MIDCount. If mapped to the intron region, it would also be included in the MIDCount. Therefore, spliced/unspliced ​​data could be calculated based on gem/gef files. The generate_loom function in Stereopy could achieve this.

sli231 commented 4 weeks ago

Thank you again. I will study Stereopy carefully. I'll consult you if I have any questions.