Starlitnightly / omicverse

A Python library for multi-omics analysis, including bulk, single-cell, and spatial RNA-seq.
https://starlitnightly.github.io/omicverse/
GNU General Public License v3.0

Exponential Increase in Data Size After Single-cell Data Processing with OV #37

Closed: lisch7 closed this issue 7 months ago

lisch7 commented 7 months ago

Hello,

I am encountering a significant issue with data size inflation after processing single-cell data using OV. Here's a detailed description of the problem:

  1. Initial Filtering: After the initial filtering step, the .h5ad file is approximately 31.35 GB.
  2. Preprocessing, Dimension Reduction, and Clustering: Following preprocessing, dimensionality reduction, and clustering, the file size increases to 83.49 GB.
  3. Cell Marker Identification and Contamination Removal: After using COSG for cell marker identification and removing contaminated cells, the file size grows to 92.67 GB.
  4. Redoing Dimension Reduction and Clustering: After redoing dimension reduction and clustering on this data, the size escalates dramatically to 261.60 GB.

Could you please help me understand why there is such a drastic increase in file size at each step, especially after the last step of dimension reduction and clustering? Is this expected behavior, or might there be some inefficiencies or errors in my processing pipeline?
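
A quick way to see which top-level group of the .h5ad file dominates the size is to sum the storage of each group with h5py. Below is a minimal sketch of that check; the short file name and the helper `h5ad_group_sizes` are placeholders, not part of the pipeline above, and it assumes a standard .h5ad layout:

```python
# Sketch: report how much disk space each top-level .h5ad group
# (X, raw, layers, obsm, uns, ...) occupies, using only h5py.
import h5py

def h5ad_group_sizes(path):
    sizes = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            # accumulate dataset storage under its top-level group name
            if isinstance(obj, h5py.Dataset):
                top = name.split("/")[0]
                sizes[top] = sizes.get(top, 0) + obj.id.get_storage_size()
        f.visititems(visit)
    return sizes

# "delect1st.h5ad" stands in for the full path used later in this report
for group, nbytes in sorted(h5ad_group_sizes("delect1st.h5ad").items(),
                            key=lambda kv: -kv[1]):
    print(f"{group:10s} {nbytes / 1e9:8.2f} GB")
```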

Here is the workflow I used to redo the dimension reduction and clustering:

```python
## Step02
# Recover raw counts and drop the contaminated clusters
adata_counts = adata.raw.to_adata().copy()
ov.utils.retrieve_layers(adata_counts, layers='counts')
clusters_to_remove = ['10','11','21','28','30','31','33','37','40','42','45','46','48','51','53']
adata_counts = adata_counts[~adata_counts.obs['leiden_res1.0'].isin(clusters_to_remove)].copy()
adata_counts.write_h5ad("/data/data_mailab003/project/scrna-npc-22-tl/Downstream/integrate/process_data/delect1st.h5ad")
## adata_counts now is around 92.67 GB
```

## Step03
adata = sc.read_h5ad("/data/data_mailab003/project/scrna-npc-22-tl/Downstream/integrate/process_data/delect1st.h5ad")
ov.utils.store_layers(adata, layers='counts')
del adata.uns["log1p"]
adata = ov.pp.preprocess(adata, mode="shiftlog|pearson", target_sum=10000, n_HVGs=2000,batch_key="study")
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
ov.pp.scale(adata)
ov.pp.pca(adata, layer='scaled', n_pcs=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='scaled|original|X_pca',key_added="neighbors_original")
sc.tl.umap(adata,neighbors_key="neighbors_original")
adata.obsm['X_umap_original']=adata.obsm['X_umap']
ov.utils.cluster(adata,method="leiden",key_added="leiden_res1.0", neighbors_key="neighbors_original",resolution=1)

## now the adata is around 261.6GB
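
To narrow down which in-memory slots hold duplicated matrices after this workflow, a rough per-slot size report can help. This is a minimal sketch, assuming sparse CSR/CSC matrices; `nbytes_of` is a placeholder helper, not an omicverse function, and `adata` is the object produced by the code above:

```python
# Sketch: rough in-memory size of each AnnData slot, to spot duplicated
# count matrices in X, raw, layers, obsm and uns.
import numpy as np
import scipy.sparse as sp

def nbytes_of(x):
    """Rough size in bytes of a dense array or sparse matrix (placeholder helper)."""
    if sp.issparse(x):
        return sum(getattr(x, a).nbytes for a in ("data", "indices", "indptr")
                   if hasattr(x, a))
    if isinstance(x, np.ndarray):
        return x.nbytes
    return 0

report = {
    "X": nbytes_of(adata.X),
    "raw.X": nbytes_of(adata.raw.X) if adata.raw is not None else 0,
    **{f"layers[{k!r}]": nbytes_of(v) for k, v in adata.layers.items()},
    **{f"obsm[{k!r}]": nbytes_of(v) for k, v in adata.obsm.items()},
}
# adata.uns can hold arbitrary objects, including whole AnnData copies
for k, v in adata.uns.items():
    if hasattr(v, "X"):  # looks like an embedded AnnData
        report[f"uns[{k!r}].X"] = nbytes_of(v.X)

for name, nbytes in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {nbytes / 1e9:8.2f} GB")
```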

Any insights or suggestions to manage or reduce this data inflation would be greatly appreciated.

Thank you!

Starlitnightly commented 7 months ago

Hi,

The method ov.utils.store_layers(adata, layers='counts') stores an entire copy of the raw AnnData in adata.uns, which is why the object grows so much at that step.
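
Here is a minimal sketch of one way to locate and drop that stored copy before writing. This is not an omicverse API call; the exact key and layout of the stored object under adata.uns depend on the omicverse version, so the check below is intentionally generic, and the output path is a placeholder:

```python
# Sketch: look for oversized entries in adata.uns (e.g. a full copy of the
# counts matrix) and drop them before writing the object to disk.
import anndata as ad
import numpy as np
import scipy.sparse as sp

def rough_size(x):
    """Very rough size estimate in bytes (placeholder helper)."""
    if isinstance(x, ad.AnnData):
        return rough_size(x.X)
    if sp.issparse(x):
        return x.data.nbytes
    if isinstance(x, np.ndarray):
        return x.nbytes
    return 0

# adata is assumed to be the object from the workflow above
for key in list(adata.uns.keys()):
    size_gb = rough_size(adata.uns[key]) / 1e9
    if size_gb > 1:  # anything over ~1 GB in .uns is suspicious
        print(f"dropping adata.uns[{key!r}] (~{size_gb:.1f} GB)")
        del adata.uns[key]

adata.write_h5ad("adata_slim.h5ad")  # placeholder output path
```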

Sincerely,

Zehua