XingyanLiu / CAME

Cell-type Assignment and Module Extraction based on a heterogeneous graph neural network.
https://XingyanLiu.github.io/CAME
MIT License
12 stars 6 forks source link

Error when 'pipeline.preprocess_unaligned' #21

Closed LalicJ closed 1 year ago

LalicJ commented 1 year ago

Hi, I am trying to ran pipeline with my own data. I got an error, which is below. Here is the code I ran:

came_inputs, (adata1, adata2) = pipeline.preprocess_unaligned(
    adatas,
    key_class=key_class1,
    use_scnets=use_scnets,
    ntop_deg=ntop_deg,
    ntop_deg_nodes=ntop_deg_nodes,
    node_source=node_source,
)

This is the error I got: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/came-0.1.9-py3.9.egg/came/pipeline.py", line 938, in preprocess_unaligned adata1 = pp.quick_preprocess(adatas[0], **params_preproc) File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/came-0.1.9-py3.9.egg/came/utils/preprocess.py", line 1923, in quick_preprocess sc.pp.highly_variable_genes( File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 431, in highly_variable_genes df = _highly_variable_genes_single_batch( File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 215, in _highly_variable_genes_single_batch df['mean_bin'] = pd.cut(df['means'], bins=n_bins) File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/pandas/core/reshape/tile.py", line 287, in cut fac, bins = _bins_to_cuts( File "/home/yali/softs/Anaconda3/lib/python3.9/site-packages/pandas/core/reshape/tile.py", line 413, in _bins_to_cuts raise ValueError( ValueError: Bin edges must be unique: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]). You can drop duplicate edges by setting the 'duplicates' kwarg

The h5ad files were converted from seuratobject by SeuratDisk.

Please let me know any advice on how to resolve this. Thank you!

XingyanLiu commented 1 year ago

The problem is caused by "the h5ad files converted from seurat-object by SeuratDisk". CAME process the data from the raw-count matrices. So please use scanpy to construct the AnnData object from the raw-count matrices (e.g., the .mtx and .txt files)

XingyanLiu commented 1 year ago

You can also use the following R code to export the filtered scRNA-seq data into an .h5 file, which takes less time and space.

# R code
library(rhdf5)
library(Matrix)

save_h5mat = function(mat, fp_h5, feature_type, genome=""){
# =============== Test code ===============
# tmp = Seurat::Read10X_h5(fp_h5)
# all(tmp@x == mat@x)
# all(tmp@i == mat@i)
# all(tmp@p == mat@p)
#
# save sparse.mat ('dgCMatrix' format) into a h5 file
message(fp_h5)

h5createFile(fp_h5)
root = "matrix"
h5createGroup(fp_h5, root)

h5write(dim(mat), fp_h5, paste(root, "shape", sep='/'))
h5write(mat@x, fp_h5, paste(root, "data", sep='/'))
h5write(mat@i, fp_h5, paste(root, "indices", sep='/'))  # mat@i - 1 ?
h5write(mat@p, fp_h5, paste(root, "indptr", sep='/'))
h5write(colnames(mat), fp_h5, paste(root, "barcodes", sep='/'))

feat_root = paste(root, "features", sep='/')
h5createGroup(fp_h5, feat_root)

h5write(rownames(mat), fp_h5, paste(feat_root, "id", sep='/'))
h5write(rownames(mat), fp_h5, paste(feat_root, "name", sep='/'))

h5write(rep(feature_type, dim(mat)[1]),
      fp_h5, paste(feat_root, "feature_type", sep='/'))

h5write(rep("", dim(mat)[1]),
      fp_h5, paste(feat_root, "derivation", sep='/'))
h5write(rep(genome, dim(mat)[1]),  # "mm10"
      fp_h5, paste(feat_root, "genome", sep='/'))
h5write(c("genome", "derivation"),
      fp_h5, paste(feat_root, "_all_tag_keys", sep='/'))

h5closeAll()
message("Done!")
}

# save_h5mat_peak = function(mat, fp_h5, genome=""){
#   save_h5mat(mat, fp_h5, feature_type = "Peaks", genome = genome)
# }

save_h5mat_gex = function(mat, fp_h5, genome=""){
save_h5mat(mat, fp_h5, feature_type = "Gene Expression", genome = genome)
}

# save the raw-counts in a Seurat-object "seurat_obj":
mat = seurat_obj[["RNA"]]@counts
save_h5mat_gex(mat, "matrix.raw.h5", genome="")

# save the meta-data into a csv file:
meta_data = seurat_obj@meta.data
write.csv(meta_data, "metadata.csv")

And read the h5 file using Scanpy's build-in function:

# python-code
import pandas as pd
import scanpy as sc
fp_mat = 'matrix.raw.h5'
fp_meta = 'metadata.csv'
adata_raw = sc.read_10x_h5(fp_mat)
metadata = pd.read_csv(fp_meta, index_col=0)
for c in metadata.columns:
    adata_raw.obs[c] = metadata[c]

Hope it helps!

LalicJ commented 1 year ago

Thanks! That solved my problem!