gao-lab / GLUE

Graph-linked unified embedding for single-cell multi-omics data integration
MIT License
382 stars 56 forks source link

scglue.models.fit_SCGLUE Error #80

Closed fe4960 closed 1 year ago

fe4960 commented 1 year ago

Hello,

Thanks a lot for developing this great package. I encountered some error when running scglue.models.fit_SCGLUE. The scglue version is 0.3.2, and I run it in the computational node with GPU.

I wonder if you could help fix this error. Thanks very much.

glue = scglue.models.fit_SCGLUE( ... {"rna": rna, "atac": atac}, guidance_hvf, ... fit_kws={"directory": "glue"} #,init_kws={"random_seed": 0}
... ) [INFO] fit_SCGLUE: Pretraining SCGLUE model... Traceback (most recent call last): File "", line 1, in File "software/anaconda3/envs/scvi-env/lib/python3.9/site-packages/scglue/models/init.py", line 204, in fit_SCGLUE pretrain = model(adatas, sorted(graph.nodes), **pretrain_init_kws) File "software/anaconda3/envs/scvi-env/lib/python3.9/site-packages/scglue/models/scglue.py", line 729, in init if idx[k].min() < 0: File "software/anaconda3/envs/scvi-env/lib/python3.9/site-packages/numpy/core/_methods.py", line 44, in _amin return umr_minimum(a, axis, None, out, keepdims, initial, where) ValueError: zero-size array to reduction operation minimum which has no identity

Here is the information rna and atac objects:

atac AnnData object with n_obs × n_vars = 245940 × 425581 obs: 'orig.ident', 'nCount_ATAC', 'nFeature_ATAC', 'total', 'duplicate', 'chimeric', 'unmapped', 'lowmapq', 'mitochondrial', 'passed_filters', 'cell_id', 'iscell_barcode', 'TSS_fragments', 'DNase_sensitive_region_fragments', 'enhancer_region_fragments', 'promoter_region_fragments', 'on_target_fragments', 'blacklist_region_fragments', 'peak_region_fragments', 'peak_region_cutsites', 'cellname', 'sample', 'Clusters40', 'leiden', 'domain', 'cell_type' var: 'n_cells', 'prop_shared_cells', 'variability_score', 'highly_variable', 'chrom', 'chromStart', 'chromEnd' uns: 'Clusters40_colors', 'leiden', 'neighbors', 'umap', 'scglue' obsm: 'X_lsi', 'X_umap' layers: 'counts' obsp: 'connectivities', 'distances' rna AnnData object with n_obs × n_vars = 192792 × 26829 obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'sample', 'scpred_Astro', 'scpred_BC', 'scpred_Endo', 'scpred_AC', 'scpred_HC', 'scpred_RGC', 'scpred_Mic', 'scpred_MG', 'scpred_Rod', 'scpred_Cone', 'scpred_max', 'scpred_prediction', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.5', 'seurat_clusters', 'region', 'group', 'leiden', 'domain', 'cell_type' var: 'features', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'mean', 'std', 'chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts', 'gene_id', 'gene_version', 'gene_source', 'gene_biotype' uns: 'hvg', 'leiden', 'log1p', 'neighbors', 'pca', 'scpred_prediction_colors', 'umap', 'scglue__' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'counts' obsp: 'connectivities', 'distances'

fe4960 commented 1 year ago

Hi @Jeff1995

I found the reason is that all the values in atac.var["highly_variable"] are False. I checked the steps to preprocess atac object, and found initially atac.var["highly_variable"] contains "False" and "True" values. But after "guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)", the atac object only contains "False" values. Please see details below. Could you help suggest how to fix it? Thanks a lot!

print(atac.var["highly_variable"].value_counts())

False 340465 True 85116 Name: highly_variable, dtype: int64

atac.var["chrom"] = split.map(lambda x: x[0]) atac.var["chromStart"] = split.map(lambda x: x[1]).astype(int) atac.var["chromEnd"] = split.map(lambda x: x[2]).astype(int) atac.var.head()

print(atac.var["highly_variable"].value_counts())

False 340465 True 85116 Name: highly_variable, dtype: int64

guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)

print(atac.var["highly_variable"].value_counts())

False 425581 Name: highly_variable, dtype: int64

scglue.graph.check_graph(guidance, [rna, atac])

[INFO] check_graph: Checking variable coverage... [INFO] check_graph: Checking edge attributes... [INFO] check_graph: Checking self-loops... [INFO] check_graph: Checking graph symmetry... [INFO] check_graph: All checks passed!

print(atac.var["highly_variable"].value_counts())

False 425581 Name: highly_variable, dtype: int64

Jeff1995 commented 1 year ago

The scglue.genomics.rna_anchored_guidance_graph function propagates highly variable genes in the RNA modality to the ATAC modality based on the guidance graph, which in turn is constructed based on genomic overlap between ATAC peaks and RNA gene body + promoter regions by default.

So in principle, this can only happen if all highly variable RNA genes do not overlap any ATAC peaks in the gene body + promoter region, which should be rare. Could you check how many RNA genes were marked as "highly variable" in your data? Could it be that all RNA genes were marked as False? Also, for trouble shooting, you may also try setting extend_range to larger values like 150000 to see if that covers more ATAC peaks.

fe4960 commented 1 year ago

Thanks a lot for reply! I found the error is due to the chromosome names in gene annotation gtf file do not have "chr". But the ATAC peaks have "chr" for the coordinates. After modifing the gtf file, the error was fixed.