aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
177 stars 28 forks source link

KeyError: 'Gene' in get_search_space #381

Open orian116 opened 4 months ago

orian116 commented 4 months ago

Describe the bug

Hello. I seem to be getting an error at the get_search_space step in the scenicplus pipeline with snakemake. Any immediate thoughts about why this could be? Should my region sets have genes in them?

Error output [Wed May 1 21:44:02 2024]

localrule get_search_space:
    input: ACC_GEX.h5mu, genome_annotation.tsv, chromsizes.tsv
    output: search_space.tsv
    jobid: 11
    reason: Missing output files: search_space.tsv; Input files updated by another job: chromsizes.tsv, genome_annotation.tsv
    resources: tmpdir=/tmp

2024-05-01 21:44:15,970 SCENIC+      INFO     Reading data
/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
Traceback (most recent call last):
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3800, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Gene'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/users/ostaplet/miniconda3/envs/scenic_devel/bin/scenicplus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
    args.func(args)
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 208, in search_space
    get_search_space_command(
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 661, in get_search_space_command
    search_space = get_search_space(
                   ^^^^^^^^^^^^^^^^^
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/scenicplus/data_wrangling/gene_search_space.py", line 297, in get_search_space
    if pr_gene_annotation.df['Gene'].isnull().to_numpy().any():
       ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/frame.py", line 3805, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/ostaplet/miniconda3/envs/scenic_devel/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    raise KeyError(key) from err
KeyError: 'Gene'
[Wed May  1 21:49:28 2024]
Error in rule get_search_space:
    jobid: 11
    input: ACC_GEX.h5mu, genome_annotation.tsv, chromsizes.tsv
    output: search_space.tsv
    shell:

        scenicplus prepare_data search_spance             --multiome_mudata_fname ACC_GEX.h5mu             --gene_annotation_fname genome_annotation.tsv             --chromsizes_fname chromsizes.tsv             --out_fname search_space.tsv             --upstream 1000 150000             --downstream 1000 150000             --extend_tss 10 10

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-01T214254.576799.snakemake.log
WorkflowError:
At least one job did not complete successfully.
**Expected behavior**
A clear and concise description of what you expected to happen.

Datasnapshot This is my config file: input_data: cisTopic_obj_fname: "scATAC/cistopic_obj_withUMAP_string.pkl" GEX_anndata_fname: "scRNA/rna_batch3_coussens_updatedMeta.h5ad" region_set_folder: "scATAC/region_sets_all/" ctx_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.rankings.feather" dem_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.scores.feather" path_to_motif_annotations: "scATAC/motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl"

output_data:

output for prepare_GEX_ACC .h5mu

combined_GEX_ACC_mudata: "ACC_GEX.h5mu"

output for motif enrichment results .hdf5

dem_result_fname: "dem_results.hdf5" ctx_result_fname: "ctx_results.hdf5"

output html for motif enrichment results .html

output_fname_dem_html: "dem_results.html" output_fname_ctx_html: "ctx_results.html"

output for prepare_menr .h5ad

cistromes_direct: "cistromes_direct.h5ad" cistromes_extended: "cistromes_extended.h5ad"

output tf names .txt

tf_names: "tf_names.txt"

output for download_genome_annotations .tsv

genome_annotation: "genome_annotation.tsv" chromsizes: "chromsizes.tsv"

output for search_space .tsb

search_space: "search_space.tsv"

output tf_to_gene .tsv

tf_to_gene_adjacencies: "tf_to_gene_adj.tsv"

output region_to_gene .tsv

region_to_gene_adjacencies: "region_to_gene_adj.tsv"

output eGRN .tsv

eRegulons_direct: "eRegulon_direct.tsv" eRegulons_extended: "eRegulons_extended.tsv"

output AUCell .h5mu

AUCell_direct: "AUCell_direct.h5mu" AUCell_extended: "AUCell_extended.h5mu"

output scplus mudata .h5mu

scplus_mdata: "scplusmdata.h5mu"

params_general: temp_dir: "scenicPlus/scratch" n_cpu: 20 seed: 666

params_data_preparation:

Params for prepare_GEX_ACC

bc_transform_func: "\"lambda x: f'{x}'\"" is_multiome: False key_to_group_by: "celltype" nr_cells_per_metacells: 10

Params for prepare_menr

direct_annotation: "Direct_annot" extended_annotation: "Orthology_annot"

Params for download_genome_annotations

species: "mmusculus" biomart_host: "http://www.ensembl.org"

Params for search_space

search_space_upstream: "1000 150000" search_space_downstream: "1000 150000" search_space_extend_tss: "10 10"

params_motif_enrichment: species: "mus_musculus" annotation_version: "v10nr_clust" motif_similarity_fdr: 0.001 orthologous_identity_threshold: 0.0 annotations_to_use: "Direct_annot Orthology_annot" fraction_overlap_w_dem_database: 0.4 dem_max_bg_regions: 500 dem_balance_number_of_promoters: True dem_promoter_space: 1_000 dem_adj_pval_thr: 0.05 dem_log2fc_thr: 1.0 dem_mean_fg_thr: 0.0 dem_motif_hit_thr: 3.0 fraction_overlap_w_ctx_database: 0.4 ctx_auc_threshold: 0.005 ctx_nes_threshold: 3.0 ctx_rank_threshold: 0.05

params_inference:

Params for tf_to_gene

tf_to_gene_importance_method: "GBM"

Params regions_to_gene

region_to_gene_importance_method: "GBM" region_to_gene_correlation_method: "SR"

Params for eGRN inference

order_regions_to_genes_by: "importance" order_TFs_to_genes_by: "importance" gsea_n_perm: 1000 quantile_thresholds_region_to_gene: "0.85 0.90 0.95" top_n_regionTogenes_per_gene: "5 10 15" top_n_regionTogenes_per_region: "" min_regions_per_gene: 0 rho_threshold: 0.05 min_target_genes: 10

Version (please complete the following information): I am using scenicplus 1.0a1 from the main branch

SeppeDeWinter commented 4 months ago

Hi @orian116

This is an issue with your genome annotation file (this should contain a column named "Gene").

Can you show the head of genome_annotation.tsv

All the best,

Seppe

orian116 commented 4 months ago

Hi @SeppeDeWinter Here is the head and tail of my genome annotation file. It does have the 'Gene' column:

Screenshot 2024-05-02 at 12 31 15 PM Screenshot 2024-05-02 at 12 31 38 PM
LukasOussoren commented 4 months ago

I have this exact same issue and also do have the "Gene" column in my gene_annotation.tsv

orian116 commented 4 months ago

@SeppeDeWinter I have narrowed down the error a bit. It seems like in the gene_search_space.py function, the column names are present before running pr_gene_annotation = pr.PyRanges(gene_annotation.query("Gene in @scplus_genes").copy()) but not after. I will try converting to a pyranges object without this. What file is "@scplus_genes" referencing? Here's a snippet of the code. [

Screenshot 2024-05-06 at 7 09 18 PM

](url)

LukasOussoren commented 4 months ago
Capture

I tried defining scplus_genes before the line of code you highlighted. So far it appears to have solved the issue.

orian116 commented 4 months ago

Thank you! That seemed to work but now I am getting a new error

Screenshot 2024-05-07 at 11 34 15 AM
LukasOussoren commented 4 months ago

Yea same here. No objects to concatenate... Stuck on the same thing.

LukasOussoren commented 4 months ago

Thank you! That seemed to work but now I am getting a new error Screenshot 2024-05-07 at 11 34 15 AM

Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.

import scanpy as sc

adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")

And then just use the new file in your config script.

Hope this works for you too. If so, then your next error will probably be this: image

If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus

Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:

Add assay data

assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }

The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R

Raghav1881 commented 4 months ago

Thank you! That seemed to work but now I am getting a new error Screenshot 2024-05-07 at 11 34 15 AM

Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.

import scanpy as sc

adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")

And then just use the new file in your config script.

Hope this works for you too. If so, then your next error will probably be this: image

If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus

Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:

Add assay data

assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }

The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R

How did you get to upgrade dask without upgrading pandas? When I run scenicplus after following your steps, I get an error for is_categorical being deprecated

LukasOussoren commented 4 months ago

Hmmm that sounds strange

I literally just did this: pip install dask==2024.4.1

And ran it again and then it worked

orian116 commented 3 months ago

Thank you so much! I was able to get to the eRegulon dataframes