Open orian116 opened 4 months ago
Hi @orian116
This is an issue with your genome annotation file (this should contain a column named "Gene").
Can you show the head of genome_annotation.tsv
All the best,
Seppe
Hi @SeppeDeWinter Here is the head and tail of my genome annotation file. It does have the 'Gene' column:
I have this exact same issue and also do have the "Gene" column in my gene_annotation.tsv
@SeppeDeWinter I have narrowed down the error a bit. It seems like in the gene_search_space.py function, the column names are present before running pr_gene_annotation = pr.PyRanges(gene_annotation.query("Gene in @scplus_genes").copy()) but not after. I will try converting to a pyranges object without this. What file is "@scplus_genes" referencing? Here's a snippet of the code. [
](url)
I tried defining scplus_genes before the line of code you highlighted. So far it appears to have solved the issue.
Thank you! That seemed to work but now I am getting a new error
Yea same here. No objects to concatenate... Stuck on the same thing.
Thank you! That seemed to work but now I am getting a new error
Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.
import scanpy as sc
adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")
And then just use the new file in your config script.
Hope this works for you too. If so, then your next error will probably be this:
If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus
Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:
assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }
The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R
Thank you! That seemed to work but now I am getting a new error
Okay, concatenate error solved for my part. I am just going to assume you processed your RNA-seq data in Seurat. I did this to fix the issue. It was damn near impossible to find the source of the error, but in the end it turned out to be an issue with how seurat exports h5ad files. For some reason it doesn't save the gene names in the right place, which means that the ACC_GEX.h5mu file you generate won't have the gene names stored correctly either.
import scanpy as sc
adata = sc.read("rna.h5ad") new_adata = adata.raw.to_adata() new_adata.var_names = adata.raw.var['_index'] #Or whatever your index is called new_adata.var_names.name = None new_adata.raw = new_adata sc.pp.normalize_total(new_adata, target_sum=1e4) sc.pp.log1p(new_adata) new_adata.write("rna_new.h5ad")
And then just use the new file in your config script.
Hope this works for you too. If so, then your next error will probably be this:
If you do get this error, updating dask to 2024.4.1 seems to fix it even though it warns you that it is not compatible with scenicplus
Also side note. If my assumption that you worked with seurat is correct, you may also have to go back into R and make sure to delete scale.data, because:
Add assay data
assay.group <- source[['assays']][[assay]] if (source$index()[[assay]]$slots[['scale.data']]) { x.data <- 'scale.data' raw.data <- 'data' } else { x.data <- 'data' raw.data <- if (source$index()[[assay]]$slots[['counts']]) { 'counts' } else { NULL } }
The data being called "raw.data" is not actually raw (Snippet from seurat-disk convert.R
How did you get to upgrade dask without upgrading pandas? When I run scenicplus after following your steps, I get an error for is_categorical being deprecated
Hmmm that sounds strange
I literally just did this: pip install dask==2024.4.1
And ran it again and then it worked
Thank you so much! I was able to get to the eRegulon dataframes
Describe the bug
Hello. I seem to be getting an error at the get_search_space step in the scenicplus pipeline with snakemake. Any immediate thoughts about why this could be? Should my region sets have genes in them?
Error output [Wed May 1 21:44:02 2024]
Datasnapshot This is my config file: input_data: cisTopic_obj_fname: "scATAC/cistopic_obj_withUMAP_string.pkl" GEX_anndata_fname: "scRNA/rna_batch3_coussens_updatedMeta.h5ad" region_set_folder: "scATAC/region_sets_all/" ctx_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.rankings.feather" dem_db_fname: "scATAC/mm10_screen_v10_clust.regions_vs_motifs.scores.feather" path_to_motif_annotations: "scATAC/motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl"
output_data:
output for prepare_GEX_ACC .h5mu
combined_GEX_ACC_mudata: "ACC_GEX.h5mu"
output for motif enrichment results .hdf5
dem_result_fname: "dem_results.hdf5" ctx_result_fname: "ctx_results.hdf5"
output html for motif enrichment results .html
output_fname_dem_html: "dem_results.html" output_fname_ctx_html: "ctx_results.html"
output for prepare_menr .h5ad
cistromes_direct: "cistromes_direct.h5ad" cistromes_extended: "cistromes_extended.h5ad"
output tf names .txt
tf_names: "tf_names.txt"
output for download_genome_annotations .tsv
genome_annotation: "genome_annotation.tsv" chromsizes: "chromsizes.tsv"
output for search_space .tsb
search_space: "search_space.tsv"
output tf_to_gene .tsv
tf_to_gene_adjacencies: "tf_to_gene_adj.tsv"
output region_to_gene .tsv
region_to_gene_adjacencies: "region_to_gene_adj.tsv"
output eGRN .tsv
eRegulons_direct: "eRegulon_direct.tsv" eRegulons_extended: "eRegulons_extended.tsv"
output AUCell .h5mu
AUCell_direct: "AUCell_direct.h5mu" AUCell_extended: "AUCell_extended.h5mu"
output scplus mudata .h5mu
scplus_mdata: "scplusmdata.h5mu"
params_general: temp_dir: "scenicPlus/scratch" n_cpu: 20 seed: 666
params_data_preparation:
Params for prepare_GEX_ACC
bc_transform_func: "\"lambda x: f'{x}'\"" is_multiome: False key_to_group_by: "celltype" nr_cells_per_metacells: 10
Params for prepare_menr
direct_annotation: "Direct_annot" extended_annotation: "Orthology_annot"
Params for download_genome_annotations
species: "mmusculus" biomart_host: "http://www.ensembl.org"
Params for search_space
search_space_upstream: "1000 150000" search_space_downstream: "1000 150000" search_space_extend_tss: "10 10"
params_motif_enrichment: species: "mus_musculus" annotation_version: "v10nr_clust" motif_similarity_fdr: 0.001 orthologous_identity_threshold: 0.0 annotations_to_use: "Direct_annot Orthology_annot" fraction_overlap_w_dem_database: 0.4 dem_max_bg_regions: 500 dem_balance_number_of_promoters: True dem_promoter_space: 1_000 dem_adj_pval_thr: 0.05 dem_log2fc_thr: 1.0 dem_mean_fg_thr: 0.0 dem_motif_hit_thr: 3.0 fraction_overlap_w_ctx_database: 0.4 ctx_auc_threshold: 0.005 ctx_nes_threshold: 3.0 ctx_rank_threshold: 0.05
params_inference:
Params for tf_to_gene
tf_to_gene_importance_method: "GBM"
Params regions_to_gene
region_to_gene_importance_method: "GBM" region_to_gene_correlation_method: "SR"
Params for eGRN inference
order_regions_to_genes_by: "importance" order_TFs_to_genes_by: "importance" gsea_n_perm: 1000 quantile_thresholds_region_to_gene: "0.85 0.90 0.95" top_n_regionTogenes_per_gene: "5 10 15" top_n_regionTogenes_per_region: "" min_regions_per_gene: 0 rho_threshold: 0.05 min_target_genes: 10
Version (please complete the following information): I am using scenicplus 1.0a1 from the main branch