aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
186 stars 29 forks source link

Clarification needed: Input_file for running snakemake pipieline? #409

Closed leahyekim closed 5 months ago

leahyekim commented 6 months ago

Hello! Thank you so much for developing such an amazing tool :)

I would like get some clarification on what input_data should be for running the snakemake pipeline (Tutorial Step 4: Run SCENIC+ using Snakemake).

When I first ran this, !bat scplus_pipeline/Snakemake/config/config.yaml bat command was not found (I don't have the error output anymore, sorry).

Instead, I was able to see the output when I did !cat scplus_pipeline/Snakemake/config/config.yaml

However, the path to my input_data were all empty.

So I ended up editing my config.yaml file by manually putting the path to each input files: The path to my config.yaml file is: /home1/yeeunk/scenicplus/scplus_pipeline/Snakemake/config/config.yaml

input_data:
  cisTopic_obj_fname: "/home1/yeeunk/scenic+_Mar2024/outs/cistopic_obj.pkl"
  GEX_anndata_fname: "/home1/yeeunk/scenic+_Mar2024/scRNA/adata.h5ad"
  region_set_folder: "/home1/yeeunk/scenic+_Mar2024/outs/region_sets"
  ctx_db_fname: "/home1/yeeunk/scenic+_Mar2024/outs/region_sets/multiome_P1coc.regions_vs_motifs.rankings.feather"
  dem_db_fname: "/home1/yeeunk/scenic+_Mar2024/outs/region_sets/multiome_P1coc.regions_vs_motifs.scores.feather"
  path_to_motif_annotations: "/home1/yeeunk/scenic+_Mar2024/outs/region_sets/motifs-v10-nr.mgi-m0.00001-o0.0.tbl"

Then I ran snakemake --cores 20 I just ran this because I thought the source and activating the conda environment is not necessary. I have my scenicplus environment activated in my jupyter notebook.

And I got this error (I copied and pasted the error at the bottom so that other people can search for it as well) scenicplus_1 scenicplus_2

I think there is something wrong with my .feather file which was generated after creating the custom cistarget database. The code that I used to generate that database is: Screen Shot 2024-05-24 at 1 31 01 PM

The output of that code was exactly the same as on the documentation here : https://scenicplus.readthedocs.io/en/latest/human_cerebellum_ctx_db.html

I also took a look at what .tbl file looks like (I'm working with a mouse dataset)

tbl_file_path = '/home1/yeeunk/scenicplus/aertslab_motif_colleciton/v10nr_clust_public/snapshots/motifs-v10-nr.mgi-m0.00001-o0.0.tbl'
df = pd.read_csv(tbl_file_path, sep='\t')
print(df.head())

Screen Shot 2024-05-24 at 1 36 38 PM

And my regions_vs_motifs.rankings.feather : Screen Shot 2024-05-24 at 1 37 30 PM

And my regions_vs_motifs.scores.feather : Screen Shot 2024-05-24 at 1 37 55 PM

It seems like my input_data is fine for running rule motif_enrichment_cistarget according to your raw code here: Line 51-54 of https://github.com/aertslab/scenicplus/blob/main/src/scenicplus/snakemake/Snakefile

Could you please help me understand why I am not getting the output file of rule motif_enrichment_cistarget? I appreciate any guidance to debug this. Thank you so much!!

Version (please complete the following information):

[Pasting the error output below as well so that other people could search it] Assuming unrestricted shared filesystem usage for local execution. Building DAG of jobs... Using shell: /bin/bash Provided cores: 20 Rules claiming more threads will be scaled down. Job stats: job count


AUCell_direct 1 AUCell_extended 1 all 1 download_genome_annotations 1 eGRN_direct 1 eGRN_extended 1 get_search_space 1 motif_enrichment_cistarget 1 motif_enrichment_dem 1 prepare_GEX_ACC_multiome 1 prepare_menr 1 region_to_gene 1 scplus_mudata 1 tf_to_gene 1 total 14

Select jobs to execute... Execute 1 jobs...

[Fri May 24 10:24:14 2024] localrule motif_enrichment_cistarget: input: /home1/yeeunk/scenic+_Mar2024/outs/region_sets, /home1/yeeunk/scenic+_Mar2024/outs/region_sets/multiome_P1coc.regions_vs_motifs.rankings.feather, /home1/yeeunk/scenic+_Mar2024/outs/region_sets/motifs-v10-nr.mgi-m0.00001-o0.0.tbl output: /home1/yeeunk/scenicplus/outs/ctx_results.hdf5, /home1/yeeunk/scenicplus/outs/ctx_results.html jobid: 9 reason: Missing output files: /home1/yeeunk/scenicplus/outs/ctx_results.hdf5 threads: 20 resources: tmpdir=/tmp/SLURM_23457528

2024-05-24 10:24:24,015 SCENIC+ INFO Reading region sets from: /home1/yeeunk/scenic+_Mar2024/outs/region_sets 2024-05-24 10:24:24,015 SCENIC+ INFO Reading all .bed files in: Topics_top_3k 2024-05-24 10:24:24,166 SCENIC+ INFO Reading all .bed files in: Topics_otsu 2024-05-24 10:24:24,531 SCENIC+ INFO Reading all .bed files in: DARs_cell_type Traceback (most recent call last): File "/home1/yeeunk/.conda/envs/scenicplus/bin/scenicplus", line 8, in sys.exit(main()) ^^^^^^ File "/home1/yeeunk/scenicplus/src/scenicplus/cli/scenicplus.py", line 1137, in main args.func(args) File "/home1/yeeunk/scenicplus/src/scenicplus/cli/scenicplus.py", line 386, in motif_enrichment_cistarget run_motif_enrichment_cistarget( File "/home1/yeeunk/scenicplus/src/scenicplus/cli/commands.py", line 149, in run_motif_enrichment_cistarget region_set_dict[key_name] = pr.read_bed( ^^^^^^^^^^^^ File "/home1/yeeunk/.conda/envs/scenicplus/lib/python3.11/site-packages/pyranges/readers.py", line 79, in read_bed first_start = open(f).readline().split()[1]


IndexError: list index out of range
[Fri May 24 10:24:25 2024]
Error in rule motif_enrichment_cistarget:
    jobid: 9
    input: /home1/yeeunk/scenic+_Mar2024/outs/region_sets, /home1/yeeunk/scenic+_Mar2024/outs/region_sets/multiome_P1coc.regions_vs_motifs.rankings.feather, /home1/yeeunk/scenic+_Mar2024/outs/region_sets/motifs-v10-nr.mgi-m0.00001-o0.0.tbl
    output: /home1/yeeunk/scenicplus/outs/ctx_results.hdf5, /home1/yeeunk/scenicplus/outs/ctx_results.html
    shell:

            scenicplus grn_inference motif_enrichment_cistarget                 --region_set_folder /home1/yeeunk/scenic+_Mar2024/outs/region_sets                 --cistarget_db_fname /home1/yeeunk/scenic+_Mar2024/outs/region_sets/multiome_P1coc.regions_vs_motifs.rankings.feather                 --output_fname_cistarget_result /home1/yeeunk/scenicplus/outs/ctx_results.hdf5                 --temp_dir tmp                 --species homo_sapiens                 --fr_overlap_w_ctx_db 0.4                 --auc_threshold 0.005                 --nes_threshold 3.0                 --rank_threshold 0.05                 --path_to_motif_annotations /home1/yeeunk/scenic+_Mar2024/outs/region_sets/motifs-v10-nr.mgi-m0.00001-o0.0.tbl                 --annotation_version v10nr_clust                 --motif_similarity_fdr 0.001                 --orthologous_identity_threshold 0.0                 --annotations_to_use Direct_annot Orthology_annot                 --write_html                 --output_fname_cistarget_html /home1/yeeunk/scenicplus/outs/ctx_results.html                 --n_cpu 20

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-24T102414.404731.snakemake.log
WorkflowError:
At least one job did not complete successfully.
DmitriiSeverinov commented 5 months ago

Hi @leahyekim ,

I think once I had similar issue, please check in the folder region_sets/DARs_cell_type or in region_sets/Topics_top_3k (or whichever name your folders have there) that you do not have empty .bed files. After I removed mine, this function worked. I had empty bed files because one of my cell types had too little cells, and no DARs were found for it. At least, this is my explanation to it.

Best, Dmitrii

leahyekim commented 5 months ago

Hi @DmitriiSeverinov,

Thank you for your reply. Yes, you're right. There were some empty .bed files in that folder!

After removing those empty .bed files, the Snakemake pipeline got past that error.

Thank you so much!