aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
178 stars 28 forks source link

Error when running SnakeMake #468

Open gilgolan73 opened 1 week ago

gilgolan73 commented 1 week ago

Describe the bug Hello, I'm trying to run SCENIC+ using SnakeMake in a linux machine (centos 9), on the tutorial dataset. I ran scATAC-seq preprocessing in python (using pycistopic, using the tutorial: https://pycistopic.readthedocs.io/en/latest/notebooks/human_cerebellum.html) Then I ran the scRNAseq preprocessing in python (using the tutorial: https://scenicplus.readthedocs.io/en/latest/human_cerebellum_scRNA_pp.html#Preprocessing-the-scRNA-seq-data). I'm using the default config.yml file for SnakeMake, just changed the location of the input data. 1-2 minutes after running SnakeMake (as in the tutorial: https://scenicplus.readthedocs.io/en/latest/human_cerebellum.html#Running-SCENIC+), I receive an error (see below) which I believe is regarding to the different cell names between the scATACseq and scRNAseq datasets. Please let me know how to solve this issue. Thank you!

To Reproduce "snakemake --cores 20"

Error output "Assuming unrestricted shared filesystem usage for local execution. Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 20 Rules claiming more threads will be scaled down. Job stats: job count


AUCell_direct 1 AUCell_extended 1 all 1 download_genome_annotations 1 eGRN_direct 1 eGRN_extended 1 get_search_space 1 motif_enrichment_cistarget 1 motif_enrichment_dem 1 prepare_GEX_ACC_multiome 1 prepare_menr 1 region_to_gene 1 scplus_mudata 1 tf_to_gene 1 total 14

Select jobs to execute... Execute 2 jobs...

[Tue Sep 17 16:12:50 2024] localrule prepare_GEX_ACC_multiome: input: /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/outs/cistopic_obj.pkl, /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/scRNAseq/adata.h5ad output: ACC_GEX.h5mu jobid: 2 reason: Missing output files: ACC_GEX.h5mu resources: tmpdir=/tmp

[Tue Sep 17 16:12:50 2024] localrule download_genome_annotations: output: genome_annotation.tsv, chromsizes.tsv jobid: 8 reason: Missing output files: chromsizes.tsv, genome_annotation.tsv resources: tmpdir=/tmp

2024-09-17 16:12:52,879 SCENIC+ INFO Reading cisTopic object. 2024-09-17 16:12:53,289 SCENIC+ INFO Reading gene expression AnnData. Traceback (most recent call last): File "/home/gilgolan/.local/bin/scenicplus", line 8, in sys.exit(main()) ^^^^^^ File "/home/gilgolan/.local/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main args.func(args) File "/home/gilgolan/.local/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 46, in command_prepare_GEX_ACC prepare_GEX_ACC( File "/home/gilgolan/.local/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 96, in prepare_GEX_ACC mdata = process_multiome_data( ^^^^^^^^^^^^^^^^^^^^^^ File "/home/gilgolan/.local/lib/python3.11/site-packages/scenicplus/data_wrangling/adata_cistopic_wrangling.py", line 73, in process_multiome_data raise Exception( Exception: No cells found which are present in both assays, check input and consider using bc_transform_func! [Tue Sep 17 16:12:53 2024] Error in rule prepare_GEX_ACC_multiome: jobid: 2 input: /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/outs/cistopic_obj.pkl, /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/scRNAseq/adata.h5ad output: ACC_GEX.h5mu shell:

        scenicplus prepare_data prepare_GEX_ACC                 --cisTopic_obj_fname /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/outs/cistopic_obj.pkl                 --GEX_anndata_fname /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/scRNAseq/adata.h5ad                 --out_file ACC_GEX.h5mu                 --bc_transform_func "lambda x: f'{x}'"

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

2024-09-17 16:13:38,876 Download gene annotation INFO Using genome: GRCh38.p14 2024-09-17 16:13:38,878 Download gene annotation INFO Found corresponding genome Id 51 on NCBI 2024-09-17 16:13:39,381 Download gene annotation INFO Found corresponding assembly Id 11968211 on NCBI 2024-09-17 16:13:39,884 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt

Unhandeled exception occured <urlopen error [Errno 110] Connection timed out> Returning gene annotation without subestting for assembled chromosomesand converting to UCSC style. Please make sure that the chromosome namesin the returned object match with the chromosome names in the scplus_obj.Chromosome sizes will not be returned 2024-09-17 16:15:50,496 SCENIC+ INFO Chrosomome sizes was not found, please provide this information manually. 2024-09-17 16:15:50,497 SCENIC+ INFO Saving genome annotation to: genome_annotation.tsv Waiting at most 5 seconds for missing files. MissingOutputException in rule download_genome_annotations in file /home/gilgolan/bioinfo_analysis/scenicplus/tutorial2/scplus_pipeline/Snakemake/workflow/Snakefile, line 221: Job 8 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait: chromsizes.tsv Removing output files of failed job download_genome_annotations since they might be corrupted: genome_annotation.tsv Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2024-09-17T161250.453181.snakemake.log WorkflowError: At least one job did not complete successfully."

Expected behavior I expect SnakeMake to run successfully on the tutorial pre-processed dataset.

Screenshots If applicable, add screenshots to help explain your problem or show the format of the input data for the command/s.

Version (please complete the following information): Python: 3.11.9 SCENIC+: 1.0a1 pyscenic 0.12.1+8.gd2309fe

Additional context When I look at the cell names in the adata object (scRNAseq) and cistopic object (scATACseq) they are different, also I have a different number of cells: "adata.obs Out[216]: VSN_cell_type ... pct_counts_mt CCCTCATAGACACTTA-1 GC ... 0.083148 GCCATTACACCTGCCT-1 ASTP ... 0.132626 ATTGCAGGTTGTGACA-1 MGL ... 1.447368 CTGTTGGAGGCATTAC-1 MOL_B ... 0.095579 CGAATCTAGCTTAGCG-1 MOL_B ... 0.061851 ... ... ... ... TACCTTAGTTACTAGG-1 MOL_B ... 0.058480 GTAGCTGTCATTACAG-1 AST_CER ... 0.188088 AGGCAGGTCGCGACAC-1 MOL_A ... 0.044703 GCATTGCCAAGACTCC-1 MOL_B ... 0.145530 ACATTAGTCCGCAAGC-1 AST_CER ... 0.068552 [2313 rows x 13 columns]

cistopic_obj.cell_data Out[218]: cisTopic_nr_frag ... pycisTopic_cca_Seurat_cell_type CACCTCAGTTGTAAAC-1-10x_multiome_brain 18300 ... AST TGACTCCTCATCCACC-1-10x_multiome_brain 100055 ... BG TTTCTCACATAAACCT-1-10x_multiome_brain 32192 ... GP GTCCTCCCACACAATT-1-10x_multiome_brain 88443 ... BG CTCCGTCCAGTTTGTG-1-10x_multiome_brain 131110 ... ENDO ... ... ... ... GCAGGTTGTCCAAATG-1-10x_multiome_brain 2770 ... MOL AAGCTCCCAGCACCAT-1-10x_multiome_brain 2180 ... MOL CAGAATCTCCTCATGC-1-10x_multiome_brain 1744 ... MOL TAGCCGGGTAACAGGG-1-10x_multiome_brain 2674 ... INH_SNCG GTGCGCAGTGCTTAGA-1-10x_multiome_brain 5859 ... GP [2845 rows x 42 columns]"