aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Snakemake pipeline stalls on 'download_genome_annotations' #357

Open jesswhitts opened 2 months ago

jesswhitts commented 2 months ago

Hello,

I'm using the development version of SCENIC+.

When running the snakemake pipeline, it seems to get stuck on 'download_genome_annotations'...

Contents of my Snakemake folder:

-rw-r-----. 1 jwhittle stemcell 19698468969 Apr 18 12:38 ACC_GEX.h5mu
drwxr-x---. 2 jwhittle stemcell        4096 Apr 15 14:18 config
-rw-r-----. 1 jwhittle stemcell  6736337146 Apr 18 12:18 ctx_results.hdf5
-rw-r-----. 1 jwhittle stemcell    14855562 Apr 18 12:18 ctx_results.html
-rw-r-----. 1 jwhittle stemcell         349 Apr 15 14:28 run_pipeline.sh
-rw-r-----. 1 jwhittle stemcell        2450 Apr 18 12:38 scplus.3295014.err
-rw-r-----. 1 jwhittle stemcell       84872 Apr 18 12:28 scplus.3295014.out
drwxr-x---. 2 jwhittle stemcell        4096 Apr 15 14:18 workflow

Output file:

2024-04-18 14:52:08,976 Download gene annotation INFO Using genome: GRCh38.p12
2024-04-18 14:52:08,987 Download gene annotation INFO Found corresponding genome Id 51 on NCBI
2024-04-18 14:52:09,493 Download gene annotation INFO Found corresponding assembly Id 11968211 on NCBI
2024-04-18 14:52:09,997 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt

Error file:

Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 48
Rules claiming more threads will be scaled down.
Job stats:
job                            count
AUCell_direct                  1
AUCell_extended                1
all                            1
download_genome_annotations    1
eGRN_direct                    1
eGRN_extended                  1
get_search_space               1
motif_enrichment_dem           1
prepare_menr                   1
region_to_gene                 1
scplus_mudata                  1
tf_to_gene                     1
total                          12

Select jobs to execute...
Execute 1 jobs...

[Thu Apr 18 14:51:02 2024]
localrule download_genome_annotations:
    output: genome_annotation.tsv, chromsizes.tsv
    jobid: 8
    reason: Missing output files: genome_annotation.tsv, chromsizes.tsv
    resources: tmpdir=/tmp

Any thoughts on the cause?

Many thanks, Jess

jesswhitts commented 2 months ago

I found a workaround for this. The problem was the FTP download stalling: for some reason it got stuck without timing out or failing.

I edited the code in 'data_wrangling/gene_search_space.py' to use HTTP instead by adding this after line 169:

ncbi_assembly_report_url = ncbi_assembly_report_url.replace('ftp://', 'http://')

There's probably a more sensible way to implement this, but it runs fine for me now.
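A slightly more defensive variant of the same idea might look like the sketch below. It rewrites only the URL scheme (rather than doing a plain string replace) and passes an explicit timeout, so a stalled connection raises an error instead of hanging. The helper names `ftp_to_https` and `fetch_text` are hypothetical and not part of SCENIC+, and the sketch assumes the NCBI host serves the same path over HTTPS. It does not reflect how SCENIC+ itself performs the download.

```python
import urllib.request
from urllib.parse import urlsplit


def ftp_to_https(url: str) -> str:
    """Swap an ftp:// scheme for https://; any other URL is returned unchanged.

    Hypothetical helper: assumes the host serves the same path over HTTPS.
    """
    parts = urlsplit(url)
    if parts.scheme.lower() != "ftp":
        return url
    return parts._replace(scheme="https").geturl()


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL as text with a socket timeout.

    The timeout applies to each blocking socket operation (connect, read),
    not to the total download time, so a silent stall raises an exception
    rather than blocking forever.
    """
    with urllib.request.urlopen(ftp_to_https(url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With this, the assembly report URL from the log above would be requested as `https://ftp.ncbi.nlm.nih.gov/...` rather than `ftp://ftp.ncbi.nlm.nih.gov/...`, and a hung connection would surface as an exception that Snakemake reports as a failed job.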