LCR-BCCRC / lcr-modules

Collection of standard analytical pipelines for genomic and transcriptomic data
https://lcr-modules.rtfd.io
MIT License

Demo doesn't dry-run as documented #288

Closed Pranav-Garg closed 3 months ago

Pranav-Garg commented 1 year ago

The command nice snakemake --dry-run --use-conda all fails because there is no Snakefile in the demo directory. I then tried snakemake --dry-run --use-conda all -s genome_Snakefile.smk, which also fails with:

InputFunctionException in rule hardlink_download in file /projects/karsanlab/pgarg_dev/software/src/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk, line 584:
Error:
  Exception:  Could not find rule to generate   genomes/ grch37 / main_chromosomes/main_chromosomes.grch37.txt  .
Wildcards:
  genome_build=grch37
  suffix=main_chromosomes/main_chromosomes.grch37.txt
Traceback:
  File "/projects/karsanlab/pgarg_dev/software/src/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk", line 475, in hardlink_same_provider

(the exact rule that fails above is random)

Editing the file lcr-modules/workflows/reference_files/2.4/reference_files_header.smk, under function get_matching_download_rules, I changed

rule_names = [ r for r in dir(rules) if r.startswith("download_")]

to

rule_names = [ r for r in rules._rules.keys() if r.startswith("download_")]

and added

file = file.replace(' ', '')

to the same function, since whitespace was somehow being prepended to the paths. This also fails. The developers are probably better equipped to debug this issue than I am.
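The two edits described above can be sketched in isolation. This is only an illustration: `FakeRules` is a hypothetical stand-in for Snakemake's `rules` object (not the real class), the rule names other than the `download_` prefix are made up, and `strip_spaces` is a hypothetical helper that mirrors the `file.replace(' ', '')` line.

```python
class FakeRules:
    """Hypothetical stand-in for Snakemake's `rules` object (not the real class)."""

    def __init__(self):
        self._rules = {}

    def register(self, name, rule):
        # Store the rule both in a mapping and as an attribute,
        # so that both lookup styles below have something to find.
        self._rules[name] = rule
        setattr(self, name, rule)


rules = FakeRules()
for name in ("download_genome_fasta", "hardlink_download", "download_gnomad"):
    rules.register(name, object())

# Original approach: scan every attribute name exposed by the object.
via_dir = [r for r in dir(rules) if r.startswith("download_")]

# Edited approach: read the names from the internal mapping instead.
via_keys = [r for r in rules._rules.keys() if r.startswith("download_")]


def strip_spaces(path):
    """Remove every space from a path string, as in the edit above."""
    return path.replace(" ", "")


noisy = "genomes/ grch37 / main_chromosomes/main_chromosomes.grch37.txt"
clean = strip_spaces(noisy)
```

Note that `dir()` returns names in sorted order while the mapping keeps insertion order, and that removing every space would also alter any path that legitimately contains one.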

Related to this, it would be helpful to see what directory structure and files the Reference Files Workflow generates, so that I can symlink existing downloads to it, and perhaps bypass the above issue.

Kdreval commented 1 year ago

Hi @Pranav-Garg,

That's right: the Snakefile has been split into seq_type-specific workflows to better represent real-world scenarios. The documentation has not been updated since then, but we are planning to bring it up to date soon.

Thank you for reporting the error. Someone else also reported this issue and we are investigating. It appears that subworkflows are handled differently after a certain version of Snakemake. Can you please:

  1. Post the version of snakemake you have in the environment
  2. Let us know if there is a Warning message printed to stdout right after you try running the snakemake command indicating that oncopipe was imported outside of snakemake.

Thanks

Pranav-Garg commented 1 year ago

Snakemake version: 7.32.4

Yes, there was such a warning. Full output below (I removed the full file paths):

Warning: The oncopipe package was imported outside of a snakefile. Most functions are designed to work within a snakefile. Some unexpected behaviour/errors might occur.
modules/slms_3/1.0/slms_3.smk:411: SyntaxWarning: invalid escape sequence '\s'
  str(rules._slms_3_annotate_strelka_gnomad.output.vcf),
modules/slms_3/1.0/slms_3.smk:438: SyntaxWarning: invalid escape sequence '\s'
  rules._starfish_all.input,
modules/slms_3/1.0/slms_3.smk:534: SyntaxWarning: invalid escape sequence '\#'
modules/slms_3/1.0/slms_3.smk:553: SyntaxWarning: invalid escape sequence '\#'
modules/slms_3/1.0/../../starfish/2.0/starfish.smk:236: SyntaxWarning: invalid escape sequence '\S'
  # Perform some clean-up tasks, including storing the module-specific
modules/pathseq/1.0/pathseq.smk:119: SyntaxWarning: invalid escape sequence '\>'
  R={input.genome_fa}
modules/pathseq/1.0/pathseq.smk:133: SyntaxWarning: invalid escape sequence '\>'
  log:
Building DAG of jobs...
Executing subworkflow reference_files.
workflows/reference_files/2.4/reference_files.smk:1460: SyntaxWarning: invalid escape sequence '\/'
workflows/reference_files/2.4/reference_files.smk:1480: SyntaxWarning: invalid escape sequence '\/'
Building DAG of jobs...
InputFunctionException in rule hardlink_download in file workflows/reference_files/2.4/reference_files_header.smk, line 584:
Error:
  Exception:  Could not find rule to generate   genomes/ grch37 / repeatmasker/repeatmasker.grch37.bed  .
Wildcards:
  genome_build=grch37
  suffix=repeatmasker/repeatmasker.grch37.bed
Traceback:
  File "workflows/reference_files/2.4/reference_files_header.smk", line 475, in hardlink_same_provider
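The `invalid escape sequence` warnings in the output above are separate from the `InputFunctionException`. As a rough illustration of where such warnings come from, the sketch below compiles three small source strings and records the warnings Python raises for each. The helper name `escape_warnings` and the sample strings are assumptions made for this example; they are not taken from lcr-modules.

```python
import warnings


def escape_warnings(source):
    """Compile `source` and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<sketch>", "exec")
    return [w.category.__name__ for w in caught]


# Source text: pattern = "\s+"   (backslash-s is not a valid string escape)
plain_src = 'pattern = "\\s+"'

# Source text: pattern = r"\s+"  (raw string, the backslash is kept literally)
raw_src = 'pattern = r"\\s+"'

# Source text: pattern = "\\s+"  (escaped backslash, also valid)
doubled_src = 'pattern = "\\\\s+"'

plain_result = escape_warnings(plain_src)
raw_result = escape_warnings(raw_src)
doubled_result = escape_warnings(doubled_src)
```

Depending on the Python version, the category reported for the first string is either `DeprecationWarning` or `SyntaxWarning`; the raw and doubled forms compile without a warning.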

focusonskills commented 6 months ago

Have you managed to solve the issue? I have encountered the same problem trying to run the demo data.

Kdreval commented 6 months ago

Hi all, sorry for the delay in resolving this issue. @focusonskills, the problem is caused by the way newer Snakemake versions handle subworkflows, which makes those versions incompatible with lcr-modules. A solution is to use a locked conda environment built from the following recipe: https://github.com/LCR-BCCRC/lcr-modules/blob/master/demo/env.yaml This is a copy of our production environment, and it has been tested to resolve this issue on several systems and OS versions.

Please let us know if you have any other questions.

focusonskills commented 6 months ago

I've tried to create a locked environment with conda-lock using env.yaml under the demo folder. However, I am still getting the same output as the OP, where it gets stuck at generating the reference files. Which Snakemake version is actually required for the modules? I see these in the env.yaml:

lkhilton commented 6 months ago

When we generate the environment, snakemake --version returns 7.15.2. This version should work. What is the output of snakemake --version and pip show oncopipe when you activate your environment? Can you post the output of conda env export from the activated environment?
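A minimal sketch of the version check being asked for here, assuming `snakemake` is on the PATH of the activated environment. The helper names (`parse_version`, `matches_expected`, `installed_snakemake_version`) are hypothetical; the expected version `7.15.2` is the one quoted above.

```python
import re
import subprocess

# Version reported in this thread for the working environment.
EXPECTED = "7.15.2"


def parse_version(text):
    """Return the first dotted X.Y.Z version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def matches_expected(reported, expected=EXPECTED):
    """True when the reported version equals the expected one."""
    return parse_version(reported) == parse_version(expected)


def installed_snakemake_version():
    """Run `snakemake --version` and return its output as a stripped string."""
    result = subprocess.run(
        ["snakemake", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With the environment activated, `matches_expected(installed_snakemake_version())` would show whether the Snakemake that actually runs is the pinned one.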

focusonskills commented 6 months ago

snakemake --version gives 7.32.4, pip show oncopipe gives version 1.0.12, and conda env export gives the following. Do you know why the Snakemake version doesn't match the one in the conda environment?

name: opv12
channels:

lkhilton commented 6 months ago

This sounds like a problem with how your PATH environment variable is set. If you 'echo $PATH' with the conda environment activated, the path to the opv12 conda environment should be at the beginning of your path. If it's not, take a look at how you've modified your PATH in your .bashrc file.
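The PATH check described above can also be scripted. The sketch below is one way to do it, assuming a Unix-style environment layout where executables live in `<prefix>/bin`; `env_bin_is_first` is a hypothetical helper. `CONDA_PREFIX` is the variable conda sets when an environment is activated.

```python
import os
import shutil


def env_bin_is_first(path_value, env_prefix):
    """True when `<env_prefix>/bin` is the first non-empty entry of a PATH string."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if not entries:
        return False
    env_bin = os.path.join(env_prefix, "bin")
    return os.path.normpath(entries[0]) == os.path.normpath(env_bin)


if __name__ == "__main__":
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        print("No conda environment is active (CONDA_PREFIX is unset).")
    else:
        first = env_bin_is_first(os.environ.get("PATH", ""), prefix)
        print("environment bin directory is first on PATH:", first)
    # Shows which snakemake executable would actually run (None if not found).
    print("snakemake resolves to:", shutil.which("snakemake"))
```

If the environment's bin directory is not first, an earlier PATH entry can shadow the environment's `snakemake`, which would explain a mismatched `snakemake --version`.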

focusonskills commented 6 months ago

echo $PATH returns the opv12 environment at the beginning: /home/bioinf/miniconda3/envs/opv12/bin:

snakemake --version returns 7.15.2, which now matches the environment.

However, running nice snakemake --dry-run --use-conda all -s capture_Snakefile.smk still gives the error below.

Building DAG of jobs...
Executing subworkflow reference_files.
Creating specified working directory /mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/demo/reference.
Building DAG of jobs...
InputFunctionException in line 583 of /mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk:
Error:
  AssertionError: The download_oncodrive_hg19_regions download rule doesn't have a provider param.
Wildcards:
  genome_build=grch37
  suffix=gnomad/af-only-gnomad.grch37.vcf
Traceback:
  File "/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk", line 453, in hardlink_same_provider
  File "/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk", line 423, in get_matching_download_rules

lkhilton commented 6 months ago

Thanks for your patience @focusonskills. This was a known problem addressed in #310 and should be fixed if you pull from master again.
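For readers trying to locate which rules trigger this assertion, the check can be pictured roughly as follows. This is not the code in reference_files_header.smk: the dictionary shape, the helper `rules_missing_provider`, the rule name `download_genome_fasta`, and the provider value are all assumptions made for illustration. Only `download_oncodrive_hg19_regions` comes from the error message.

```python
def rules_missing_provider(download_rules):
    """Return the sorted names of rules whose params lack a 'provider' entry.

    `download_rules` is a hypothetical mapping of rule name -> params dict.
    """
    return sorted(
        name for name, params in download_rules.items() if "provider" not in params
    )


# Hypothetical example data; only the second rule name appears in the thread.
example_rules = {
    "download_genome_fasta": {"provider": "ensembl"},
    "download_oncodrive_hg19_regions": {},
}

missing = rules_missing_provider(example_rules)
```

Listing every offending rule at once, rather than failing on the first, makes it easier to see whether a fix covered all of the download rules.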

focusonskills commented 6 months ago

@lkhilton I've updated workflows/reference_files/2.4/reference_files.smk with the following.

rule download_oncodrive_refs:
    output:
        refs = "downloads/oncodrive/datasets/genomereference/{oncodrive_build}.master",
        stops = "downloads/oncodrive/datasets/genestops/{oncodrive_build}.master"
    params:
        outdir = "downloads/oncodrive/{version}/",
        provider = lambda w: config["genome_builds"][w.version]["provider"]

I've also checked that modules/oncodriveclustl/1.0/oncodriveclustl.smk matches the suggested correction below.

rule _oncodriveclustl_run:
    input:
        maf = str(rules._oncodriveclustl_format_input.output.maf),
        reference = lambda w: reference_files("downloads/oncodrive/{genome_build}/datasets/genomereference/" + ONCODRIVE_BUILD_DICT[w.genome_build] + ".master"),
        region = _get_region
    output:
        txt = CFG["dirs"]["oncodriveclustl"] + "{genome_build}/{sample_set}--{launch_date}/{md5sum}/{region}/elements_results.txt",
        tsv = CFG["dirs"]["oncodriveclustl"] + "{genome_build}/{sample_set}--{launch_date}/{md5sum}/{region}/clusters_results.tsv",
        png = CFG["dirs"]["oncodriveclustl"] + "{genome_build}/{sample_set}--{launch_date}/{md5sum}/{region}/quantile_quantile_plot.png"
    log:
        stdout = CFG["logs"]["oncodriveclustl"] + "{genome_build}/{sample_set}--{launch_date}/{md5sum}/{region}/oncodriveclustl.stdout.log",
        stderr = CFG["logs"]["oncodriveclustl"] + "{genome_build}/{sample_set}--{launch_date}/{md5sum}/{region}/oncodriveclustl.stderr.log"
    params:
        local_path = CFG["reference_files_directory"] + "{genome_build}/",
        build = lambda w: (w.genome_build).replace("grch37","hg19").replace("grch38","hg38"),
        command_line_options = CFG["options"]["clustl_options"] if CFG["options"]["clustl_options"] is not None else ""

However, nice snakemake --dry-run --use-conda all -s capture_Snakefile.smk still returns the AssertionError below.

Building DAG of jobs...
Executing subworkflow reference_files.
Creating specified working directory /mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/demo/reference.
Building DAG of jobs...
InputFunctionException in line 583 of /mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk:
Error:
  AssertionError: The download_oncodrive_hg19_regions download rule doesn't have a provider param.
Wildcards:
  genome_build=grch37
  suffix=main_chromosomes/main_chromosomes.grch37.txt
Traceback:
  File "/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk", line 453, in hardlink_same_provider
  File "/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/workflows/reference_files/2.4/reference_files_header.smk", line 423, in get_matching_download_rules

lkhilton commented 6 months ago

Could you please pull from master one more time? The commit I made to fix the assertion error was overwritten in another branch. You should see these changes after pulling the latest changes.

focusonskills commented 6 months ago

Thank you! The reference files have been successfully generated with the update and there are no more assertion errors. However, I'm now encountering other errors downstream with the Battenberg/ASCAT installation. Could you shed some light on the issue?

Activating conda environment: .snakemake/conda/1ea4ad9da9e4539afd010c34139325fa_
Downloading GitHub repo Crick-CancerGenomics/ascat@master
Skipping 3 packages not available: GenomicRanges, IRanges, S4Vectors
Installing 8 packages: data.table, doParallel, foreach, GenomicRanges, IRanges, RColorBrewer, S4Vectors, iterators
Error: Failed to install 'ASCAT' from GitHub:
  (converted from warning) packages ‘GenomicRanges’, ‘IRanges’, ‘S4Vectors’ are not available (for R version 3.6.3)
Execution halted
[Tue Jun 4 14:45:07 2024]
Error in rule _install_battenberg:
    jobid: 137
    output: results/battenberg-1.2/00-inputs/battenberg_dependencies_installed.success
    log: results/battenberg-1.2/logs/launched-2024-06-04-at-14-44-48/00-inputs/input.log (check log file(s) for error message)
    conda-env: /mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-modules/demo/.snakemake/conda/a85e0be326fb70ac2ccc6d95cb4ecce5
    shell:
        R -q --vanilla -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")' >> results/battenberg-1.2/logs/launched-2024-06-04-at-14-44-48/00-inputs/input.log && ##move some of this to config?
        R -q --vanilla -e 'devtools::install_github("morinlab/battenberg")' >> results/battenberg-1.2/logs/launched-2024-06-04-at-14-44-48/00-inputs/input.log && ##move some of this to config?
        touch results/battenberg-1.2/00-inputs/battenberg_dependencies_installed.success
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Kdreval commented 5 months ago

Hi @focusonskills, I believe this issue comes from remotes, which devtools runs under the hood and which recently changed how it finds non-CRAN packages. It now refuses to install anything from Bioconductor, complaining that the package was not found. I think adding the argument repos = BiocManager::repositories() to the devtools call should fix the problem, so the line

R -q --vanilla -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")'

should become

R -q --vanilla -e 'devtools::install_github("Crick-CancerGenomics/ascat/ASCAT", repos = BiocManager::repositories())'

Can you please see if this fixes the error?

Thank you!

focusonskills commented 5 months ago

@Kdreval It is able to locate the packages now, but it still fails to install some of them.

* installing *source* package ‘data.table’ ...
** package ‘data.table’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
fread.c: In function 'freadMain':
fread.c:1301:7: warning: ignoring return value of 'strtod', declared with attribute warn_unused_result [-Wunused-result]
    (void)strtod(ch, &end); // careful not to let "" get to here as strtod considers "" numeric
    ^~~~~~
/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-pipeline/lcr-modules/demo/.snakemake/conda/2239fcac3430199448647436028de9d0/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: cannot find -lgomp
collect2: error: ld returned 1 exit status
make: *** [/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-pipeline/lcr-modules/demo/.snakemake/conda/2239fcac3430199448647436028de9d0/lib/R/share/make/shlib.mk:6: data.table.so] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/mnt/raid/Analysis/Ongoing/Haloplex/OldPipeline/lcr-pipeline/lcr-modules/demo/.snakemake/conda/2239fcac3430199448647436028de9d0_/lib/R/library/data.table’
Error: Failed to install 'ASCAT' from GitHub:
  (converted from warning) installation of package ‘data.table’ had non-zero exit status
Execution halted

lkhilton commented 3 months ago

As of #327, we've updated the PyPI repository for Oncopipe and modified the demo/env.yaml file. You should now be able to install the correct version of Snakemake and all dependencies (including Oncopipe) with the command outlined in the README.

There is also a pending PR #326 that includes updates to the battenberg conda environment that should resolve the battenberg installation issues.

This should resolve these issues. Please let us know if there are any further stumbling blocks.