friendsofstrandseq / mosaicatcher-pipeline

Integrated workflow for SV calling from single-cell Strand-seq data
MIT License

[Bug]: pipeline's checkpoints seem to be malfunctioning #70

Open cachedpotato opened 4 days ago

cachedpotato commented 4 days ago

Contact Details

chiwonchung98@gmail.com

What happened?

I've been tinkering with this pipeline for a few days but I cannot get it to work. I'm fairly new to Snakemake, so I'm not entirely sure of the cause, but the core problem is that Snakemake seems to skip some rules that are directly or indirectly related to two checkpoints:

I've used both Snakemake v7 and v8, which led to different errors:

  1. Using v7 gave me an error stating that output files of rule symlink_selected_bam are missing.

  2. Using v8 gave me an error saying that files for checkpoint summarise_ploidy are missing, specifically the output files of rule estimate_ploidy.

As stated before, both steps relate to checkpoints in count.smk. The pipeline apparently ignores rules that are not "directly" tied to the checkpoint inputs/outputs. For example, running "snakemake --forceall -n --until symlink_selected_bam" gave me a DAG of 0 jobs.
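If it helps, my understanding of Snakemake's checkpoint mechanism is that jobs downstream of a checkpoint only enter the DAG once the checkpoint has actually run, which would explain the empty DAG above. A minimal sketch of that pattern (hypothetical rule and file names, not the pipeline's actual code):

checkpoint filter_cells:
    input:
        "{sample}/counts/raw.txt",
    output:
        "{sample}/counts/selected.txt",
    shell:
        "grep -v bad {input} > {output}"

def selected_bams(wildcards):
    # checkpoints.<name>.get() raises an exception until the checkpoint
    # has finished, making Snakemake defer DAG evaluation for any rule
    # that uses this function - so a dry run with --until on such a rule
    # reports 0 jobs while the checkpoint output does not exist yet.
    selected = checkpoints.filter_cells.get(sample=wildcards.sample).output[0]
    with open(selected) as f:
        cells = [line.strip() for line in f if line.strip()]
    return expand("{sample}/selected/{cell}.bam",
                  sample=wildcards.sample, cell=cells)

rule symlink_selected:
    input:
        selected_bams,
    output:
        touch("{sample}/selected/.done"),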

P.S. I've stated the version as 1.5.1, but the config file says I'm using 2.3.5.

Steps To Reproduce

  1. Use Default Config
  2. Run "snakemake -j 1"

Mosaicatcher-pipeline Version

1.5.1 (Default)

Command used

snakemake -j 1

How did you run the pipeline?

Conda only

What did you use to run the pipeline? (local execution, HPC, cloud)

local

Pipeline configuration file

I haven't touched anything in the configuration file. I got the test data by cloning with git clone --recursive and simply ran snakemake -j 1. Nevertheless, here's the config file:

# --------------------------------------------------------
# MosaiCatcher pipeline Configuration
# --------------------------------------------------------

# MosaiCatcher version
version: 2.3.5

# Ashleys-QC pipeline version
ashleys_pipeline_version: 2.3.5

# Email for notifications about the pipeline's status
email: ""

# List of samples to process if multiple are specified
samples_to_process: []

# --------------------------------------------------------
# Data location & I/O
# --------------------------------------------------------

# Absolute path to the data location (modify as needed)
data_location: ".tests/data_CHR17"

# Directory to publish important data (e.g., stats, plots, counts). Leave empty if not required.
publishdir: ""

# --------------------------------------------------------
# Ashleys-QC upstream pipeline
# --------------------------------------------------------

# Legacy folder organisation: bam/ and selected/ folder - Mutually exclusive with ashleys_pipeline
input_bam_legacy: False

# Enable/Disable ashleys-qc-pipeline module loading to start pipeline from FASTQ files
ashleys_pipeline: False

# Target only the output of ashleys-qc-pipeline
ashleys_pipeline_only: False

# Threshold for Ashleys-qc binary classification
ashleys_threshold: 0.5

# Enable or disable FastQC analysis
MultiQC: False

# Enable or disable ashleys-qc automatic classification
bypass_ashleys: False

# Whether to use paired-end data or not
paired_end: True

# --------------------------------------------------------
# Other modules
# --------------------------------------------------------

breakpointR: False
breakpointR_only: False
whatshap_only: False

# --------------------------------------------------------
# GENECORE Configuration
# --------------------------------------------------------

genecore: False
genecore_date_folder: ""
genecore_prefix: "/g/korbel/STOCKS/Data/Assay/sequencing/2023"
genecore_regex_element: "PE20"

# --------------------------------------------------------
# Reference Data Configuration
# --------------------------------------------------------

# Reference genome used by BWA to map FASTQ files
reference: "hg38"

# Reference assemblies data
references_data:
  "hg38":
    reference_fasta: "workflow/data/ref_genomes/hg38.fa"
    R_reference: "BSgenome.Hsapiens.UCSC.hg38"
    segdups: "workflow/data/segdups/segDups_hg38_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz

  "hg19":
    reference_fasta: "workflow/data/ref_genomes/hg19.fa"
    R_reference: "BSgenome.Hsapiens.UCSC.hg19"
    segdups: "workflow/data/segdups/segDups_hg19_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    # snv_sites_to_genotype: "/g/korbel2/weber/MosaiCatcher_files/snv_sites_to_genotype/ALL.chr1-22plusXY_hg19_sites.20170504.renamedCHR.vcf.gz"
    reference_file_location: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/analysisSet/hg19.p13.plusMT.no_alt_analysis_set.fa.gz
  "T2T":
    reference_fasta: "workflow/data/ref_genomes/T2T.fa"
    R_reference: "BSgenome.T2T.CHM13.V2"
    # R_reference_tarball: "/g/korbel2/weber/MosaiCatcher_files/EXTERNAL_DATA/R_reference//BSgenome.T2T.CHM13.V2_1.0.0.tar.gz"
    segdups: "workflow/data/segdups/segDups_T2T_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
  "mm10":
    reference_fasta: "workflow/data/ref_genomes/mm10.fa"
    R_reference: "BSgenome.Mmusculus.UCSC.mm10"
    segdups: "workflow/data/segdups/segDups_mm10_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz

# List of chromosomes to process
chromosomes:
  - chr1
  - chr2
  - chr3
  - chr4
  - chr5
  - chr6
  - chr7
  - chr8
  - chr9
  - chr10
  - chr11
  - chr12
  - chr13
  - chr14
  - chr15
  - chr16
  - chr17
  - chr18
  - chr19
  - chr20
  - chr21
  - chr22
  - chrX
  - chrY

# Specify any chromosomes to exclude from processing
chromosomes_to_exclude: []

# --------------------------------------------------------
# Counts Configuration
# --------------------------------------------------------

# Enable or disable multistep normalization analysis
multistep_normalisation: False

# Advanced parameters for multi-step normalisation
multistep_normalisation_options:
  min_reads_bin: 5
  n_subsample: 1000
  min_reads_cell: 100000

# Whether to normalise mosaic counts using the HGSVC-based reference file
hgsvc_based_normalized_counts: True

# Window size used by the mosaic binning algorithm
window: 200000

# Enable / Disable blacklisting of difficult-to-map regions
blacklist_regions: True

# Enable or disable hand selection through the Jupyter Notebook
hand_selection: False

# --------------------------------------------------------
# SV Calling Configuration
# --------------------------------------------------------

# Enable / Disable multistep normalisation output for SV calling
multistep_normalisation_for_SV_calling: False

methods:
  lenient:
    method_name: "simpleCalls_llr4_poppriorsTRUE_haplotagsTRUE_gtcutoff0_regfactor6_filterFALSE"
    llr: 4
    poppriors: TRUE
    haplotags: TRUE
    gtcutoff: 0
    regfactor: 6
    filter: "FALSE"

  stringent:
    method_name: "simpleCalls_llr4_poppriorsTRUE_haplotagsFALSE_gtcutoff0.05_regfactor6_filterTRUE"
    llr: 4
    poppriors: TRUE
    haplotags: TRUE
    gtcutoff: 0.05
    regfactor: 6
    filter: "TRUE"

# joint segmentation
min_diff_jointseg: 0.1

# single segmentation
min_diff_singleseg: 0.5

# SCE cutoff
additional_sce_cutoff: 20000000

# SCE min distance
sce_min_distance: 500000

# --------------------------------------------------------
# Downstream modules configuration - genome browsing
# --------------------------------------------------------
genome_browsing_files_generation: False

# --------------------------------------------------------
# Downstream modules configuration - ArbiGent
# --------------------------------------------------------

# ArbiGent (Arbitrary-segments genotyping) mode of execution
arbigent: False
# ArbiGent default BED file; can be changed and adapted to the user's question
arbigent_bed_file: "workflow/data/arbigent/manual_segmentation.bed"

# Arbigent Data
arbigent_data:
  arbigent_mapability_track: workflow/data/arbigent/mapping_counts_allchrs_hg38.txt
  arbigent_mapability_track_h5: workflow/data/arbigent/mapping_counts_allchrs_hg38.h5

# --------------------------------------------------------
# Downstream modules configuration - scNOVA
# --------------------------------------------------------

# Enable / Disable scNOVA - requires the scNOVA_input_user/input_subclonality.txt file
scNOVA: False

# Scripts location
scNOVA_scripts:
  generate_CN_for_CNN: "workflow/scripts/scNOVA_scripts/generate_CN_for_CNN.R"
  generate_CN_for_chromVAR: "workflow/scripts/scNOVA_scripts/generate_CN_for_chromVAR.R"
  count_sort_annotate_geneid: "workflow/scripts/scNOVA_scripts/count_sort_annotate_geneid.R"
  count_sort_label: "workflow/scripts/scNOVA_scripts/count_sort_label.R"
  count_norm: "workflow/scripts/scNOVA_scripts/count_norm.R"
  feature_sc_var: "workflow/scripts/scNOVA_scripts/feature_sc_var.R"
  combine_features: "workflow/scripts/scNOVA_scripts/combine_features.R"
  annot_expressed: "workflow/scripts/scNOVA_scripts/annot_expressed.R"
  infer_diff_gene_expression: "workflow/scripts/scNOVA_scripts/infer_diff_gene_expression.R"
  count_sort_annotate_chrid_CREs: "workflow/scripts/scNOVA_scripts/count_sort_annotate_chrid_CREs.R"
  infer_diff_gene_expression_alt: "workflow/scripts/scNOVA_scripts/infer_diff_gene_expression_alt.R"

# --------------------------------------------------------
# Plotting options
# --------------------------------------------------------

# Split / Do not split QC counts plot into single individual images (limit jobs)
split_qc_plot: False

plottype_counts:
  - "raw"
  - "normalised"

plottype_consistency:
  - "byaf"
  - "bypos"

plottype_clustering:
  - "position"
  - "chromosome"

scTRIP_multiplot: False

# --------------------------------------------------------
# StrandScape
# --------------------------------------------------------

strandscape_labels_path: ""

# --------------------------------------------------------
# Internal Parameters
# --------------------------------------------------------

# Is the pipeline used as a submodule in mosaicatcher-pipeline?
mosaicatcher_pipeline: True

# Overwrite ASHLEYS PREDICTIONS for GitHub & smoke dataset purpose
use_light_data: False

# For snakemake linting
abs_path: "/"

# Option to display all potential options - listed in config_metadata.yaml
list_commands: False

Relevant log output

I have two log outputs: one using snakemake v7 and one using v8
1. snakemake v7

[Mon Nov 18 02:21:45 2024]
localrule symlink_selected_bam:
    input: .tests/data_CHR17/RPE-BM510/bam/BM510x04_PE20314.sort.mdup.bam, .tests/data_CHR17/RPE-BM510/bam/BM510x04_PE20314.sort.mdup.bam.bai
    output: .tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam, .tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam.bai
    log: .tests/data_CHR17/log/symlink_selected_bam/RPE-BM510/BM510x04_PE20314.log
    jobid: 126
    reason: Missing output files: .tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam.bai, .tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam
    wildcards: folder=.tests/data_CHR17, sample=RPE-BM510, cell=BM510x04_PE20314
    resources: tmpdir=/tmp

Waiting at most 5 seconds for missing files.
MissingOutputException in rule symlink_selected_bam in file /pipeline/workflow/rules/count.smk, line 156:
Job 126  completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
.tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam
.tests/data_CHR17/RPE-BM510/selected/BM510x04_PE20314.sort.mdup.bam.bai
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-11-18T022134.157956.snakemake.log

2. snakemake v8
[Mon Nov 18 02:23:48 2024]
localcheckpoint filter_bad_cells_from_mosaic_count:
    input: .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info_raw, .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.txt.raw.gz, .tests/data_CHR17/RPE-BM510/config/labels.tsv
    output: .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info, .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info_rm, .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.txt.filter.gz
    log: .tests/data_CHR17/log/filter_bad_cells_from_mosaic_count/RPE-BM510.log
    jobid: 2
    reason: Forced execution
    wildcards: folder=.tests/data_CHR17, sample=RPE-BM510
    resources: tmpdir=/tmp, mem_mb=2000, mem_mib=1908
DAG of jobs will be updated after completion.

df.shape[0] only
[Mon Nov 18 02:23:48 2024]
Finished job 2.
45 of 47 steps (96%) done
InputFunctionException in rule merge_strandphaser_vcfs in file /pipeline/workflow/rules/strandphaser.smk, line 78:
Error:
  FileNotFoundError: [Errno 2] No such file or directory: '.tests/data_CHR17/RPE-BM510/ploidy/ploidy_summary.txt'
Wildcards:
  folder=.tests/data_CHR17
  sample=RPE-BM510
Traceback:
  File "/pipeline/workflow/rules/aggregate_fct.smk", line 27, in aggregate_vcf_gz
  File "/miniconda/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
  File "/miniconda/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
  File "/miniconda/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
  File "/miniconda/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
  File "/miniconda/lib/python3.12/site-packages/pandas/io/common.py", line 873, in get_handle (rule merge_strandphaser_vcfs, line 168, /pipeline/workflow/rules/strandphaser.smk)

Anything else?

[rule graph image: rules.jpg]

Just in case this helps, I've printed out the rule graph above using "snakemake -j 1 --forceall --rulegraph | dot -Tjpg -o rules.jpg".

Also attached are the full log files:

log.txt log8.txt

weber8thomas commented 1 day ago

Hi @cachedpotato! Thanks for reaching out.

P.S. I've stated the version as 1.5.1, but the config file says I'm using 2.3.5.

Why specify 1.5.1? That version is more than two years old.

As stated before, both steps relate to checkpoints in count.smk. The pipeline apparently ignores rules that are not "directly" tied to the checkpoint inputs/outputs. For example, running "snakemake --forceall -n --until symlink_selected_bam" gave me a DAG of 0 jobs.

The issue here is that you're using the test dataset without the test config (.tests/config/simple_config.yaml), as described in https://friendsofstrandseq.github.io/mosaicatcher-docs/latest/usage/
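So the test run should look something like this (config path as above, cores adjusted to taste):

snakemake -j 1 --configfile .tests/config/simple_config.yaml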

There is a condition in symlink_selected_bam where the file is copied instead of symlinked when using the test dataset, to prevent issues related to relative path definitions.
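Roughly, the pattern is the following (a simplified sketch with hypothetical paths, using the use_light_data flag from your config as the test-data switch; not the actual count.smk code):

import os
import shutil

rule symlink_selected_bam:
    input:
        bam="{folder}/{sample}/bam/{cell}.sort.mdup.bam",
    output:
        bam="{folder}/{sample}/selected/{cell}.sort.mdup.bam",
    run:
        if config["use_light_data"]:
            # test dataset: copy the file to sidestep symlink issues
            shutil.copy(input.bam, output.bam)
        else:
            # regular data: symlink with an absolute target; a relative
            # target would be resolved relative to selected/ and dangle,
            # which Snakemake then reports as missing output
            os.symlink(os.path.abspath(input.bam), output.bam)

With the test config, the copy branch is taken, which should avoid the MissingOutputException you hit under v7.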

Let me know if that solves your issue :)

P.S: Say Hi to Hyobin from the lab 👋

cachedpotato commented 16 hours ago

Hello,

Thank you for the help! Yeah, I figured my config file was the problem. I've only tested the ashleys-qc pipeline part so far, but it seems to be working. Using a different dataset gave me a different error regarding an R script in the ASHLEYS pipeline, for which I've opened a pull request (not sure if my fix is appropriate though - I'm quite new to all of the programs/models used in this pipeline).

Why specify 1.5.1? That version is more than two years old.

The bug report form only goes up to 1.5.1 IIRC, hence why I specified it that way.

Again thanks for the help! I'll send her your regards.

EDIT: I'm trying the second part of the pipeline (after ASHLEYS) on the test data, but with .tests/config/simple_config_mosaicatcher.yaml I get the same "FileNotFoundError: [Errno 2] No such file or directory: '.tests/data_CHR17/RPE-BM510/ploidy/ploidy_summary.txt'" error as before. I ran the pipeline with the following command:

snakemake \
    --cores=6 \
    --configfile .tests/config/simple_config_mosaicatcher.yaml \
    --profile workflow/snakemake_profiles/mosaicatcher-pipeline/v8/local/conda/

Is there anything else I should do? I feel like I'm forgetting something, but I can't figure out what at the moment (I did remove the singularity argument since I don't use Singularity, but the ashleys pipeline worked fine without it).

Thanks and best regards.