[Bug]: chrY assigned a cumulative bin count of 0 due to off by 1 indexing error

Contact Details

petr.smirnov@embl.de

What happened?

I think this line: https://github.com/friendsofstrandseq/mosaicatcher-pipeline/blob/3422078530b95fb2b403bfa22d164f5892bedff1/workflow/scripts/plotting/plot-sv-calls_dev.R#L366

Is not doing what is intended. The code is likely meant to be one of the following:

 merge(seg, bins[, .N, by = chrom][, .(chrom, N = c(0, cumsum(N))[1:(.N)])], by = "chrom")

 merge(seg, bins[, .N, by = chrom][, .(chrom, N = c(0, cumsum(N)[1:(.N - 1)]))], by = "chrom")

Note the closing bracket placement on the second one for the c function.

I uncovered this while debugging an error in one of the samples, I don't think this is actually related to the error I am getting. However, R emits a warning when this line is run:

In as.data.table.list(jval, .named = NULL) :
  Item 2 has 23 rows but longest item has 24; recycled with remainder.

Currently, the code executes this:

> bins[, .N, by = chrom][, .(chrom, N = c(0, cumsum(N))[1:(.N - 1)])]

Resulting in output:

    chrom     N
 1:  chr1     0
 2: chr10  1245
 3: chr11  1914
 4: chr12  2590
 5: chr13  3257
 6: chr14  3829
 7: chr15  4365
 8: chr16  4875
 9: chr17  5327
10: chr18  5744
11: chr19  6146
12:  chr2  6440
13: chr20  7651
14: chr21  7974
15: chr22  8208
16:  chr3  8463
17:  chr4  9455
18:  chr5 10407
19:  chr6 11315
20:  chr7 12170
21:  chr8 12967
22:  chr9 13693
23:  chrX 14385
24:  chrY     0
    chrom     N

I don't think the value of 0 for chrY is what was intended in this case.

As an aside, I wonder if the sort order of the chromosomes matters in this case?

Steps To Reproduce

As far as I can tell, this will happen running the plot-sv-calls_dev.R script on any sample that has more than 1 chromosome.

FYI, I am using I am using Mosaic Catcher 2.0.1.

Mosaicatcher-pipeline Version

1.5.1 (Default)

Command used

The error occurred when manually executing the `plot-sv-calls_dev.R`, but was noticed when running: 

 --jobs 500 --config data_location=/g/huber/users/smirnov/StrandSeqData/LFS041 --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ --singularity-args "-B /g:/g -B /scratch:/scratch" -c4

How did you run the pipeline?

Conda + Singularity

What did you use to run the pipeline? (local execution, HPC, cloud)

EMBL HPC

Pipeline configuration file

version: 2.0.1
ashleys_pipeline_version: 2.0.0
#######################################
#   MOSAICATCHER CONFIGURATION FILE   #
#######################################
# Option to display all potential options - listed in config_metadata.yaml
list_commands: False
# To be informed of pipeline status
email: ""
# Input BAM location
data_location: "/g/huber/users/smirnov/StrandSeqData/LFS041"
# Reference assembly selected
reference: "hg38"
# Enable / Disable multistep normalisation analysis
multistep_normalisation: False
# ArbiGent (Arbitrary-segments genotyping) mode of execution
arbigent: False
# Arbigent default BED file, can be changed and adapted based on user question
arbigent_bed_file: "workflow/data/arbigent/manual_segmentation.bed"
# Enable / Disable FastQC analysis
FastQC_analysis: False
# Plate orientation for GC analysis
plate_orientation: landscape
# Normalize or not mosaic counts
hgsvc_based_normalized_counts: True
# Mutually exclusive with ashleys_pipeline
input_bam_legacy: False
# Enable/Disable ashleys-qc-pipeline module loading to start pipeline from FASTQ files
ashleys_pipeline: True
# Enable / Disable comparison for each BAM file between folder name & SM tag
check_sm_tag: True
# Split / Do not split QC counts plot into single individual images (limit jobs)
split_qc_plot: True
# Chromosomes list to process
chromosomes:
  - chr1
  - chr2
  - chr3
  - chr4
  - chr5
  - chr6
  - chr7
  - chr8
  - chr9
  - chr10
  - chr11
  - chr12
  - chr13
  - chr14
  - chr15
  - chr16
  - chr17
  - chr18
  - chr19
  - chr20
  - chr21
  - chr22
  - chrX
  - chrY

chromosomes_to_exclude: []

# GENECORE
genecore: False
samples_to_process: []
genecore_date_folder: ""
genecore_prefix: "/g/korbel/shared/genecore"
## Mosaic bin window size
window: 200000
##############################
# Advanced configuration.    #
##############################
alfred_plots:
  - "dist"
  - "devi"

## Reference assembly data
references_data:
  "hg38":
    reference_fasta: "workflow/data/ref_genomes/hg38.fa"
    R_reference: "BSgenome.Hsapiens.UCSC.hg38"
    segdups: "workflow/data/segdups/segDups_hg38_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz
    # snv_sites_to_genotype: "/g/korbel2/weber/MosaiCatcher_files/snv_sites_to_genotype/ALL.chr1-22plusXY_GRCh38_sites.20170504.renamedCHR.vcf.gz"

  "hg19":
    reference_fasta: "workflow/data/ref_genomes/hg19.fa"
    R_reference: "BSgenome.Hsapiens.UCSC.hg19"
    segdups: "workflow/data/segdups/segDups_hg19_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/analysisSet/hg19.p13.plusMT.no_alt_analysis_set.fa.gz
    # snv_sites_to_genotype: "/g/korbel2/weber/MosaiCatcher_files/snv_sites_to_genotype/ALL.chr1-22plusXY_hg19_sites.20170504.renamedCHR.vcf.gz"
  "T2T":
    reference_fasta: "workflow/data/ref_genomes/T2T.fa"
    R_reference: "BSgenome.T2T.CHM13.V2"
    # TO CHANGE
    R_reference_tarball: "/g/korbel2/weber/MosaiCatcher_files/EXTERNAL_DATA/R_reference//BSgenome.T2T.CHM13.V2_1.0.0.tar.gz"
    segdups: "workflow/data/segdups/segDups_T2T_UCSCtrack.bed.gz"
    snv_sites_to_genotype: ""
    reference_file_location: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
    # snv_sites_to_genotype: "/g/korbel2/weber/MosaiCatcher_files/snv_sites_to_genotype/ALL.chr1-22plusXY_T2T_sites.20170504.renamedCHR.vcf.gz"

## Methods dictionnary
methods:
  lenient:
    method_name: "simpleCalls_llr4_poppriorsTRUE_haplotagsTRUE_gtcutoff0_regfactor6_filterFALSE"
    llr: 4
    poppriors: TRUE
    haplotags: TRUE
    gtcutoff: 0
    regfactor: 6
    filter: "FALSE"

  stringent:
    method_name: "simpleCalls_llr4_poppriorsTRUE_haplotagsFALSE_gtcutoff0.05_regfactor6_filterTRUE"
    llr: 4
    poppriors: TRUE
    haplotags: TRUE
    gtcutoff: 0.05
    regfactor: 6
    filter: "TRUE"

plottype_counts:
  - "classic"
  - "GC_corrected"

plottype_consistency:
  - "byaf"
  - "bypos"

plottype_clustering:
  - "position"
  - "chromosome"

## Breakpoint density
# joint segmentation
min_diff_jointseg: 0.1
# single segmentation
min_diff_singleseg: 0.5
# SCE cutoff
additional_sce_cutoff: 20000000
# SCE min distance
sce_min_distance: 500000

# ashleys-qc pipeline arguments
mosaicatcher_pipeline: True
hand_selection: False
use_light_data: False
ashleys_threshold: 0.5

# Others
abs_path: "/"
# CURRENTLY DISABLED

### Modes ["count", "segmentation", "mosaiclassifier"] [CURRENTLY DISABLED]
# mode: "mosaiclassifier"
### Plot enabled [True] or disabled [False]
# plot: False [CURRENTLY DISABLED]

arbigent_data:
  arbigent_mapability_track: workflow/data/arbigent/mapping_counts_allchrs_hg38.txt
  arbigent_mapability_track_h5: workflow/data/arbigent/mapping_counts_allchrs_hg38.h5

# If specified, will copy important data (stats, plots, counts file) to a second place
publishdir: ""

scNOVA: False
scNOVA_scripts:
  generate_CN_for_CNN: "workflow/scripts/scNOVA_scripts/generate_CN_for_CNN.R"
  generate_CN_for_chromVAR: "workflow/scripts/scNOVA_scripts/generate_CN_for_chromVAR.R"
  count_sort_annotate_geneid: "workflow/scripts/scNOVA_scripts/count_sort_annotate_geneid.R"
  count_sort_label: "workflow/scripts/scNOVA_scripts/count_sort_label.R"
  count_norm: "workflow/scripts/scNOVA_scripts/count_norm.R"
  feature_sc_var: "workflow/scripts/scNOVA_scripts/feature_sc_var.R"
  combine_features: "workflow/scripts/scNOVA_scripts/combine_features.R"
  annot_expressed: "workflow/scripts/scNOVA_scripts/annot_expressed.R"
  infer_diff_gene_expression: "workflow/scripts/scNOVA_scripts/infer_diff_gene_expression.R"
  count_sort_annotate_chrid_CREs: "workflow/scripts/scNOVA_scripts/count_sort_annotate_chrid_CREs.R"
  infer_diff_gene_expression_alt: "workflow/scripts/scNOVA_scripts/infer_diff_gene_expression_alt.R"

# Multi-step normalisation advanced parameters
multistep_normalisation_options:
  min_reads_bin: 5
  n_subsample: 1000
  min_reads_cell: 100000

Relevant log output

Error in `[.data.table`(seg, , `:=`(from = c(1, bps[1:(.N - 1)] + 1),  :
  Supplied 2 items to be assigned to group 15 of size 1 in column 'from'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
Calls: [ -> [.data.table
In addition: Warning message:
In as.data.table.list(jval, .named = NULL) :
  Item 2 has 23 rows but longest item has 24; recycled with remainder.
Execution halted

Anything else?

No response

friendsofstrandseq / mosaicatcher-pipeline