slsevilla commented 2 years ago

Annotation calling is currently one of the largest bottlenecks of the pipeline. It is split across several rules and accompanying scripts.

Rules

peak_Transcripts
peak_ExonIntron
peak_RMSK
peak_junctions
peak_process
project_annotations

Scripts

The general workflow is to run each annotation type separately before merging the results into one RMD file. This takes a significant amount of time and generates an individual job per sample per rule, which also uses more Biowulf resources than may be necessary.

Goals for the re-write

Speed up performance
Reduce the number of input/output files required for execution
Transfer all file creation from R files to snakemake
Reduce the number of rules required without sacrificing speed considerably

slsevilla commented 2 years ago

Rule ExonIntron

Project info

Three projects were created from previous runs to complete the benchmarking analysis:

project1: mESC_clip_4_v2.0
project2: 8-09-21-HaCaT_fCLIP_v2.0
project3: mES_fclip_1_YL_011622_v2.0

File info

All projects are set up with the following structure:

├── proj_number
│   ├── exp_output
│   └── input

Required inputs for one sample


├── input
│   └── 04_annotation
│       ├── 01_project
│       │   ├── 7SKRNA_Repeatmasker.bed
│       │   ├── annotations.txt
│       │   ├── DNA_Repeatmasker.bed
│       │   ├── lincRNA_Gencode.bed
│       │   ├── LINE\ SINE_Repeatmasker.bed
│       │   ├── lncRNA_Gencode.bed
│       │   ├── lncRNA_Gencode.txt
│       │   ├── Low_complexity_Repeatmasker.bed
│       │   ├── LTR_Repeatmasker.bed
│       │   ├── miRNA_Gencode.bed
│       │   ├── ncRNA_annotations.txt
│       │   ├── Other_Repeatmasker.bed
│       │   ├── ref_gencode.txt
│       │   ├── rRNA_Custom.bed
│       │   ├── rRNA_Gencode.bed
│       │   ├── rRNA_Repeatmasker.bed
│       │   ├── Satellite_Repeatmasker.bed
│       │   ├── scRNA_Repeatmasker.bed
│       │   ├── Simple_repeat_Repeatmasker.bed
│       │   ├── sncRNA_Custom.bed
│       │   ├── snoRNA_Gencode.bed
│       │   ├── snRNA_Gencode.bed
│       │   ├── srpRNA_Repeatmasker.bed
│       │   ├── tRNA_Custom.bed
│       │   ├── Unknown_Repeatmasker.bed
│       │   └── yRNA_Repeatmasker.bed
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions.txt
└── config
    └── annotation_config.txt


Expected outputs for one sample (Control1hr)

├── exp_output
│   └── 04_annotation
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
│           ├── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
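For benchmarking, the files a test run produces can be checked against the expected outputs listed above. The following is only a minimal sketch: the function name `compare_outputs` and the glob pattern are hypothetical, and it assumes that a correct run reproduces the expected files byte for byte, which may not hold if the R script writes rows in a non-deterministic order.

```python
import filecmp
from pathlib import Path


def compare_outputs(expected_dir, testing_dir, pattern="*_IntronExon_*Strand.txt"):
    """Map each expected file name to 'match', 'differ', or 'missing'.

    Hypothetical helper: compares files byte for byte, so it assumes
    the annotation output is deterministic between runs.
    """
    expected_dir, testing_dir = Path(expected_dir), Path(testing_dir)
    results = {}
    for expected in sorted(expected_dir.glob(pattern)):
        candidate = testing_dir / expected.name
        if not candidate.is_file():
            results[expected.name] = "missing"
        elif filecmp.cmp(expected, candidate, shallow=False):
            results[expected.name] = "match"
        else:
            results[expected.name] = "differ"
    return results


# Example call (paths follow the project layout described in this thread):
# base = Path("/data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1")
# compare_outputs(base / "exp_output/04_annotation/02_peaks",
#                 base / "testing_output/04_annotation/02_peaks")
```

If row order turns out to vary between runs, sorting both files before comparing would be a reasonable substitute for the byte-level check.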


Project input/output files are located here:

/data/RBL_NCI/Wolin/Sam/annotation_testing


Script calling
Script location:

/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/


Example R script (SameStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
  --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
  --peak_type ALL \
  --anno_anchor max_total \
  --read_depth 3 \
  --sample_id Control1hr_Clip \
  --ref_species mm10 \
  --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
  --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
  --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
  --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
  --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
  --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
  --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
  --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt \
  --anno_strand "SameStrand"


Example R script (OppoStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
  --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
  --peak_type ALL \
  --anno_anchor max_total \
  --read_depth 3 \
  --sample_id Control1hr_Clip \
  --ref_species mm10 \
  --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
  --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
  --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
  --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
  --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
  --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
  --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
  --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt \
  --anno_strand "OppoStrand"
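
The two example calls differ only in `--anno_strand` and the strand suffix of `--out_file`, so they can be generated from one template. The sketch below only assembles and prints the commands; it does not run them. The function name `build_command`, its defaults, and the three path constants are hypothetical conveniences; every flag and value is copied from the examples above.

```python
import shlex

# Base paths copied from the example calls in this thread.
SCRIPT_DIR = "/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts"
TEST_DIR = "/data/RBL_NCI/Wolin/Sam/annotation_testing"
REF_DIR = "/data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10"


def build_command(strand, sample_id="Control1hr_Clip", project="proj_1"):
    """Assemble the 05_Anno_ExonIntron.R call for one strand as an argument list."""
    if strand not in ("SameStrand", "OppoStrand"):
        raise ValueError(f"unexpected strand: {strand}")
    proj = f"{TEST_DIR}/{project}"
    out_dir = f"{proj}/testing_output/04_annotation/02_peaks/"
    out_file = f"{out_dir}{sample_id}_ALLreadPeaks_AllRegions_IntronExon_{strand}.txt"
    return [
        "Rscript", f"{SCRIPT_DIR}/05_Anno_ExonIntron.R",
        "--rscript", f"{SCRIPT_DIR}/05_peak_annotation_functions.R",
        "--peak_type", "ALL",
        "--anno_anchor", "max_total",
        "--read_depth", "3",
        "--sample_id", sample_id,
        "--ref_species", "mm10",
        "--anno_dir", f"{proj}/input/04_annotation/01_project/",
        "--reftable_path", f"{proj}/config/annotation_config.txt",
        "--gencode_path", f"{REF_DIR}/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt",
        "--intron_path", f"{REF_DIR}/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed",
        "--rmsk_path", f"{REF_DIR}/repeatmasker/rmsk_GRCm38.txt",
        "--tmp_dir", f"{TEST_DIR}/tmp",
        "--out_dir", out_dir,
        "--out_file", out_file,
        "--anno_strand", strand,
    ]


# Print both commands so they can be reviewed before being run by hand.
for strand in ("SameStrand", "OppoStrand"):
    print(shlex.join(build_command(strand)))
```

Passing the returned list to `subprocess.run` and timing the call would be one way to collect per-strand runtimes for the benchmarking projects.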

slsevilla commented 2 years ago

Create DAG of pipeline v2.0 for review

dag.pdf

wilfriedguiblet commented 2 years ago

Improved IE_calling speed in 05_peak_annotation_functions.R.

NCI-RBL / iCLIP

improve annotation in pipeline #125
