NCI-RBL / iCLIP

RNA Biology Pipeline to Characterize protein-RNA Interactions
https://rbl-nci.github.io/iCLIP/
MIT License

improve annotation in pipeline #125

Open slsevilla opened 2 years ago

slsevilla commented 2 years ago

Annotation calling is currently one of the largest bottlenecks in the pipeline. It is split across several rules and accompanying scripts.

Rules

Scripts

The general workflow is to run each annotation type separately and then merge the results into one RMD file. This takes a significant amount of time and generates an individual job per sample per rule, which also uses more Biowulf resources than may be necessary.

Goals for the re-write

  1. Speed up performance
  2. Reduce the number of input/output files required for execution
  3. Transfer all file creation from the R scripts to Snakemake
  4. Reduce the number of rules required without sacrificing speed considerably

slsevilla commented 2 years ago

Rule ExonIntron

Project info

Three projects were created from previous runs to complete the benchmarking analysis.

File info


- Expected outputs for one sample (Control1hr)

├── exp_output
│   └── 04_annotation
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
│           ├── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt


Project input/output files are located here:

/data/RBL_NCI/Wolin/Sam/annotation_testing
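For benchmarking a rewrite, one way to confirm that new outputs still match the expected ones is a byte-for-byte comparison of the two `02_peaks` directories. The sketch below is only an illustration: the function name `compare_outputs` is hypothetical, and it assumes the expected and test files share the same file names.

```shell
# Hypothetical helper: compare every expected .txt file with its
# same-named counterpart in the test output directory.
compare_outputs() {
    exp_dir="$1"
    test_dir="$2"
    status=0
    for exp_file in "$exp_dir"/*.txt; do
        # Skip the unexpanded glob when the directory has no .txt files
        [ -e "$exp_file" ] || continue
        name=$(basename "$exp_file")
        # cmp -s is silent and returns non-zero if the files differ
        # or if the test file is missing
        if cmp -s "$exp_file" "$test_dir/$name"; then
            echo "MATCH    $name"
        else
            echo "DIFFERS  $name"
            status=1
        fi
    done
    return "$status"
}
```

It would be called with the expected directory first and the test directory second, for example an `exp_output/04_annotation/02_peaks` directory against `proj_1/testing_output/04_annotation/02_peaks`; where exactly `exp_output` sits relative to each project is an assumption here.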


Script calling

Script location:

/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/


Example R script (SameStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
    --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
    --peak_type ALL \
    --anno_anchor max_total \
    --read_depth 3 \
    --sample_id Control1hr_Clip \
    --ref_species mm10 \
    --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
    --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
    --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
    --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
    --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
    --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
    --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
    --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt \
    --anno_strand "SameStrand"


Example R script (OppoStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
    --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
    --peak_type ALL \
    --anno_anchor max_total \
    --read_depth 3 \
    --sample_id Control1hr_Clip \
    --ref_species mm10 \
    --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
    --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
    --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
    --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
    --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
    --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
    --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
    --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt \
    --anno_strand "OppoStrand"
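
The two example calls differ only in `--out_file` and `--anno_strand`, so both could be driven from a single loop. The dry-run sketch below only prints abbreviated commands rather than running them. The helper name `anno_out_file` is hypothetical, and the `ALLreadPeaks_AllRegions_IntronExon` part of the file name is hard-coded from the examples above rather than derived from `--peak_type`.

```shell
# Hypothetical helper: build the output file name for a sample and strand,
# following the naming pattern seen in the two example calls.
anno_out_file() {
    printf '%s_ALLreadPeaks_AllRegions_IntronExon_%s.txt\n' "$1" "$2"
}

sample_id="Control1hr_Clip"
out_dir="/data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks"

for strand in SameStrand OppoStrand; do
    out_file="${out_dir}/$(anno_out_file "$sample_id" "$strand")"
    # Dry run: print an abbreviated command. A real call would carry the
    # remaining flags from the examples above unchanged.
    echo "Rscript 05_Anno_ExonIntron.R --sample_id ${sample_id} --out_file ${out_file} --anno_strand ${strand}"
done
```

Keeping the file-name logic in one place means a change to the naming scheme touches a single line instead of every per-strand call.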

slsevilla commented 2 years ago

Create DAG of pipeline v2.0 for review

dag.pdf

wilfriedguiblet commented 2 years ago

Improved IE_calling speed in 05_peak_annotation_functions.R.