Open slsevilla opened 2 years ago
Three projects were created from previous runs to complete the benchmarking analysis.
All projects are set up with the following structure:
├── proj_number
│   ├── exp_output
│   └── input
Required inputs for one sample
├── input
│   └── 04_annotation
│       ├── 01_project
│       │   ├── 7SKRNA_Repeatmasker.bed
│       │   ├── annotations.txt
│       │   ├── DNA_Repeatmasker.bed
│       │   ├── lincRNA_Gencode.bed
│       │   ├── LINE\ SINE_Repeatmasker.bed
│       │   ├── lncRNA_Gencode.bed
│       │   ├── lncRNA_Gencode.txt
│       │   ├── Low_complexity_Repeatmasker.bed
│       │   ├── LTR_Repeatmasker.bed
│       │   ├── miRNA_Gencode.bed
│       │   ├── ncRNA_annotations.txt
│       │   ├── Other_Repeatmasker.bed
│       │   ├── ref_gencode.txt
│       │   ├── rRNA_Custom.bed
│       │   ├── rRNA_Gencode.bed
│       │   ├── rRNA_Repeatmasker.bed
│       │   ├── Satellite_Repeatmasker.bed
│       │   ├── scRNA_Repeatmasker.bed
│       │   ├── Simple_repeat_Repeatmasker.bed
│       │   ├── sncRNA_Custom.bed
│       │   ├── snoRNA_Gencode.bed
│       │   ├── snRNA_Gencode.bed
│       │   ├── srpRNA_Repeatmasker.bed
│       │   ├── tRNA_Custom.bed
│       │   ├── Unknown_Repeatmasker.bed
│       │   └── yRNA_Repeatmasker.bed
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions.txt
└── config
    └── annotation_config.txt
- Expected outputs for one sample (Control1hr)
├── exp_output
│   └── 04_annotation
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
│           ├── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
Project input/output files are located here:
/data/RBL_NCI/Wolin/Sam/annotation_testing
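For benchmarking, each file under `exp_output` can be checked against the file of the same name written by a test run. The sketch below is one way to do that. It assumes a re-written script should reproduce the expected files byte for byte; if row order is allowed to differ, the files would need to be sorted before comparing. The function name `compare_outputs` and its two directory arguments are hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Compare every expected .txt file with the same-named file from a test run.
# Usage: compare_outputs <expected_dir> <testing_dir>
# Returns 0 only if every expected file exists in <testing_dir> and is identical.
compare_outputs() {
  local expected_dir="$1" testing_dir="$2" rc=0 f name
  for f in "$expected_dir"/*.txt; do
    [ -e "$f" ] || continue                 # the glob matched nothing
    name="$(basename "$f")"
    if [ ! -f "$testing_dir/$name" ]; then
      printf '%-8s %s\n' MISSING "$name"
      rc=1
    elif cmp -s "$f" "$testing_dir/$name"; then
      printf '%-8s %s\n' MATCH "$name"
    else
      printf '%-8s %s\n' DIFF "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example call for proj_1 (assumes test runs write to testing_output,
# as the example commands in this issue do):
# proj=/data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1
# compare_outputs "$proj/exp_output/04_annotation/02_peaks" \
#                 "$proj/testing_output/04_annotation/02_peaks"
```

A non-zero return status makes the check easy to use as a pass/fail gate after each change to the annotation scripts.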
## Script calling
Script location:
/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/
Example R script (SameStrand, proj_1):
Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
  --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
  --peak_type ALL \
  --anno_anchor max_total \
  --read_depth 3 \
  --sample_id Control1hr_Clip \
  --ref_species mm10 \
  --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
  --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
  --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
  --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
  --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
  --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
  --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
  --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt \
  --anno_strand "SameStrand"
Example R script (OppoStrand, proj_1):
Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
  --rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
  --peak_type ALL \
  --anno_anchor max_total \
  --read_depth 3 \
  --sample_id Control1hr_Clip \
  --ref_species mm10 \
  --anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
  --reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
  --gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
  --intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
  --rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
  --tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
  --out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
  --out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt \
  --anno_strand "OppoStrand"
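The two example calls differ only in `--anno_strand` and in the strand suffix of `--out_file`, so they can be generated from a single template. The sketch below prints the command for a given project, sample ID, and strand without running it. The flags and paths are copied from the examples; the helper name `build_anno_cmd` and its three arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Print the annotation command for one project, sample ID, and strand setting.
# Flags and paths are copied from the example calls; the function name and
# its three arguments are hypothetical.
build_anno_cmd() {
  local proj="$1" sample="$2" strand="$3"
  local scripts="/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts"
  local ref="/data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10"
  local base="/data/RBL_NCI/Wolin/Sam/annotation_testing"
  local out_dir="$base/$proj/testing_output/04_annotation/02_peaks/"

  # Each argument is printed on its own line, followed by " \".
  printf '%s \\\n' \
    "Rscript $scripts/05_Anno_ExonIntron.R" \
    "  --rscript $scripts/05_peak_annotation_functions.R" \
    "  --peak_type ALL" \
    "  --anno_anchor max_total" \
    "  --read_depth 3" \
    "  --sample_id $sample" \
    "  --ref_species mm10" \
    "  --anno_dir $base/$proj/input/04_annotation/01_project/" \
    "  --reftable_path $base/$proj/config/annotation_config.txt" \
    "  --gencode_path $ref/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt" \
    "  --intron_path $ref/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed" \
    "  --rmsk_path $ref/repeatmasker/rmsk_GRCm38.txt" \
    "  --tmp_dir $base/tmp" \
    "  --out_dir $out_dir" \
    "  --out_file ${out_dir}${sample}_ALLreadPeaks_AllRegions_IntronExon_${strand}.txt"
  # The final argument has no trailing backslash.
  printf '  --anno_strand "%s"\n' "$strand"
}

# Dry run: print both strand variants for proj_1.
for strand in SameStrand OppoStrand; do
  build_anno_cmd proj_1 Control1hr_Clip "$strand"
  echo
done

# To execute and time one variant instead of printing it:
# time bash -c "$(build_anno_cmd proj_1 Control1hr_Clip SameStrand)"
```

Printing first and executing separately keeps the dry run safe to use on a login node, and wrapping the call in `time` gives a wall-clock figure for comparing the script before and after a change.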
Improved IE_calling speed in 05_peak_annotation_functions.R.
Annotation calling is currently one of the largest bottlenecks of the pipeline. It is split into several rules and accompanying scripts.
Rules
Scripts
The general workflow is to run each annotation type separately before merging the results into one RMD file. This takes a significant amount of time and generates an individual job per sample per rule, which uses more Biowulf resources than may be necessary.
Goals for the re-write