YeoLab / merge_peaks

Pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks
9 stars 7 forks source link

overlap_peakfi_with_bam.pl #6

Closed alexmondaini closed 3 years ago

alexmondaini commented 3 years ago

I'm getting the following issue

WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
reading peak file /var/lib/cwl/stgf7e182df-5cae-4505-b70d-9330e5e663fd/10249_sample3_10249_sample4_ready.split1.peakClusters.bed
now doing expt /var/lib/cwl/stg0607570c-da94-4f28-b3f1-fd8352700c76/10249_sample3_10249_sample4_ready.split1.bam
R1 strand error 403
Use of uninitialized value $strand in concatenation (.) or string at /opt/merge_peaks/bin/perl/overlap_peakfi_with_bam.pl line 402, <B> line 107.
R1 strand error 403

Command is:

cwltool --singularity /groups/cgsd/alexandre/eclip/workflow/merge_peaks/cwl/wf_full_IDR_pipeline_2inputs_scatter.cwl /groups/cgsd/alexandre/eclip/workflow/merge_peaks/analysis.yml

My yaml is:

#!/usr/bin/env eCLIP_full_IDR_pipeline_2inputs_scatter_singleNode

species: hg19
samples:
  - 
    - name: "10249_sample1_10249_sample2_merged_clip_and_input"
      ip_bam: 
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-0/execution/10249_sample1_10249_sample2_ready.bam
      input_bam:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-2/execution/10249_sample5_10249_sample6_ready.bam
      peak_clusters:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-Clipper/shard-0/execution/10249_sample1_10249_sample2_ready.bed
    - name: "10249_sample3_10249_sample4_merged_clip_and_input"
      ip_bam: 
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-1/execution/10249_sample3_10249_sample4_ready.bam
      input_bam:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-3/execution/10249_sample7_10249_sample8_ready.bam
      peak_clusters:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-Clipper/shard-1/execution/10249_sample3_10249_sample4_ready.bed
chrom_sizes:
  class: File
  path: /groups/cgsd/alexandre/eclip/workflow/chrom.sizes

And images used are:

brianyee_clipper:5d865bb.sif
brianyee_merge_peaks:0.1.0.sif
brianyee_samtools:1.6.sif
byee4 commented 3 years ago

R1 strand error 403 means that the script encountered a read whose flag is unexpected (403), which means that it isn't a primary alignment. Per the eCLIP protocol, only uniquely mapped reads should pass the genome mapping step (STAR --outFilterMultimapNmax 1)

alexmondaini commented 3 years ago

@byee4 thanks Brian. One curiosity and I close the issue, as per the eCLIP protocol you use samtools view -f 128 to output only the second read in each pair for use with a single stranded peak caller. Why is that ? Why not conserving both reads or taking the first one only ? Thanks a lot.

byee4 commented 3 years ago

Hi Alex, good question. I'll have to refer to the original paper which describes the paired-end protocol and is available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4887338/

cluster identification was performed on usable reads using CLIPper (available at https://github.com/YeoLab/clipper/releases/tag/1.0) with options –s hg19 –o –bonferroni –superlocal --threshold-method binomial --save-pickle, considering read 2 only (the read that is enriched for termination at the crosslink site). For visualization on the UCSC Genome Browser, all tracks were RPM (reads per million) normalized against the total number of usable reads in that dataset.

alexmondaini commented 3 years ago

Thanks a lot Brian. Actually I'm trying to adapt your workflow for a slightly different scenario we want to experiment. I have to admit that I kept both reads instead of only the second one this time and I got file results from merge_peaks,that seemed to work 👍 .

However when I look at the logs I see R1 strand error 99 and R1 strand error 83. I'm wondering if /opt/merge_peaks/bin/perl/overlap_peakfi_with_bam.pl can accept an analysis on both reads instead of only the second one ? And if not can at least the first read be used instead of the second ? or only the second one is the right thing to do ? Thanks.

byee4 commented 3 years ago

Sorry, I'm not really sure how to support other use cases, but the R1/2 strand error is simply checking for very specific flags. For eCLIP, we know what kinds of reads and what flags we can expect, so that's what we check for starting on line 308:

https://github.com/YeoLab/merge_peaks/blob/18933d4d4b00e97a8a0d155abbebad1fdbc254aa/bin/perl/overlap_peakfi_with_bam.pl#L308