arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
937 stars 287 forks source link

paired-end splice-aware genomecov? -split does nothing when -pc in use #516

Open ewallace opened 7 years ago

ewallace commented 7 years ago

Hi, This is either a bug or a feature request for the (wonderful, much appreciated) bedtools suite.

I have paired-end reads, and I'd like to compute splice-aware genome coverage. However, with genome cov I can only seem to do (a) splice-aware coverage of each mate separately using the -split option, OR (b) paired-end but non-splice-aware coverage using the -pc option. I would like to be able to use both options to get a result like that in IGV's automatic coverage option.

Minimal reproducible example attached:

Example bam file is: exampleYDR064W.bam hisat2-aligned to SacCer3 genome, filtered to reads mapping to YDR064W

-split only aligns each member of a read pair to a separate strand

bedtools genomecov -ibam exampleYDR064W.bam -bga -split -strand + > split_plus.bedgraph bedtools genomecov -ibam exampleYDR064W.bam -bga -split -strand - > split_minus.bedgraph

-pc only aligns both members of a read to the same strand, but fills in gaps/introns

bedtools genomecov -ibam exampleYDR064W.bam -bga -pc -strand + > pc_plus.bedgraph bedtools genomecov -ibam exampleYDR064W.bam -bga -pc -strand - > pc_minus.bedgraph

-split and -pc together is identical to -pc only; I want it to drop gaps inside reads, like the .IGV-computed coverage

bedtools genomecov -ibam exampleYDR064W.bam -bga -split -pc -strand + > splitpc_plus.bedgraph bedtools genomecov -ibam exampleYDR064W.bam -bga -split -pc -strand - > splitpc_minus.bedgraph

example_ydr064w_igv_snapshot

It also looks as if -pc switches strand assignment relative to my expectations?

Data in here: paired-end_splice-aware_genomecov.zip

Apologies if this has previously been addressed and my googling failed to find it.

Thanks Edward

ghost commented 7 years ago

Perhaps you could something like this to subtract out regions between the splits after using the -pc:

bedtools genomecov -ibam exampleYDR064W.bam -bga -pc -strand + > pc.bedgraph bedtools genomecov -ibam exampleYDR064W.bam -bga -strand + > unspilt.bedgraph bedtools genomecov -ibam exampleYDR064W.bam -bga -strand + -split > split.bedgraph bedtools unionbedg -i pc.bedgraph unspilt.bedgraph split.bedgraph | awk '{print $1,$2,$3,$4-$5+$6}' > pe_SpliceAware.bedgraph

scottkall commented 6 years ago

Hope it isn't bad form to bump this -- this feature would really help me out. Thanks, @bjf79, for your hack -- I'll try it out, though I'm hoping for something simpler.

mschilli87 commented 6 years ago

@arq5x:

Is this the recommendet way to get coverage tracks for paired-end spliced data?

bedtools genomecov -ibam exampleYDR064W.bam -bga -pc -strand + > pc.bedgraph
bedtools genomecov -ibam exampleYDR064W.bam -bga -strand + > unspilt.bedgraph
bedtools genomecov -ibam exampleYDR064W.bam -bga -strand + -split > split.bedgraph
bedtools unionbedg -i pc.bedgraph unspilt.bedgraph split.bedgraph | awk '{print $1,$2,$3,$4-$5+$6}' > pe_SpliceAware.bedgraph

If so, could this be added to the documentation or even be implemented as a new flag for genomecov (as it is using bedtools only except for the little awk-part that shouldn't be hard to do internally, too).

I think this is an extremely important feature currently missing in bedtools as

  1. most sequencing data nowadays are stranded and
  2. many people nowadays map data splice-aware using STAR but
  3. bedtools genomecov created coverage tracks for these kind of data look 'unstranded' and simply do not reflect biology.

But maybe I am missing some corner cases here?


edit: Just for reference, a very old discussion on a similar issue.

ewallace commented 6 years ago

Hi, This still an important need, for my group. Is there anything I can do to help move it up the agenda? Sadly this is beyond my knowledge. If you know a software engineer who could fix this, I would happily look in to paying their time from my grant. Edward

JohnUrban commented 3 years ago

I was hoping to use -pc and -split together as well Was this feature ever addressed? Does that 4-step protocol work as intended?