arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
140 stars 85 forks source link

Incompatibility of clusterBed with BED-converted BAMs? #57

Closed yarden closed 11 years ago

yarden commented 11 years ago

I never leave the house without Bedtools, and recently discovered the immensely useful feature clusterBed and am finding many applications for it. I'm having trouble applying it to BAM files, though, and wanted to see if there's a way around it or if this utility is incompatible with BAMs.

I have a BAM file generated by tagBam that I'd like to run clusterBed on. I convert it to BED with bamToBed and then sort it in preparation for clusterBed:

bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -

That works great, but the fourth field of IDs is already populated by the conversion. As far as I can tell from the docs, clusterBed uses the fourth field to group the intervals into clusters since it does not merge them into meta-intervals. It looks like clusterBed cannot work with this output:

# BED-converted BAM
# (Subset of "bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -")
$ cat test.bed
chr1    3197300 3197319 102919-4    1   +
chr1    3197300 3197319 302711-1    50  +
chr1    3197300 3197319 417919-1    3   +
chr1    3197300 3197319 578672-1    3   +
chr1    3197300 3197319 754316-1    3   +
chr1    3197301 3197319 560330-1    0   +
chr1    3197302 3197319 366543-1    0   +
chr1    3197307 3197323 125538-3    0   -
chr1    3197307 3197324 444150-1    1   -

Now clusterBed does not appear to correctly group the intervals:

$ clusterBed -i test.bed
chr1    3197300 3197319 102919-4    1   +   1
chr1    3197300 3197319 302711-1    50  +   1
chr1    3197300 3197319 417919-1    3   +   1
chr1    3197300 3197319 578672-1    3   +   1
chr1    3197300 3197319 754316-1    3   +   1
chr1    3197301 3197319 560330-1    0   +   1
chr1    3197302 3197319 366543-1    0   +   1
chr1    3197307 3197323 125538-3    0   -   1
chr1    3197307 3197324 444150-1    1   -   1

Is there a way around this? I'm basically trying to find a way to run clusterBed on BAM files, and am happy to convert them to BEDs in the process. Thanks.

arq5x commented 11 years ago

Hi Yarden,

I think I am being dense. To my eye, all of those intervals overlap and should thus be in a single cluster. The output reports them all to be in cluster number "1" (the last column). At which interval do you think a new cluster should be started?

yarden commented 11 years ago

Hi Aaron,

Ack, sorry, I was the one being dense: I misread the docs. It said: "In the example below, the 4th column is the cluster ID" and then showed a BED where the 4th column in the file is the cluster ID. Therefore, I was expecting the 4th field of my BED (i.e. the one containing 102919-4, 302711-1, ...) to contain the cluster ID, but I see that for my BED it made the cluster ID the fifth column of the file. My bad. Thanks, --Yarden

arq5x commented 11 years ago

Ah, I see. No worries...