Closed amizeranschi closed 11 months ago
Hello again,
I have extended this exercise to three sets and I am running into the same issue. The results of Intervene
don't match those of other tools. Also, when computing the total length of regions from one set from the different elements (subsets) that should compose it, the values don't add up as they should, but they do add up in the case of other tools (bedops
and bedr
were tested).
Why is this the case?
Here is my code:
## set up a temporary directory
temp_dir=/path/to/temp_dir
mkdir $temp_dir
cd $temp_dir
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me1.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me2.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me3.bed
## run Intervene and save the BED files for the overlaps
intervene upset --output overlap_counts_3 --save-overlaps -i H3K4me1.bed H3K4me2.bed H3K4me3.bed
## compute the length of the UpSet diagram elements (subsets)
cd $temp_dir/overlap_counts_3/sets
ln -s $temp_dir/H3K4me1.bed H3K4me1.bed
ln -s $temp_dir/H3K4me2.bed H3K4me2.bed
ln -s $temp_dir/H3K4me3.bed H3K4me3.bed
## manually compute the intersection and set differences using BEDOPS
bedops --intersect H3K4me2.bed H3K4me1.bed H3K4me3.bed > BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed
bedops --intersect H3K4me2.bed H3K4me1.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_110_H3K4me2_H3K4me1.bed
bedops --intersect H3K4me2.bed H3K4me3.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_011_H3K4me2_H3K4me3.bed
bedops --intersect H3K4me1.bed H3K4me3.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_101_H3K4me1_H3K4me3.bed
bedops --everything BEDOPS_101_H3K4me1_H3K4me3.bed BEDOPS_011_H3K4me2_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me3.bed - > BEDOPS_001_H3K4me3.bed
bedops --everything BEDOPS_110_H3K4me2_H3K4me1.bed BEDOPS_011_H3K4me2_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me2.bed - > BEDOPS_010_H3K4me2.bed
bedops --everything BEDOPS_110_H3K4me2_H3K4me1.bed BEDOPS_101_H3K4me1_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me1.bed - > BEDOPS_100_H3K4me1.bed
## write the total lengths to a file
echo -e "Set element\t$Total length (bp)" > lengths.txt
for file_name in *.bed; do
subset_length=$(cat $file_name | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
echo -e "${file_name}\t${subset_length}" >> lengths.txt
done
cat lengths.txt
This is the result:
Set element length (bp)
001_H3K4me3.bed 1586566
010_H3K4me2.bed 1945128
011_H3K4me2_H3K4me3.bed 4000088
100_H3K4me1.bed 56723387
101_H3K4me1_H3K4me3.bed 1503887
110_H3K4me1_H3K4me2.bed 86126244
111_H3K4me1_H3K4me2_H3K4me3.bed 275359610
BEDOPS_001_H3K4me3.bed 19700303
BEDOPS_010_H3K4me2.bed 23749016
BEDOPS_011_H3K4me2_H3K4me3.bed 21965441
BEDOPS_100_H3K4me1.bed 290903957
BEDOPS_101_H3K4me1_H3K4me3.bed 11191800
BEDOPS_110_H3K4me2_H3K4me1.bed 54706745
BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed 50072261
H3K4me1.bed 406874763
H3K4me2.bed 150493463
H3K4me3.bed 102929805
Here, again, I would expect the total length of regions in H3K4me1.bed
to be equal to the sums of total lengths of regions from the files 100_H3K4me1.bed
, 101_H3K4me1_H3K4me3.bed
, 110_H3K4me1_H3K4me2.bed
and 111_H3K4me1_H3K4me2_H3K4me3.bed
. However, this is not true for the files produced by Intervene
, although it is true for those produced by bedops
.
Plotting the 3 BED files as a Venn diagram with the R package bedr
produces, again, identical results as those of bedops
:
@amizeranschi can you tell me what are you using as a back-end for bedr
? BEDTools
or BEDOPS
?
Can you try bedr
with feature = "interval"
instead of feature = "bp"
?
@amizeranschi And also you can set either regions of A or B to output in the intervene by providing wa
or wb
to --bedtools-options
This difference could be due to this. And I am not sure how feature = "bp"
works in bedr.
@asntech I think bedr
uses bedtools
as an engine, by default. I haven't changed this. The exact R code that I used is pasted above.
The reason I used feature = "bp"
in bedr
is because this answers exactly the question that I am looking for: namely, what is the length, in bp, of the overlapping regions between multiple BED files.
I have also computed this manually with bedops
, in two steps, as shown above. First, I computed the overlapping regions using bedops --intersect
and bedops --difference
. Then, I computed the total length (in bp) of the respective BED files, using bedops --merge
and GNU awk
.
Note: the --intersect
operation is commutative in bedops
, so the order of the input files does not matter:
$ bedops --intersect H3K27me3.bed H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
$ bedops --intersect H3K4me1.bed H3K27me3.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
I believe the BED intersection operation that you originally had in mind with Intervene
is more similar, if not identical, to the --element-of
operation in bedops
: https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#element-of-e-element-of. It would be very helpful if you could implement my use case as well.
Update: it looks like the bedtools intersect
and bedtools subtract
operations give the same results as the corresponding bedops
operations, when used with u=False
and v=False
:
$ bedops --intersect H3K27me3.bed H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
$ bedops --intersect H3K4me1.bed H3K27me3.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
$ bedtools intersect -a H3K27me3.bed -b H3K4me1.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
$ bedtools intersect -a H3K4me1.bed -b H3K27me3.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607
$ bedops --difference H3K27me3.bed BEDOPS_11_H3K27me3_H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
135056952
$ bedtools subtract -a H3K27me3.bed -b BEDOPS_11_H3K27me3_H3K4me1.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
135056952
So, it looks like implementing a way to set u=False
and v=False
for pybedtools
in Intervene
would solve my issue.
Update2:
After looking at it in more depth, things were a bit more complicated than I initially expected. However, I've managed to manually compute Intervene's UpSet diagram elements for my use case with bedtools
, in the same way I've done earlier with bedops
.
I'd be grateful if you could add support for automating this use case in Intervene
, as bedr
only supports Venn diagrams (no UpSet support) for a maximum of 5 BED files.
I'm including my updated code for 3 BED files below, in the hope that it will be helpful.
In a nutshell, the bedtools multiinter
command should be used for computing set intersections and cat *.bed | bedtools subtract
for set unions and differences.
## set up a temporary directory
temp_dir=/path/to/temp_dir
mkdir $temp_dir
cd $temp_dir
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me1.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me2.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me3.bed
## run Intervene and save the BED files for the overlaps
intervene upset --output overlap_counts_3 --save-overlaps -i H3K4me1.bed H3K4me2.bed H3K4me3.bed
## compute the length of the UpSet diagram elements (subsets)
cd $temp_dir/overlap_counts_3/sets
## prefix all the file names for Intervene's original output
prefix="INTERVENE_"
for filename in *.bed
do
mv "$filename" "$prefix$filename"
done
ln -s $temp_dir/H3K4me1.bed H3K4me1.bed
ln -s $temp_dir/H3K4me2.bed H3K4me2.bed
ln -s $temp_dir/H3K4me3.bed H3K4me3.bed
## manually compute the intersection and set differences using BEDOPS
bedops --intersect H3K4me2.bed H3K4me1.bed H3K4me3.bed > BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed
bedops --intersect H3K4me2.bed H3K4me1.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____110_H3K4me2_H3K4me1.bed
bedops --intersect H3K4me2.bed H3K4me3.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____011_H3K4me2_H3K4me3.bed
bedops --intersect H3K4me1.bed H3K4me3.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____101_H3K4me1_H3K4me3.bed
bedops --everything BEDOPS____101_H3K4me1_H3K4me3.bed BEDOPS____011_H3K4me2_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me3.bed - > BEDOPS____001_H3K4me3.bed
bedops --everything BEDOPS____110_H3K4me2_H3K4me1.bed BEDOPS____011_H3K4me2_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me2.bed - > BEDOPS____010_H3K4me2.bed
bedops --everything BEDOPS____110_H3K4me2_H3K4me1.bed BEDOPS____101_H3K4me1_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me1.bed - > BEDOPS____100_H3K4me1.bed
## manually compute the intersection and set differences using BEDTOOLS
bedtools multiinter -i H3K4me2.bed H3K4me1.bed H3K4me3.bed | grep "1,2,3" | cut -f 1-3 > BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed
bedtools multiinter -i H3K4me2.bed H3K4me1.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__110_H3K4me2_H3K4me1.bed
bedtools multiinter -i H3K4me2.bed H3K4me3.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__011_H3K4me2_H3K4me3.bed
bedtools multiinter -i H3K4me1.bed H3K4me3.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__101_H3K4me1_H3K4me3.bed
cat BEDTOOLS__101_H3K4me1_H3K4me3.bed BEDTOOLS__011_H3K4me2_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me3.bed -b - > BEDTOOLS__001_H3K4me3.bed
cat BEDTOOLS__110_H3K4me2_H3K4me1.bed BEDTOOLS__011_H3K4me2_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me2.bed -b - > BEDTOOLS__010_H3K4me2.bed
cat BEDTOOLS__110_H3K4me2_H3K4me1.bed BEDTOOLS__101_H3K4me1_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me1.bed -b - > BEDTOOLS__100_H3K4me1.bed
## write the total lengths to a file
echo -e "Set element\t$Total length (bp)" > lengths.txt
for file_name in *.bed; do
subset_length=$(cat $file_name | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
echo -e "${file_name}\t${subset_length}" >> lengths.txt
done
cat lengths.txt
This yields:
$ cat lengths.txt
Set element length (bp)
BEDOPS____001_H3K4me3.bed 19700303
BEDOPS____010_H3K4me2.bed 23749016
BEDOPS____011_H3K4me2_H3K4me3.bed 21965441
BEDOPS____100_H3K4me1.bed 290903957
BEDOPS____101_H3K4me1_H3K4me3.bed 11191800
BEDOPS____110_H3K4me2_H3K4me1.bed 54706745
BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed 50072261
BEDTOOLS__001_H3K4me3.bed 19700303
BEDTOOLS__010_H3K4me2.bed 23749016
BEDTOOLS__011_H3K4me2_H3K4me3.bed 21965441
BEDTOOLS__100_H3K4me1.bed 290903957
BEDTOOLS__101_H3K4me1_H3K4me3.bed 11191800
BEDTOOLS__110_H3K4me2_H3K4me1.bed 54706745
BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed 50072261
H3K4me1.bed 406874763
H3K4me2.bed 150493463
H3K4me3.bed 102929805
INTERVENE_001_H3K4me3.bed 1586566
INTERVENE_010_H3K4me2.bed 1945128
INTERVENE_011_H3K4me2_H3K4me3.bed 4000088
INTERVENE_100_H3K4me1.bed 56723387
INTERVENE_101_H3K4me1_H3K4me3.bed 1503887
INTERVENE_110_H3K4me1_H3K4me2.bed 86126244
INTERVENE_111_H3K4me1_H3K4me2_H3K4me3.bed 275359610
Also, regarding:
@amizeranschi can you tell me what are you using as a back-end for
bedr
?BEDTools
orBEDOPS
?
I reran bedr
with the three sample BED files mentioned above and saw the following within its output:
bedtools multiinter -i /tmp/Rtmpbdtb2K/i_5cef7f415bae.bed /tmp/Rtmpbdtb2K/_5cef45258826.bed /tmp/Rtmpbdtb2K/_5cef3fb99ba6.bed -names H3K4me1 H3K4me2 H3K4me3 -g /path/to/bedr/genomes/human.hg19.genome -filler .
So it looks like it is indeed using bedtools
, specifically the multiinter
command, which I've also tried running above.
Hello,
I've been testing
Intervene
on two of the sample tests. I've generated an UpSet plot and saved the corresponding BED files for the intersection and set differences and then computed the total length of the features from each of the files.I would expect that the summed total lengths would equal that of the initial files, but this is not the case. Any assistance in clarifying this issue would be much appreciated, as I was planning on using Intervene on a larger number of files.
I've also manually computed the intersection and set differences using
bedops
and, this time, the summed total lengths were as I'd expect. Finally, I've plotted a Venn diagram of the intersections using the R packagebedr
and its results match those obtained withbedops
. I'm including the diagram below.Below is a minimal reproducible example for what I did:
These are the results I got:
Here, I would expect that the total length of e.g.
H3K27me3.bed
would be equal to the summed total lengths of10_H3K27me3.bed
and11_H3K27me3_H3K4me1.bed
, but this wasn't the case. However, the results produced withbedops
did match this expectation.I have used the following R code to produce a Venn diagram using the R package
bedr
:Below is the resulting Venn diagram. The total bp length values are identical to those computed above by
bedops
.