asntech / intervene

Intervene: a tool for intersection and visualization of multiple genomic region and gene sets
http://intervene.rtfd.io/
Other
132 stars 28 forks source link

Disagreement between Intervene and BEDtools #62

Open lovelycatZ opened 1 month ago

lovelycatZ commented 1 month ago

Hello,

Thank you for providing such a convenient tool!

I have installed Intervene via conda and tested with my own dataset hap1.bed and hap2.bed to generate a Venn diagram using command line intervene venn -i hap*.bed --names hap1,hap2 --save-overlaps. Finally, in the Intervene_result folder, there were three files 1)01_hap2.bed; 2)10_hap1.bed; 3)11_hap1_hap2.bed. For the third file 11_hap1_hap2.bed, it contained 5206 intervals same with the intersection part of the Venn diagram that were found in both A and B, right? But it was strange that when I use BEDtools with command line bedtools intersect -a hap1.bed -b hap2.bed, there were 5773 records in the result file.

How to explain the the difference?

Sarsaparella commented 1 month ago

I want to add that there's also difference in results depending on what bed file you put first. For example: With intervene venn -i Aaa.bed Bbb.bed I get 1930 intersecting intervals With intervene venn -i Bbb.bed Aaa.bed I get 1923 intersecting intervals And bedtools intersect -a Aaa.bed -b Bbb.bed | wc -l gives me 1947 intervals

lovelycatZ commented 1 month ago

Is there anybody can answer this question?

asntech commented 1 month ago

Hi @lovelycatZ @Sarsaparella thanks for your interest and posting this.

This is a known issue when doing genomic interval intersections. This is because, for genomic regions, you will not always have a one-to-one overlap. That's why you slight difference when changing the file order: -i Aaa.bed Bbb.bed != -i Aaa.bed Bbb.bed. For example, the first combination prints how many intervals in A overlap with those in B.

For example, if you've three BED files (a, b and c) and when you run the pybedtools:

(a + b + c) != (b + c + a)

This is also explained well enough here by the pybedtools developer as posted by @amizeranschi https://github.com/asntech/intervene/issues/27 https://github.com/daler/pybedtools/issues/45#issuecomment-2543863

Also, Intervene is using pybedtools in the background with u=True and v=True. You can run it with --bedtools-options u=False

I hope this helps.

amizeranschi commented 1 month ago

Hi @asntech

Please consider implementing bedtools multiinter as an alternative to bedtools intersect, which is not very intuitive to interpret and it leads to people coming here with questions like the one above.

Implementing bedtools multiinter would make the intersections commutative, which means that the order of the sets would not matter, so (a + b + c) == (b + c + a). This would also make the Venn and UpSet diagrams more intuitive to interpret.

You would just need to use this function from pybedtools: https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.multi_intersect.html

which is a wrapper for bedtools multiinter: https://bedtools.readthedocs.io/en/latest/content/tools/multiinter.html