arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
929 stars 287 forks

Supporting other similarity measures than Jaccard #790

Open ivargr opened 4 years ago

ivargr commented 4 years ago

Hi!

I think it would be nice if bedtools supported measures other than the Jaccard index for quantifying the similarity between two tracks (BED files). I am interested in submitting a pull request that adds other measures, such as the Forbes coefficient (which is quite similar to Jaccard, but has been shown to be a better choice in many cases).

Implementing Forbes in bedtools seems quite trivial: I think it would just be a matter of copying all the jaccard-related files, making a new subcommand forbes, and changing one or two lines in the copied code. However, this would lead to a lot of duplicate code between the jaccard and forbes subcommands. That might be OK with only two similarity measures, but I think it would become a mess if more similarity metrics were added. Thus, I think it would be better to make a new subcommand (e.g. bedtools similarity) that takes a parameter specifying the similarity measure to use (jaccard could be the default). The jaccard subcommand could be kept for backwards compatibility.
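To make the idea concrete, here is a minimal Python sketch of a single entry point that selects the measure via a parameter. It is not bedtools code: the function names, the `measure` parameter, and the single-chromosome half-open interval representation are all assumptions made for illustration. Jaccard is taken as overlap divided by union, and Forbes as overlap times genome size divided by the product of the two coverages, so Forbes needs a genome size that Jaccard does not.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping or adjacent ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bp(intervals: List[Interval]) -> int:
    """Number of base pairs covered by a track."""
    return sum(end - start for start, end in merge(intervals))


def overlap_bp(a: List[Interval], b: List[Interval]) -> int:
    """Number of base pairs covered by both tracks (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def similarity(a: List[Interval], b: List[Interval],
               genome_size: int, measure: str = "jaccard") -> float:
    """Hypothetical dispatcher: one function, measure chosen by name."""
    inter, cov_a, cov_b = overlap_bp(a, b), covered_bp(a), covered_bp(b)
    if measure == "jaccard":
        union = cov_a + cov_b - inter
        return inter / union if union else 0.0
    if measure == "forbes":
        # Observed overlap relative to the overlap expected by chance.
        return inter * genome_size / (cov_a * cov_b) if cov_a and cov_b else 0.0
    raise ValueError(f"unknown similarity measure: {measure}")


track_a = [(0, 100)]
track_b = [(50, 150)]
print(similarity(track_a, track_b, genome_size=1000, measure="jaccard"))
print(similarity(track_a, track_b, genome_size=1000, measure="forbes"))
```

For these two toy tracks the overlap is 50 bp, so Jaccard is 50/150 and Forbes is 50 * 1000 / (100 * 100) = 5.0. Only the final formula differs between the two measures, which is the duplication argument above.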

I guess my questions are:

PS: The motivation for this is partly this paper (https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz083/5586919), of which I am a co-author.

arq5x commented 4 years ago

I think this is a very interesting idea. I will dig into this a bit next week and follow up soon.

arq5x commented 4 years ago

@ryanlayer and @brentp what do you think?

ryanlayer commented 4 years ago

Seems like adding a --similarity option to bedtools intersect would be a better place to collect these.

On Nov 30, 2019, at 11:34 AM, Aaron Quinlan notifications@github.com wrote:

 @ryanlayer and @brentp what do you think?


sandve commented 4 years ago

It would be great to reach a conclusion here :) As Ivar has not started any concrete implementation yet, I guess he would be open to whichever approach you prefer (and I think he is still interested in implementing it). I think it would be great for the field to have such flexibility available, especially since, in the study mentioned by Ivar, we found that the choice of measure has a dramatic impact on relative similarity values (rankings etc.) whenever there is large variation in track size (genomic coverage). After looking for arguments in all directions, we also ended up with a surprisingly strong case for fold enrichment (Forbes) over Jaccard (it was a bit unclear whether Forbes or tetrachoric correlation should be preferred, but both showed much more desirable behavior than Jaccard). In terms of convenience and interpretation, fold enrichment is also well established in bioinformatics in general. One might thus consider whether it would be desirable and practical to even make fold enrichment (Forbes) the default similarity metric (if so, I guess preferably without breaking backwards compatibility).
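As a toy illustration of how track size can change a ranking, the following Python sketch compares a query track against two made-up tracks of very different coverage. The numbers are invented for this example and are not taken from the study; the function names and the genome size are likewise assumptions.

```python
def jaccard(inter: int, cov_a: int, cov_b: int) -> float:
    """Overlap divided by union, all in base pairs."""
    return inter / (cov_a + cov_b - inter)


def forbes(inter: int, cov_a: int, cov_b: int, genome_size: int) -> float:
    """Observed overlap divided by the overlap expected by chance."""
    return inter * genome_size / (cov_a * cov_b)


GENOME = 100_000  # hypothetical genome size in bp
QUERY = 1_000     # coverage of the query track in bp

# Made-up tracks: name -> (coverage in bp, overlap with the query in bp)
tracks = {"A": (1_000, 300), "B": (100, 50)}

by_jaccard = sorted(
    tracks, key=lambda n: jaccard(tracks[n][1], QUERY, tracks[n][0]), reverse=True
)
by_forbes = sorted(
    tracks, key=lambda n: forbes(tracks[n][1], QUERY, tracks[n][0], GENOME), reverse=True
)
print("ranked by Jaccard:", by_jaccard)
print("ranked by Forbes: ", by_forbes)
```

Here Jaccard ranks the large track A first (300/1700 versus 50/1050), while Forbes ranks the small track B first (50 versus 30), because half of B falls inside the query but only 30% of A does.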

arq5x commented 4 years ago

So sorry for the delay. We all feel that this would be a very valuable addition and would pave the way for the future similarity metrics / approaches that we are envisioning. We think a new similarity tool would be great, and we are open to feedback about what might be needed in the future, given all of the work you have done!