Open ivargr opened 4 years ago
I think this is a very interesting idea. I will dig into this a bit next week and follow up soon.
@ryanlayer and @brentp what do you think?
Seems like adding a —similarity
On Nov 30, 2019, at 11:34 AM, Aaron Quinlan notifications@github.com wrote:
@ryanlayer and @brentp what do you think?
It would be great to reach a conclusion here :) Since Ivar has not started a concrete implementation yet, I guess he would be open to whichever approach you prefer (and I think he is still interested in implementing it). I think it would be great for the field to have this flexibility available, especially since, in the study Ivar mentioned, we found that the choice of measure has a dramatic impact on relative similarity values (rankings etc.) whenever there is large variation in track size (genomic coverage). After looking for arguments in all directions, we also ended up with a surprisingly strong case for fold enrichment (Forbes) over Jaccard (it was a bit unclear whether Forbes or tetrachoric correlation should be preferred, but both showed much more desirable behavior than Jaccard). In terms of convenience and interpretation, fold enrichment is also well established in bioinformatics in general. One might thus consider whether it would be desirable and practical to make fold enrichment (Forbes) the default similarity metric (if so, I guess preferably without breaking backwards compatibility).
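For reference, a sketch of how the two measures are commonly defined. The notation is an assumption here and is not taken from this thread: $A$ and $B$ are the sets of base pairs covered by the two tracks, and $N$ is the genome length in base pairs.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad F(A, B) = \frac{N \, |A \cap B|}{|A| \, |B|}
$$

Forbes is the observed overlap divided by the overlap $|A||B|/N$ expected if the two tracks were independent, so $F = 1$ means no enrichment. Jaccard can never exceed $\min(|A|, |B|) / \max(|A|, |B|)$, which is one way to see why it depends so strongly on differences in track size.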
So sorry for the delay. We all feel that this would be a very valuable addition and would pave the way for the future similarity metrics and approaches that we are envisioning. We think a new `similarity` tool would be great, and we are open to feedback about what might be needed in the future, given all of the work you have done!
Hi!
I think it would be nice if bedtools supported measures other than the Jaccard index for measuring the similarity between two tracks (BED files), and I am interested in submitting a pull request that adds functionality for computing other measures, such as the Forbes coefficient (which is quite similar to Jaccard, but has been shown to work better in many cases).
Implementing Forbes in bedtools seems quite trivial: I think it would just be a matter of copying all the jaccard-related files, making a new subcommand `forbes`, and changing one or two lines in the copied code. However, this would lead to a lot of duplicated code between the `jaccard` and `forbes` subcommands. That might be OK with only two similarity measures, but I think it would become a mess if more similarity metrics are added. Thus, I think it would be better to make a new subcommand (e.g. `bedtools similarity`) which takes a parameter specifying the similarity measure to use (this could be Jaccard by default). The `jaccard` subcommand could be kept for backwards compatibility.

I guess my questions are:

[...] `bedtools similarity`) then would be the way to go, or is it better to just add a new subcommand for each new similarity metric?

PS: The motivation for this is partly this paper (https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz083/5586919), of which I am a co-author.
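To make the single-subcommand idea above concrete, here is a minimal Python sketch of the design: the overlap statistics are computed once, and the chosen measure is looked up by name. This is only an illustration under stated assumptions. bedtools itself is written in C++, and the names `MEASURES`, `similarity`, and the `measure` parameter are hypothetical, not part of any existing bedtools interface.

```python
from typing import Callable, Dict

# A measure maps (bp in intersection, bp in A, bp in B, genome length) to a score.
Measure = Callable[[int, int, int, int], float]


def jaccard(n_ab: int, n_a: int, n_b: int, genome_len: int) -> float:
    """Intersection over union, both in base pairs."""
    union = n_a + n_b - n_ab
    return n_ab / union if union else 0.0


def forbes(n_ab: int, n_a: int, n_b: int, genome_len: int) -> float:
    """Observed overlap divided by the overlap expected under independence."""
    denom = n_a * n_b
    return genome_len * n_ab / denom if denom else 0.0


# Hypothetical registry: adding a measure means adding one function and one entry.
MEASURES: Dict[str, Measure] = {"jaccard": jaccard, "forbes": forbes}


def similarity(n_ab: int, n_a: int, n_b: int, genome_len: int,
               measure: str = "jaccard") -> float:
    """Dispatch to the requested measure; Jaccard is the default."""
    try:
        func = MEASURES[measure]
    except KeyError:
        raise ValueError(f"unknown similarity measure: {measure!r}") from None
    return func(n_ab, n_a, n_b, genome_len)
```

With this layout, the code that intersects the two tracks is shared, and each additional metric only needs the four counts it is handed.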