galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Adding (back) tool reviews and using them to rank search? #8987

Open mvdbeek opened 4 years ago

mvdbeek commented 4 years ago

Discussed this with @martenson in the context of removing repository reviews from the toolshed

marten 4:24 PM and then repository reviews: I would like to see that revived. It is more important than it was back then, since we have more tools and more trouble finding the good ones. Not revived in the form it had, that was not usable. But maybe rating tools from the Galaxy side, with the toolshed being the one that gathers all these ‘stars’?

Marius van den Beek 4:42 PM ‘stars’ might also correspond to the number of jobs executed on a given instance. That will always disfavor new ones, though.

marten 4:43 PM that is a path to a set lock-in

Marius van den Beek 4:44 PM I don’t know that I would go by stars, as a user I’d probably be more interested in whether they’re used in a tutorial

marten 4:44 PM what does that give you?

Marius van den Beek 4:44 PM in what context to use a tool and how

marten 4:44 PM so that is a separate problem from searching, right?

Marius van den Beek 4:45 PM the good ones are the ones that get the job done, so for me that’s related. I don’t know, I guess I’m not naive enough to have a good idea about how I would go about discovering tools. Probably the material and methods of papers that do things I want to do. I don’t think I’d sit down in front of Galaxy and explore things with lots of stars.

marten 4:48 PM Ultimately it is a similar metric though? Experience of people that are using it. (It could also be very valuable feedback for tool authors)

Marius van den Beek 4:49 PM but for what purpose? I don’t think the usefulness of a scientific tool can be conveyed by stars

marten 4:49 PM I called it stars, but I have something like a score+micro review in my mind.

Marius van den Beek 4:49 PM I’d like to know 1) if it’s broken 2) what it does

marten 4:50 PM score would be used to weight search results; reviews for feedback to tool authors and admins

Marius van den Beek 4:51 PM Do naive users use the search? If I use Galaxy I just write the tool name or id, and I’d be very annoyed if the ranking did not match my search terms. I guess some extended search might make sense if you want to know what to do with a very specific format.

marten 4:53 PM [image]

Marius van den Beek 4:53 PM I mean reviews would be very cool, but I don’t think they should feed into the search ranking

marten 4:53 PM 80% of searchers refine the search

Marius van den Beek 4:54 PM of course, we don’t require you to hit enter to find a tool

Marius van den Beek 4:54 PM I don’t think that shows that naive users use the search?

marten 4:54 PM I am not trying to prove it here, just throwing data out there

marten 4:55 PM I am not even sure what the ‘naive users’ group is

Marius van den Beek 4:55 PM here I mean users that would benefit from good suggestions

marten 4:56 PM I hope everybody benefits from good suggestions.

Marius van den Beek 4:56 PM I don’t think our suggestions can be good enough to benefit a person that knows what tool they’re looking for

marten 4:56 PM I mean the fact that you are used to always running trimmomatic does not mean that you will never change it. Maybe there will be a better trimmer installed on that galaxy one day.

Marius van den Beek 4:57 PM that’s a good example: trimmomatic works fine, but it is dead slow and you shouldn’t be using it. But how can you rank a search with that information?

marten 4:58 PM A dead slow trimmer would have fewer stars than a faster trimmer? And you are searching on the same keyword.

Marius van den Beek 4:58 PM but then what do stars mean ?

marten 4:58 PM user satisfaction

Marius van den Beek 4:59 PM so it works, why not give it 5 stars?

marten 4:59 PM it is slow?

Marius van den Beek 4:59 PM depends what you do

marten 4:59 PM yep, that’s why 10 ratings will never do it, but 1000 ratings could be useful

Marius van den Beek 4:59 PM you see this is a multidimensional problem that I don’t think can be solved with the fanciest machine learning out there

marten 5:00 PM yet it is an approach that is used daily for searching

Marius van den Beek 5:00 PM yeah, so you have a thousand ratings, and why is that better than 10? Yes … but there is a clear intent when searching, you can tailor the search, etc. I don’t think ranking by user review is helpful in search; it’s more helpful for diagnosing issues, or for actually reading the review

if a user says in a review "dead slow, use this other tool, much better", I think that’s very useful, and one comment like that outweighs hundreds of reviews that just say "it works". I think my main beef with the star system is that it doesn’t convey what the user actually rated

marten 5:06 PM I don’t agree with the weight there at all

Marius van den Beek 5:07 PM that’s ok, just my thoughts on this

marten 5:07 PM You really have no trust in people’s reviews. :D

Marius van den Beek 5:07 PM I do, a lot, I just don’t think the stars are a good system

marten 5:08 PM fair enough, just food for thought. Thanks for all the feedback!

Marius van den Beek 5:08 PM I need to know what you rated. If you browse through Amazon, I think the reviews with a few lines written about an item are super useful

jdavcs commented 4 years ago

Two cents. Ranking search results by user-assigned ratings may quickly skew the ratings: items at the top of the list will be used more, which will lead to more ratings (which will be positive if the tool works as expected), so we’ll get the “rich getting richer” problem (tools with more ratings will get more ratings).

This works in some contexts, of course. But I doubt we’ll ever have enough data to do any kind of meaningful analysis to understand whether (a) someone recommending a tool and (b) someone not recommending a tool are quantifiable criteria for a tool’s quality, or, better, a tool’s applicability to another user’s task.

I think ratings are data, and as such can be helpful. But I doubt that using that data in calculating search rank will make the rankings any more meaningful.
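To make that "rich getting richer" loop concrete, here is a toy simulation; it is purely illustrative, not Galaxy code, and every number in it is made up:

```python
import random

# Toy simulation of the "rich get richer" loop described above. Not Galaxy
# code; all numbers are made up for illustration.
# Ten tools of identical intrinsic quality start with zero ratings. Each
# round a user picks a tool, with strong positional bias toward the top of a
# list ranked by current rating count, and sometimes leaves a rating.

random.seed(42)
NUM_TOOLS = 10
ROUNDS = 10_000
RATE_PROBABILITY = 0.1  # chance a user rates the tool after running it

ratings = [0] * NUM_TOOLS

for _ in range(ROUNDS):
    # Rank tools by current rating count (descending).
    ranked = sorted(range(NUM_TOOLS), key=lambda t: ratings[t], reverse=True)
    # Positional bias: the k-th result is chosen with weight 1 / (k + 1).
    weights = [1.0 / (position + 1) for position in range(NUM_TOOLS)]
    chosen = random.choices(ranked, weights=weights, k=1)[0]
    if random.random() < RATE_PROBABILITY:
        ratings[chosen] += 1

print(sorted(ratings, reverse=True))
# A few tools end up with most of the ratings even though all tools are
# identical, simply because early ratings buy better placement.
```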

martenson commented 4 years ago

It would be one of the variables in the scoring, with a defined weight. I dare say that when we started to use the number of repo downloads in this manner it improved the search results considerably, despite it being another example of 'rich gets richer'.
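For illustration only, the kind of weighted scoring meant here could look roughly like this; it is a sketch with placeholder signal names and weights, not the actual toolshed or Galaxy search code:

```python
import math

# Rough sketch of a weighted tool score of the kind discussed here. Not the
# actual toolshed/Galaxy ranking code; the signal names, weights and
# normalization are placeholders.

def tool_score(text_relevance, downloads, avg_rating, num_ratings,
               w_text=1.0, w_downloads=0.3, w_rating=0.2):
    """Combine the text match with popularity and rating signals."""
    # Dampen raw download counts so very popular tools do not drown out
    # everything else.
    download_signal = math.log1p(downloads)
    # Only trust the average rating once there are at least a few ratings.
    rating_signal = avg_rating if num_ratings >= 5 else 0.0
    return (w_text * text_relevance
            + w_downloads * download_signal
            + w_rating * rating_signal)

# Two hypothetical trimmers that match the same query equally well:
print(tool_score(text_relevance=2.0, downloads=50_000, avg_rating=3.1, num_ratings=400))
print(tool_score(text_relevance=2.0, downloads=2_000, avg_rating=4.6, num_ratings=60))
```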

jdavcs commented 4 years ago

If there is a metric that we use to evaluate search results quality, then we could use it to test whether or not adding that variable (with that weight) improves that quality.
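As a purely hypothetical example of such a metric: something like mean reciprocal rank over a small set of queries with curator-supplied "expected" tools would let us compare the ranking with and without the extra variable. The tool ids and judgments below are made up, and this is not existing Galaxy evaluation code:

```python
# Hypothetical example of a search-quality metric: mean reciprocal rank (MRR)
# over a handful of queries with curator-supplied "expected" tools. Only an
# illustration of how two rankers could be compared; the judgments are made up.

def mean_reciprocal_rank(results_per_query, expected_per_query):
    """results_per_query: one ranked list of tool ids per query.
    expected_per_query: the tool id judged correct for each query."""
    total = 0.0
    for results, expected in zip(results_per_query, expected_per_query):
        if expected in results:
            total += 1.0 / (results.index(expected) + 1)
    return total / len(expected_per_query)

expected = ["bwa_mem", "trim_galore", "featurecounts"]
baseline_ranking = [["bwa_mem", "bowtie2"],
                    ["trimmomatic", "trim_galore"],
                    ["htseq_count", "featurecounts"]]
with_rating_boost = [["bwa_mem", "bowtie2"],
                     ["trim_galore", "trimmomatic"],
                     ["featurecounts", "htseq_count"]]

print(mean_reciprocal_rank(baseline_ranking, expected))   # ~0.67
print(mean_reciprocal_rank(with_rating_boost, expected))  # 1.0
```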

martenson commented 4 years ago

Indeed, if you have ideas how to build that, please share. (xref https://github.com/galaxyproject/galaxy/issues/2272)

jdavcs commented 4 years ago

Sorry, I don't - not for this case. I'll also correct my previous comment: if we had such a metric, I don't think it would help in this particular case.

Here's some speculation not supported by data. I don't think rating a tool highly or rating a tool poorly carries information that is quantifiable for inclusion in calculating the tool's relevance score. A rating could reflect perceived quality, or perceived satisfaction with how intuitive the tool's form is (relevant to a novice), or satisfaction with how flexible a tool is (relevant to an advanced user), or a tool's speed (at some given time), and a lot of other things. I just don't think we can quantify these in a meaningful way.

martenson commented 4 years ago

Like I mentioned to Marius in the conversation above - despite the imperfections I believe this is a useful and fairly common approach for weighting scores.

Taking it to the extreme: what is the basis of the original PageRank? How do you know that a link to a page is a relevant information? Maybe the comment above that link is "never go there, this is the worst page on the whole web". Yet it improves its rank.

> A rating could reflect perceived quality, or perceived satisfaction with how intuitive the tool's form is (relevant to a novice), or satisfaction with how flexible a tool is (relevant to an advanced user), or a tool's speed (at some given time), and a lot of other things.

I believe this is the opposite of a counter example. Improvement in any single one of the categories you mention should imho raise the weighted score of a tool A (and thus rank it higher compared to a previously identical tool B that scores lower in any of them).

In other words, if there are characteristics that are subjectively (and comparably) better in tool A, why not use them in the ranking? What does it matter that we do not know what they are specifically? As long as it is a sizable set of data it seems useful to me. Also, we can easily scale the weighted score impact with the number of reviewers, or require X reviews before we include the ratings in the rank, etc.
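One common way to scale the impact with the number of reviewers is to shrink each tool's average rating toward a global prior (an IMDB-style weighted rating). A rough sketch with made-up numbers, not existing Galaxy code:

```python
# Sketch of scaling the rating impact with the number of reviewers: shrink
# each tool's average rating toward a global prior (a Bayesian / IMDB-style
# weighted rating). Numbers are made up; not existing Galaxy code.

def shrunk_rating(avg_rating, num_ratings, prior_mean=3.0, prior_weight=10):
    """With few ratings the score stays close to the prior; with many it
    approaches the tool's own average."""
    return ((num_ratings * avg_rating + prior_weight * prior_mean)
            / (num_ratings + prior_weight))

print(shrunk_rating(5.0, num_ratings=2))    # ~3.33: two 5-star reviews barely move it
print(shrunk_rating(4.2, num_ratings=500))  # ~4.18: a large sample is taken nearly at face value
```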

p.s. This evolved into a fairly non-Galaxy debate at some point, sorry. There are other good reasons, which I did not mention here, why gathering feedback on tools (which could e.g. include a review) is very useful to tool authors, deployers, & admins.

jdavcs commented 4 years ago

> p.s. This evolved into a fairly non-Galaxy debate at some point,

The debate is good; I’m sure the result of such discussions is a better galaxy. Also, I have no doubt at all that gathering feedback on tools (ratings in particular) is useful data.

As for my doubts: improvement in any of those categories is not necessarily what we are measuring by counting ratings. There may be mutually exclusive criteria (e.g., intuitiveness vs flexibility: ask a Windows user to rate Unix :-). So, if a simple-to-use tool gets rated higher than one with many options, are we measuring the perceived quality of the tool, or the makeup of our user population? And that’s just one issue. My primary doubt here is that I’m not sure we can tell what a tool rating is measuring, and even if it’s overall satisfaction with the tool, I don’t think it’s easily generalizable (as per @mvdbeek, it’s a multidimensional problem - i.e., adding a feature won’t improve the model).

Not to make this an academic debate, but still, with PageRank, a link is interpreted as a measure of relevance. When we search, we don’t search for good or bad stuff; we search for relevant stuff, and good/bad are relative here (what works for some won’t work for others). So, I think, tool ranking != tool relevance. As a counter example, a tool mention in a paper/tutorial/blog/whatever might be considered analogous to a link to a webpage, I think.

Still, this is all speculation. If it’s possible that a rating may be a good measure of tool relevance, I would suggest testing it. We need a way to measure how well our search performs, then make the change and compare the results. And how and what to measure is another fun topic.
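One entirely hypothetical way to get such a measurement: log, for each search, how far down the result list the tool the user eventually ran appeared, and compare that distribution before and after a ranking change. A sketch, assuming such logs could be collected; the numbers below are invented:

```python
from statistics import mean, median

# Entirely hypothetical sketch of measuring search performance from usage
# logs: for each search, record the rank (position in the result list) of
# the tool the user actually ran afterwards, then compare the distribution
# before and after a ranking change. The log extracts below are invented;
# no such instrumentation is assumed to exist in Galaxy.

def summarize(click_ranks):
    """click_ranks: rank of the executed tool per search (1 = top result)."""
    return {
        "searches": len(click_ranks),
        "mean_rank": mean(click_ranks),
        "median_rank": median(click_ranks),
        "top3_fraction": sum(r <= 3 for r in click_ranks) / len(click_ranks),
    }

before = [1, 4, 2, 7, 1, 3, 5, 1, 2, 6]
after = [1, 2, 1, 3, 1, 2, 4, 1, 1, 2]

print(summarize(before))
print(summarize(after))
# If users consistently find their tool nearer the top after the change,
# that is at least weak evidence that the new ranking variable helps.
```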