commonsearch / cosr-results

Common Search sub-project for improving the quality & relevance of search results
https://about.commonsearch.org/developer/result-quality

Bad search scoring - FREE MILF MOM #14

Open tfmorris opened 8 years ago

tfmorris commented 8 years ago

URL of the results:

https://uidemo.commonsearch.org/?g=en&q=new+engla+volleyball

Describe the issue precisely:

I was editing an existing search query and as I was typing, I saw this variant flash the result "FREE MILF MOM" on livesexbook.com as I was going by. Not sure how that relates to even the mangled query.

Also, the search https://uidemo.commonsearch.org/?g=en&q=new+england+volleyball returns:

  1. Volleyball England www.volleyballengland.org
  2. New England Region Volleyball Association | A sub-region of USA... nevolleyball.org

which seems like a backward ordering to me, since #2 has all the words but #1 doesn't.

sylvinus commented 8 years ago

@tfmorris, hands down the best issue so far ;-)

This unfortunate page seems to be the only one currently in the index containing all 3 words: https://explain.commonsearch.org/?g=en&q=new+engla+volleyball

When you type that exact same query into Google, it looks for "england" instead. Not sure what we should do here! Do we want to have an "exact" mode like them? I'm never fond of adding parameters to the search.
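One possible direction, sketched very roughly: rewrite a query term only when it is rare in the index and a much more common term is a close string match. This is not how Google or Common Search actually does it. The `DOC_FREQ` table, the `suggest_term` helper, and both thresholds are made up for illustration; only `difflib.get_close_matches` is a real standard-library call.

```python
import difflib

# Made-up document frequencies (term -> number of indexed pages containing it).
DOC_FREQ = {
    "new": 90000,
    "english": 30000,
    "england": 12000,
    "volleyball": 4000,
    "engla": 1,
}


def suggest_term(term, doc_freq=DOC_FREQ, rare_below=5, cutoff=0.8):
    """Return a close, more common term if `term` is rare; otherwise `term` itself."""
    if doc_freq.get(term, 0) >= rare_below:
        return term  # common enough, keep the user's spelling
    common_terms = [t for t, df in doc_freq.items() if df >= rare_below]
    matches = difflib.get_close_matches(term, common_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else term


def rewrite_query(query):
    """Apply suggest_term to every whitespace-separated query term."""
    return " ".join(suggest_term(t) for t in query.lower().split())
```

With these toy numbers, `rewrite_query("new engla volleyball")` would search for "new england volleyball" without needing an extra parameter, although a real system would still want a way to opt out of the rewrite.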

As you can see in the explain output for your second query, both have all the words: https://explain.commonsearch.org/?g=en&q=new+england+volleyball

So currently, for this case, 2 words in the URL + 2 words in the title + 1 word in the body scores higher than 1 word in the URL + 3 words in the title. Should it be the other way around?
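To make the trade-off concrete, here is a minimal sketch of a per-field weighted match count. The weights are hypothetical and are not the values the ranker actually uses. The body count for the second result is not stated above, so it is assumed to be 0.

```python
# Hypothetical per-field weights; NOT Common Search's real values.
EQUAL_WEIGHTS = {"url": 1.0, "title": 1.0, "body": 1.0}
TITLE_HEAVY_WEIGHTS = {"url": 0.25, "title": 2.0, "body": 1.0}


def field_score(matches, weights):
    """Sum of (field weight * number of query words matched in that field)."""
    return sum(weights[field] * count for field, count in matches.items())


# Matched query words per field, taken from the counts described above.
volleyball_england = {"url": 2, "title": 2, "body": 1}
# Body count for this result is an assumption (not given in the thread).
nevolleyball = {"url": 1, "title": 3, "body": 0}

for name, weights in [("equal", EQUAL_WEIGHTS), ("title-heavy", TITLE_HEAVY_WEIGHTS)]:
    print(
        name,
        field_score(volleyball_england, weights),
        field_score(nevolleyball, weights),
    )
```

With equal weights the first result wins 5.0 to 4.0, which matches the current ordering. Down-weighting URL matches and up-weighting title matches flips it to 5.5 versus 6.25.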

The best way of fixing this would be to recognize "New England" as an entity, but that's not on the short-term roadmap.

OriPekelman commented 8 years ago

Hmm, do we want a safe-search option?

tfmorris commented 8 years ago

I'm not sure that you necessarily need entity recognition to be able to handle New England as a phrase. I suspect that it could be done with n-gram frequency or something else "dumber" than full entity recognition (search isn't my area of expertise). Often "new" is a relatively insignificant adjective, but in this context it has important significance.
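A rough sketch of what the "dumber" n-gram approach could look like: count unigrams and bigrams over a corpus and keep bigrams that occur often enough and have high pointwise mutual information (PMI). PMI is one common choice among several, not something this project is known to use. The toy corpus, the `detect_phrases` name, and the thresholds are all assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for indexed page titles (made up).
TOY_DOCS = [
    "new england volleyball league",
    "new england region volleyball association",
    "volleyball england official site",
    "new volleyball rules for the new season",
    "new england weather",
]


def detect_phrases(docs, min_count=2, min_pmi=1.0):
    """Return the set of bigrams seen at least min_count times with PMI >= min_pmi."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log2(p_ab / (p_a * p_b)) >= min_pmi:
            phrases.add((a, b))
    return phrases
```

On the toy corpus, `detect_phrases(TOY_DOCS)` keeps only `("new", "england")`: "new volleyball" appears just once, so "new" is not glued onto every following word.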

I'm not sure words in the URL should count much at all. If you look at the Google search results: https://www.google.com/search?q=new+england+volleyball there are a whole bunch of top hits that use abbreviations like NERVA, NECVL, etc.

And yes, Safe Search would definitely need to be part of a production service. I'd leave it turned off myself, but it should probably default to being on. Fixing relevancy is more important, though.

tfmorris commented 8 years ago

p.s. In addition to the other English volleyball pages near the top of the list, further down there is:

YMCA www.newdelhiymca.in

While fixing bi-gram identification of "New Delhi" would probably solve this, I'd also argue that if it's not part of a phrase, "new" should be a stop word that's either not considered or given very, very low weight.
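A minimal sketch of that rule, assuming a phrase list is already available from somewhere (for example, bigram statistics). The phrase set, the stop-word set, the `weight_query_terms` name, and the 0.05 weight are all hypothetical.

```python
# Hypothetical inputs: a known-phrase list and a small stop-word list.
KNOWN_PHRASES = {("new", "england"), ("new", "delhi")}
STOP_WORDS = {"new", "the", "of"}


def weight_query_terms(query, phrases=KNOWN_PHRASES, stop_words=STOP_WORDS,
                       stop_weight=0.05):
    """Return (term, weight) pairs, merging known phrases and down-weighting
    stop words that are not part of a phrase."""
    tokens = query.lower().split()
    weighted = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Treat the phrase as one full-weight term.
            weighted.append((" ".join(pair), 1.0))
            i += 2
        else:
            weight = stop_weight if tokens[i] in stop_words else 1.0
            weighted.append((tokens[i], weight))
            i += 1
    return weighted
```

Under these assumptions, "new england volleyball" becomes the terms "new england" and "volleyball" at full weight, while in "new ymca" the word "new" contributes almost nothing, so a page for the New Delhi YMCA would no longer rank on "new" alone.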