algolia / hn-search

Hacker News Search
http://hn.algolia.com

quoted 2-word queries have inconsistent result counts #71

Closed wumpus closed 5 years ago

wumpus commented 8 years ago

I use a browser extension check4change to monitor HN for comments about things I care about, like my employer. I highlight the result count, and if that changes, I get an email.

Queries without quotes work fine, but I get a lot of alerts for 2-word quoted phrases when in fact there have been no new comments about that query.

The particular query is "internet archive" (two words in quotes)

Now as a web search guy, I know exactly why this happens, and I always tell anyone who complains about it "Don't do that, we make the result counts up, really, they're fake, no, don't write research papers based on result counts, really, I mean it, Google mostly makes them up, too, in some circumstances, really, yeah, OK, the last 10 people who emailed me about this didn't believe me either, have a nice day!"

But for you guys and your smallish index, you ought to be able to give a consistent count even if it's a bi-gram that you haven't indexed as such.

wumpus commented 8 years ago

Example false change:

New Text: 23,384 results
Original Text: 23,424 results

Looking at the date-sorted results, there's nothing new.

BTW I used to see this problem for single words, but that stopped happening a couple of years ago.

redox commented 8 years ago

Oh it's not really about making the result counts up; it's just that, for performance reasons, we stop the query execution before it actually ends. Then we do a cross multiplication to estimate the number of hits we would have matched if we had run the execution to the end.

As soon as you use quoted expressions, it consumes more CPU and we're therefore - most probably - stopping "earlier", triggering the cross multiplication more often as well.
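The cross multiplication described above can be illustrated with a minimal sketch. It assumes the engine knows how many records it scanned before stopping and how many records the index holds; the function name, its parameters, and the numbers in the example are hypothetical and are not taken from the Algolia engine.

```python
def estimate_total_hits(hits_found: int, records_scanned: int, records_total: int) -> int:
    """Extrapolate a hit count from a query that stopped early.

    Hypothetical model: hits are assumed to be spread evenly over the index,
    so the count found in the scanned part is scaled up to the whole index.
    """
    if records_scanned <= 0:
        raise ValueError("records_scanned must be positive")
    if records_scanned >= records_total:
        # The query ran to completion, so the count is exact.
        return hits_found
    return round(hits_found * records_total / records_scanned)


# Two runs of the same query that stop at different points (made-up numbers)
# give different estimates even though the index has not changed.
run_a = estimate_total_hits(1171, 500_000, 10_000_000)   # 23420
run_b = estimate_total_hits(1286, 550_000, 10_000_000)   # 23382
```

Under this model, the estimate depends on where execution stopped, which is one way a count could drift between runs with no new matching comments.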

What about filtering on the last few days only? If you do a numericFilters=created_at_i>TIMESTAMP_OF_LAST_FEW_DAYS it will restrict the matching set to only a few thousand records and we'll - probably - finish the execution more often -> again, it will not force anything, so you may still get an estimated number of matching hits.
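A request with such a filter could be built as in the sketch below. The numericFilters=created_at_i>... expression comes from the comment above; the /api/v1/search path, the query parameter name, and the helper function are assumptions made for illustration.

```python
import time
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint: the thread only names /api/v1/search_by_date explicitly.
API_SEARCH = "http://hn.algolia.com/api/v1/search"


def recent_search_url(query: str, days: int, now: Optional[float] = None) -> str:
    """Build a search URL restricted to items created in the last `days` days."""
    now = time.time() if now is None else now
    cutoff = int(now) - days * 86400  # Unix timestamp, in seconds
    params = {"query": query, "numericFilters": f"created_at_i>{cutoff}"}
    return f"{API_SEARCH}?{urlencode(params)}"


# Example: the quoted phrase from this issue, limited to the last 7 days.
url = recent_search_url('"internet archive"', days=7)
```

Because the cutoff moves with the clock, the matching set (and so the count) changes over time even when nothing new is posted.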

Instead, you should go for our http://hn.algolia.com/api/v1/search_by_date endpoint and remember the first hit's ID -> as soon as the first hit is not the one you stored, you should have a look :)
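A minimal sketch of that approach follows, assuming the response is JSON with a "hits" list whose items carry an "objectID" field and that the endpoint accepts a query parameter. The helper names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_BY_DATE = "http://hn.algolia.com/api/v1/search_by_date"


def newest_hit_id(response: dict) -> Optional[str]:
    """Return the ID of the first (most recent) hit, or None if there are no hits."""
    hits = response.get("hits", [])
    return hits[0].get("objectID") if hits else None


def has_new_hit(response: dict, last_seen_id: Optional[str]) -> bool:
    """True when the newest hit differs from the one stored on the previous check."""
    newest = newest_hit_id(response)
    return newest is not None and newest != last_seen_id


def fetch_by_date(query: str) -> dict:
    """Fetch date-sorted results for `query` (performs a network request)."""
    url = f"{SEARCH_BY_DATE}?{urlencode({'query': query})}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

A monitor would call fetch_by_date periodically, pass the result to has_new_hit along with the stored ID, and then save newest_hit_id for the next check. This does not rely on the result count at all.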

Does that make sense @wumpus?

wumpus commented 8 years ago

That sort of short-cut algorithm is exactly what I mean by "making it up". No surprise, all search engines do similar things for the same reasons.

With your suggested workaround, if I used a moving endpoint of a week ago, the result counts will constantly change. If I picked a fixed endpoint of say Jan 1 and move it each year, I am likely to become sad near the end of the year. And all of those solutions are not nearly as lazy as check4change + an exact result :-) -- yeah yeah, no reason you should care that I like being lazy. But this is a somewhat common use-case. I bet you'd get a #1 rank and a lot of usage if you made an easy-to-use "monitor for new comments about my keywords" feature.

redox commented 8 years ago

> That sort of short-cut algorithm is exactly what I mean by "making it up"

alright :)

> if I used a moving endpoint of a week ago, the result counts will constantly change

Yes :/

> I bet you'd get a #1 rank and a lot of usage if you made an easy-to-use "monitor for new comments about my keywords" feature.

So maybe you should go for hnwatcher (https://www.hnwatcher.com/); they're using our API to power their service, btw. I use it to monitor the algolia keyword :)

wumpus commented 8 years ago

Not surprisingly, hnwatcher doesn't have any results when you use quotes. But yeah, if someone has a simple query, I'm sure it works great

JonasBa commented 5 years ago

@wumpus Right now we don't have a good way of making this exhaustive. We could increase the query timeouts inside the engine, but that fix would still be finite: it would only patch this to a certain extent and would probably break for 3-word queries, etc. As it hasn't been reported and we haven't had many use cases where nbHits needed to be exhaustive, I will go ahead and close this issue.