artsy / metaphysics

Artsy's GraphQL API

[Home] Optimise random works queries #449

Open alloy opened 8 years ago

alloy commented 8 years ago

MongoDB does not have built-in ‘random’ functionality. Some possibilities:

  1. Add a ‘random’ attribute to the collection: https://github.com/mongodb/cookbook/blob/master/content/patterns/random-attribute.txt
  2. Skip: https://alan-mushi.github.io/2015/01/18/mongodb-get-random-document-benchmark.html#the-skip-trick-method-4a--4b
  3. Alternatively, we could build something ourselves based on the way the MP code does it now:

       work_set_size = size * 3
       scope = Model.where(…)
       # Clamp the offset so a small scope never produces a negative or zero bound for rand.
       max_offset = [scope.count - work_set_size, 0].max
       work_set = scope.skip(rand(max_offset + 1)).limit(work_set_size)
       work_set.to_a.shuffle.first(size)
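
To make option 3 easier to reason about without a database, here is a minimal sketch that uses a plain array in place of the query scope. The method name `random_window_sample`, the `factor` and `rng` parameters, and the `artworks` data are all hypothetical; array slicing stands in for `skip`/`limit`, so this only illustrates the selection logic, not the query cost.

```ruby
# Array-based stand-in for option 3. `items` plays the role of the query
# scope; slicing replaces skip/limit. All names here are hypothetical.
def random_window_sample(items, size, factor: 3, rng: Random.new)
  work_set_size = size * factor
  # Clamp so a small collection never yields a negative offset.
  max_offset = [items.length - work_set_size, 0].max
  offset = rng.rand(max_offset + 1)          # integer in 0..max_offset
  work_set = items[offset, work_set_size]    # skip(offset).limit(work_set_size)
  work_set.shuffle(random: rng).first(size)
end

artworks = (1..100).to_a
picks = random_window_sample(artworks, 5)    # 5 items from one 15-item window
```

Note that every pick comes from a single contiguous window, so the result is a shuffle of neighbouring records rather than a uniform sample of the whole collection.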

@cavvia @joeyAghion @mzikherman Do you have any thoughts on this?

cavvia commented 8 years ago

I'm just wondering what use case requires the selection of random works? Are you interested in rotating content? It doesn't sound like an optimal ranking strategy for most contexts I can think of.

Sorting by descending merchandisability, iconicity, or creation date are some of our other options. The v1/filter/artworks endpoint also supports a -decayed_merch sort which combines artwork freshness and merchandisability score.

alloy commented 8 years ago

Aye, yeah, it’s about rotation, but specifically on each reload, as discussed during the home personalisation talks: https://docs.google.com/document/d/1QJ5NNK_LqVwomlqIg3MOojtRxgTxg9ky4q6ynfPrrUg/edit#heading=h.xn6uqaz56o0r

joeyAghion commented 8 years ago

I agree that we probably don't want truly random results, but we might want a variety from the "best" likely candidates (e.g., using some of the sorts @cavvia mentions). If we can accomplish this with existing page/size parameters to skip a small number of results, I wouldn't be too worried about performance. If we really want to skip to a random result in the collection (i.e., skipping a very large number), I would be.

alloy commented 8 years ago

@joeyAghion I’m unsure which option you’re referring to as the preferred one. Is it option 3, where we implement a version similar to what MP does on Gravity, or are you suggesting we fetch three times as much data from Gravity as the client actually requests?

joeyAghion commented 8 years ago

I was just suggesting that we sort by something meaningful and then index randomly into the early pages (e.g., fairs?size=5&page=[1-10], similar to (2) but using existing parameters). But now that I look at the examples you link to, I realize there may not be enough data to page significantly. Maybe (3) [or basically what MP does now] is best in the short term. Returning 60 results for each row that only reveals 5-6 is a lot though! Could we decrease that to ~20?
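
A rough sketch of the "index randomly into the early pages" idea, assuming the caller already knows (or can cheaply fetch) the total result count. The helper name `random_early_page_params` and its parameters are hypothetical; only `size` and `page` come from the example above. The page is clamped to the pages that actually exist, which covers the case where there is not enough data to page significantly.

```ruby
require "uri"

# Hypothetical helper: choose a random page among the first `max_page` pages,
# clamped to the number of pages available for `total_count` results.
def random_early_page_params(total_count:, size: 5, max_page: 10, rng: Random.new)
  available_pages = [(total_count.to_f / size).ceil, 1].max
  page = rng.rand(1..[available_pages, max_page].min)
  { size: size, page: page }
end

params = random_early_page_params(total_count: 23, size: 5)
query = URI.encode_www_form(params)   # a string of the form "size=5&page=N"
```

Because the request stays within the first few pages of a sorted result, the skip is always small, which is the performance property being discussed.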

mzikherman commented 8 years ago

Yeah, I think the sentiments from @cavvia and @joeyAghion sound like the right track.

We've done the random attribute thing, as I think that may be the only way to really get a random shuffling from a biggish collection. A potentially large arbitrary skip/offset can get super slow, and you wind up not really using indexes properly.
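
For comparison, a minimal in-memory sketch of the random-attribute pattern: each document stores a precomputed `random` value, and a query picks a fresh pivot and reads forward from it, wrapping around at the end. The `Doc` struct and `sample_by_random_attribute` are hypothetical illustrations; in a real collection the sort would come from an index on the `random` field rather than from sorting in memory.

```ruby
# In-memory stand-in for a collection whose documents carry a stored
# `random` value. All names here are hypothetical.
Doc = Struct.new(:id, :random)

def sample_by_random_attribute(docs, size, rng: Random.new)
  sorted = docs.sort_by(&:random)            # an index on `random` would provide this order
  pivot = rng.rand                           # float in [0, 1)
  # First document at or above the pivot; wrap to the start if none is found.
  start = sorted.index { |d| d.random >= pivot } || 0
  sorted.rotate(start).first(size)
end

docs = (1..10).map { |i| Doc.new(i, rand) }
picks = sample_by_random_attribute(docs, 3)
```

The lookup is a range seek from the pivot rather than a large offset, which is why this pattern avoids the slow-skip problem.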

I think (3) sounds like the most reasonable option, pretty much like what @joeyAghion was suggesting.