Automattic / mongoose

MongoDB object modeling designed to work in an asynchronous environment.
https://mongoosejs.com
MIT License
26.93k stars 3.84k forks source link

Convert populate to use $lookup #3683

Closed niftylettuce closed 8 years ago

niftylettuce commented 8 years ago

Before:

lib.db.model('Stock').find(stockQuery).populate('product') // ...

After:

            lib.db.model('Stock').collection.aggregate([
              { $match: stockQuery },
              {
                $lookup: {
                  from: 'products',
                  localField: 'product',
                  foreignField: '_id',
                  as: 'product'
                }
              },
              {
                $project: {
                  product: {
                    $arrayElemAt: [ '$product', 0 ]
                  }
                }
              },
              { $sort: sort },
              { $limit: req.query.limit },
              { $skip: req.offset }
            ], callback);

Can we normalize this functionality for population using aggregate and $lookup into a rewrite of the populate function (so it would still function the same, but it'd use $lookup instead?

niftylettuce commented 8 years ago

By default $lookup returns an array of results, and therefore that is why for a single document DB Ref I've put in the $arrayElemAt operator.

niftylettuce commented 8 years ago

Note that $lookup is only available in MongoDB 3.2

vkarpov15 commented 8 years ago

Its worth considering, but not for the forseeable future because I'd imagine the majority of mongoose users don't use mongodb 3.2. Also, the only rationale that I could think of for doing this would be "it improves performance considerably." While that might be the case, we'd have to run a benchmark to see.

macrostart commented 8 years ago

+1 for using $lookup internally.

Also, how about implementing this conditionally? This way mongoose users who work with MongoDB 3.2+ could already benefit from the (suspected) performance improvements.

vkarpov15 commented 8 years ago

Possible, but it's not really a high priority atm. The difficult part is that $lookup only works with the aggregation framework, so .find().populate() needs to get converted into an aggregation query, which seems unwieldy.

avchugaev commented 8 years ago

+1 for $lookup.

santimacia commented 8 years ago

Is there real benefits of using $lookup insteal of populate? that's the point, however mongodb3.2 it's still to new for the majority of users.

vkarpov15 commented 8 years ago

Performance in some cases - populate() requires 2 distinct queries to mongodb and thus 2 sequential round trips over the network, $lookup only takes 1. However, there's ways to structure your schema so populate-like operations can be done in parallel, so $lookup doesn't help as much as you might think.

santimacia commented 8 years ago

Another thing to consider about $lookup within arrays

http://stackoverflow.com/questions/34967482/lookup-on-objectids-in-an-array

vkarpov15 commented 8 years ago

Ah yeah that would be tricky to implement - we'd need to be smart enough to build an aggregation pipeline that includes $unwind for every nested array.

TrejGun commented 8 years ago

+1

pswoodworth commented 8 years ago

+1 on principle, but does anyone have any good benchmarking data? Good news, though: it looks like the issue with $lookup on arrays will be resolved as of mongo 3.3.4: https://jira.mongodb.org/browse/SERVER-22881

faller commented 8 years ago

+1

sidgate commented 8 years ago

populate works with multiple database. How would $lookup work then?

vkarpov15 commented 8 years ago

@sidgate that's a great point, it actually would not work at all and that's a pretty core feature for mongoose. I'm gonna close this issue because neither mongodb 3.2 nor 3.4 are going to support cross-db $lookup...

niftylettuce commented 8 years ago

For anyone interested in an ES7 version of this, here you go!

// get paginated result of job postings using $lookup
  // <https://github.com/Automattic/mongoose/issues/3683>
  let [ positions, itemCount ] = await Promise.all([
    Positions.aggregate([
      { $match },
      {
        $lookup: {
          from: 'users',
          localField: 'user',
          foreignField: '_id',
          as: 'user'
        }
      },
      {
        $project: {
          user: {
            $arrayElemAt: [ '$user', 0 ]
          }
        }
      },
      { $sort },
      { $limit: ctx.query.limit },
      { $skip: ctx.paginate.skip  }
    ]),
    Positions.count($match).exec()
  ]);

Per https://github.com/crocodilejs/crocodile-node-mvc-framework

I'm also using koa-ctx-paginate with this at https://github.com/koajs/ctx-paginate

vkarpov15 commented 8 years ago

Good stuff @niftylettuce but async/await is not ES6 :p

niftylettuce commented 8 years ago

Typo 😆 🚀

vkarpov15 commented 8 years ago

Hopefully it'll make it into ES2017 :rocket:

kintel commented 8 years ago

This doesn't change the complications of implementing this., but just a comment on how $lookup could "improve performance considerably":

Using populate on a cursor tends to be unusably slow due to the amount of single queries being performed.

One way around this could be to build a batched cursor, populate per batch, and then unpack each document in that batch and provide it as a readable stream.

vkarpov15 commented 8 years ago

That could work. Unfortunately, $lookup still executes a separate query for each individual doc, so you're saving a round trip by using $lookup but using approximately the same db processing time and hogging the connection. I like the cursor batch idea though :+1:

lykmapipo commented 7 years ago

@vkarpov15 What is the state of allow population by lookup. Can we use it whenever possible and revert to normal population whenever necessary.

vkarpov15 commented 7 years ago

@lykmapipo not planning on doing this currently because $lookup doesn't support lookups between different databases, so this would break the populate API. Yeah we could fall back to doing conventional populate in cross-db cases but that would add a lot of code and be more work than it's really worth IMO.

niftylettuce commented 7 years ago

Can we revisit this? Why do we need to do cross database lookups?

warrenday commented 7 years ago

Seems like closing this issue due to cross-db issues is a cheap excuse, the feature would be excellent.

niftylettuce commented 7 years ago

Per my chat with @vkarpov15 on Slack, can someone in the community create a performance test with these requirements to compare $lookup vs populate?

6 concurrent $lookups with 2 fast findOne() queries VS. 6 populates with 2 fast findOne() queries

zbjornson commented 7 years ago

Here's a handful of benchmarks with MongoDB 3.4 (v3.2 was similar). Numbers are the median of 15 replicates.

findOne

query 1x concurrency 6x 15x
primary query 0.911 ms 9.226 19.55
primary query, then 2nd query for relationship (~.populate()) 1.201 ms 5.782* 11.44
$lookup 0.598 ms 1.763 4.575

*Note: not sure why this is faster than first row; might be cached at some layer, but it's consistent.

find returning array of 140 docs

query 1x concurrency
primary query 3.401 ms
primary query, then 2nd query for relationship (~.populate()) 6.872 ms
$lookup 8.705 ms

Code: https://gist.github.com/zbjornson/746a477ee715685fb60963d9e3c5d2d1. The foreignField was indexed, and I verified that the index was used. The collection had about 60k documents in it. I used the mongodb driver directly for the benchmark to show the best possible case.

Looks like it helps with latency for individual documents, but is detrimental for arrays (low-throughput). In the MongoDB profiler it looks like each $lookup goes out as a separate query, so that makes sense.

tunatoksoz commented 7 years ago

There are things that I found that works with populate but not with lookup. I think a model that looks like the following doesn't work well with populate (i can't find the $lookup i wrote for it, though).

Schema({ nestedObject: { item1: { name: 'tuna' }, list: [{ type: ObjectId, ref: 'AnotherModel' }] } });

I will try to find a more concrete example, but that was my recollection of my attempts at replacing a nested populate.

vkarpov15 commented 7 years ago

@tunatoksoz yeah if you want to use $lookup with an array you need to add a $unwind.

@zbjornson yeah that doesn't surprise me too much. $lookup does end up as a separate query for each doc because of how aggregation works, so the performance impact isn't quite as cut-and-dried. I'd imagine the concurrency impact is even bigger on .find()...

Ncifra commented 5 years ago

Seems like closing this issue due to cross-db issues is a cheap excuse, the feature would be excellent.

Just wanted to weigh in that the cross-db feature has been a time saver for us, and I think it's one of those things that really makes mongoose an ODM, an extra layer of complexity with additional features from the base driver.

vkarpov15 commented 5 years ago

For future reference, here's an example of where $lookup gets super slow:

  const db = await mongodb.MongoClient
    .connect('mongodb://localhost:27017/test', {
      useNewUrlParser: true,
      poolSize: 1 // Only 1 operation can run at a time
    })
    .then(client => client.db())

  await db.dropDatabase()

  for (let i = 0; i < 5000; ++i) {
    await db.collection('Foo').insertOne({ _id: i })
  }
  console.log('Inserted foo docs')

  for (let i = 0; i < 5000; ++i) {
    await db.collection('Bar').insertOne({ _id: i, fooId: i })
  }
  console.log('Inserted bar docs')

  // This aggregation takes 150ms on my laptop, That's because this query executes one full
  // collection scan of 'Bar' for every document in 'Foo', or 20000 collection scans
  const promise = db.collection('Foo').aggregate([
    { $lookup: { from: 'Bar', localField: '_id', foreignField: 'fooId', as: 'bars' } }
  ]).toArray()

Versus what populate does:

  let startTime = Date.now()
  const foos = await db.collection('Foo').find().toArray()
  // "Found foos in 28ms"
  console.log(`Found foos in ${Date.now() - startTime}ms`)

  startTime = Date.now()
  const bars = await db.collection('Bar').find({ fooId: { $in: foos.map(f => f._id) } }).toArray()
  // "Found bars in 15ms"
  console.log(`Found bars in ${Date.now() - startTime}ms`)
  const barMap = bars.reduce((map, b) => map.set(b.fooId.toString(), b), new Map())

  for (const foo of foos) {
    foo.bar = barMap.get(foo._id.toString())
  }

$lookup compounds index misses, populate() minimizes their impact because it only executes one query for each populate(), regardless of the number of docs being populated.