Filter _changes by key to speed up replication

redgeoff commented 7 years ago

Expected Behavior

From what I understand, using a view for filtered replication is a lot faster than using a filter function because, once a view is built, it can be reused. I believe that filter functions, on the other hand, need to be executed each time there is a replication and are not "cached." So, I've found that as a general rule, you should try to use a view whenever you need to filter replication.

The issue is that in order for me to swap out some of my filter functions with views for filtered replication, I'd need to be able to query the _changes endpoint using a key parameter. Is there a reason that a key parameter cannot be added to the _changes API? (Of course, if this is added, then the key parameter could also be added to the PouchDB Replication API. This enhancement would lead to significant speed increases in CouchDB/PouchDB initial replication for many data sets.)

Current Behavior

The _view API has a key parameter and it functions great when you want to get a list of all the latest docs. No such key parameter exists for the _changes endpoint.

Steps to Reproduce

The following is a simple example that illustrates the need. Assume a database named people with the design doc:

{
  _id: '_design/views',
  views: {
    students_by_class: {
      map: [
        'function(doc) {',
        'if (doc.type === "student") {',
        'emit(doc.class, null);',
        '}',
        '}'
      ].join(' ')
    }
  }
}

And assume the following docs:

{
  type: 'teacher',
  name: 'Teacher 1',
  class: 'Calculus'
},
{
  type: 'student',
  name: 'Student 1',
  class: 'Calculus'
},
{
  type: 'student',
  name: 'Student 2',
  class: 'Algebra'
},
{
  type: 'student',
  name: 'Student 3',
  class: 'Calculus'
}

Currently, I can only query for all students and not for those in a specific class: people/_changes?filter=_view&view=views/students_by_class. Instead, it would be great to be able to issue people/_changes?filter=_view&view=views/students_by_class&key=Calculus because then I could use a view instead of a filter function in my replication.
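To make the request concrete, here is a minimal sketch in plain JavaScript that runs the same map logic over the sample docs and then keeps only the rows for one key. It only simulates the idea locally: the `_id` values, `runMap`, and `changesForKey` are hypothetical names added for illustration, and the `key` parameter on _changes does not exist in CouchDB today.

```javascript
// Sample docs from above (the _id values are added here for illustration only).
const docs = [
  { _id: 't1', type: 'teacher', name: 'Teacher 1', class: 'Calculus' },
  { _id: 's1', type: 'student', name: 'Student 1', class: 'Calculus' },
  { _id: 's2', type: 'student', name: 'Student 2', class: 'Algebra' },
  { _id: 's3', type: 'student', name: 'Student 3', class: 'Calculus' }
];

// Same logic as the students_by_class map function, with emit passed in.
function map(doc, emit) {
  if (doc.type === 'student') {
    emit(doc.class, null);
  }
}

// Hypothetical helper: collect the rows the map function emits.
function runMap(allDocs) {
  const rows = [];
  allDocs.forEach(function (doc) {
    map(doc, function (key, value) {
      rows.push({ id: doc._id, key: key, value: value });
    });
  });
  return rows;
}

// Hypothetical helper: the doc ids that filter=_view plus key=... would select.
function changesForKey(allDocs, key) {
  return runMap(allDocs)
    .filter(function (row) { return row.key === key; })
    .map(function (row) { return row.id; });
}

console.log(changesForKey(docs, 'Calculus')); // [ 's1', 's3' ]
```

Without the key, the view filter would select all three students, which is the current behavior described above.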

Context

Not having the key parameter makes replication a lot slower.

Other

I'd be willing to take a stab at modifying the CouchDB source to add this feature, but I just want to confirm that it is a good idea and that I'm not oversimplifying the enhancement.

nickva commented 7 years ago

Maybe another option is to use a Mango based selector as described here:

https://blog.couchdb.org/2016/08/15/feature-replication/comment-page-1/ (see A New Way to Filter section).

Then each replication document would specify a selector to pick from a specific class only.
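As a rough sketch of that suggestion, a replication document with a selector for one class might be built like this. The helper name, the `_id` scheme, and the source/target URLs are placeholders invented for illustration; only the `source`, `target`, and `selector` fields are part of the replication document format.

```javascript
// Hypothetical helper that builds a _replicator document for one class.
// The URLs and _id scheme are placeholders, not taken from this thread.
function replicationDocForClass(className) {
  const suffix = className.toLowerCase();
  return {
    _id: 'students_' + suffix,
    source: 'http://localhost:5984/people',
    target: 'http://localhost:5984/people_' + suffix,
    // Mango selector: only student docs in the given class are replicated.
    selector: {
      type: 'student',
      class: className
    }
  };
}

console.log(JSON.stringify(replicationDocForClass('Calculus'), null, 2));
```

Saving one such document per class into the _replicator database would give one filtered replication per class.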

By default the basic view based filtering doesn't end up using the view data and just re-uses the map function in the most naive way.

There is a feature which allows using the view. It was implemented here:

https://github.com/apache/couchdb-couch-mrview/pull/2

I think it involves adding these options to the view: "options": {"seq_indexed": true, "keyseq_indexed": true}. But I have never used it and am not sure how it works.

redgeoff commented 7 years ago

@nickva great point on the selector and that is exactly what we have already done as an interim solution and it appears to be a little faster. Ultimately though, using a view with a cached index would be faster, no?

So, in the case of the view endpoint: people/_design/views/_view/students_by_class?key=Calculus, isn't CouchDB using the view data? I haven't dug into the code, but from a high level testing perspective, this type of query appears to run quite fast.

eiri commented 6 years ago

For what it's worth, seq and kseq indexing in mrview is broken right now (#592), so in Couch 2.1 the so-called fast_view for replication is short-circuited to regular view filtering, which evaluates each doc one at a time and does not actually use the built index.

kocolosk commented 6 years ago

@redgeoff the trick is that replication needs to be incrementally resumable, so if you want to build an index to drive filtered replication the server also needs to maintain a second internal index to get a list of all the changed and deleted keys in that index since your last request. Hopefully that makes sense. That's the PR @nickva pointed you to, although as @eiri mentioned it's not currently functional.

You're certainly right that the normal _view API uses a dedicated index and is quite fast, but in the default configuration a view has no way to efficiently deliver you a subset of recently changed rows.
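A toy model may help show why the extra index is needed. The sketch below is not how CouchDB stores anything (real indexes are on-disk B-trees); `buildKeySeqIndex` and `changesSince` are hypothetical names, and the update log is invented. It keeps entries ordered by (key, seq) and records when a doc stops emitting a key, so a client can ask "what changed for this key since my last seq?".

```javascript
// Toy update log: each write gets an increasing sequence number (seq).
const updates = [
  { seq: 1, id: 's1', key: 'Calculus' },
  { seq: 2, id: 's2', key: 'Algebra' },
  { seq: 3, id: 's3', key: 'Calculus' },
  { seq: 4, id: 's2', key: 'Calculus' } // Student 2 switches class
];

// Build an index ordered by (key, seq), including "removed" markers for
// keys a doc no longer emits. A plain key-ordered view has neither piece.
function buildKeySeqIndex(log) {
  const entries = [];
  const currentKey = new Map(); // doc id -> key it currently emits
  log.forEach(function (u) {
    const previous = currentKey.get(u.id);
    if (previous !== undefined && previous !== u.key) {
      // The doc left its old key; record it so a resumed feed can see that.
      entries.push({ key: previous, seq: u.seq, id: u.id, removed: true });
    }
    entries.push({ key: u.key, seq: u.seq, id: u.id, removed: false });
    currentKey.set(u.id, u.key);
  });
  return entries.sort(function (a, b) {
    if (a.key !== b.key) { return a.key < b.key ? -1 : 1; }
    return a.seq - b.seq;
  });
}

// Linear scan for brevity; a real index would do a range read instead.
function changesSince(index, key, since) {
  return index.filter(function (e) { return e.key === key && e.seq > since; });
}

const index = buildKeySeqIndex(updates);
console.log(changesSince(index, 'Algebra', 2));  // s2 removed from Algebra at seq 4
console.log(changesSince(index, 'Calculus', 3)); // s2 added to Calculus at seq 4
```

A client replicating the Algebra class that last synced at seq 2 learns that Student 2 left, which a lookup by key alone could not tell it.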

redgeoff commented 6 years ago

Ah interesting, so this is more complicated than I had hoped :(. I guess we'll have to wait until #592 is fixed before a key parameter can be added and actually used.

wohali commented 6 years ago

@nickva can you confirm that, for the Mango case, a pre-built Mango index is similarly not leveraged to speed up filtered replication for the same reasons as raised in #592? I would expect this is the case, since Mango uses mrview indices, correct?

nickva commented 6 years ago

That's true. Currently the Mango selector passed to the changes feed doesn't use any Mango indices. It is a pure filter passed in with the request body. You could think of it, for example, as a more optimized JS filter function which doesn't live in a source _design document and doesn't need to do the extra work of going to and from an external JavaScript engine process.

igorski89 commented 6 years ago

@nickva is it theoretically possible to speed up selectors passed to the replicator by using indexes? I have the same issue as @redgeoff posted and am rewriting my map functions as Mango selectors. Since the map functions are often written something like:

function (doc) {
   if (doc.type == 'foo' && doc.safe && ...) {
      emit(...)
   } 
}

using partial indexes would make a lot of sense overall.

igorski89 commented 6 years ago

@nickva I guess I know the answer: to build the _changes feed, the changes themselves have to be indexed somewhere, meaning an option similar to seq_indexed must be implemented for selectors, which pretty much brings us back to point 1.

nickva commented 6 years ago

@evsukov89 Yes, you are exactly right!

oliverjanik commented 6 years ago

I believe we're seeing this too. Our JS filter takes a parameter from the request.

We'd love to try mango filter, but I can't find how to access a request parameter from mango.

e.g. "$gt": request.startDate

willholley commented 6 years ago

@oliverjanik when using a Mango filter for changes, the entire filter is specified in the request body, so you can just adjust the filter as you need instead of passing custom parameters. For example:

POST /mydb/_changes?filter=_selector
{
    "selector": { "startDate": { "$gt": "2018-07-01" } }
}
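In client code, the request parameter simply becomes a function argument used while building the selector. A minimal sketch, assuming a hypothetical `changesRequest` helper and a placeholder base URL:

```javascript
// Hypothetical helper: builds a fetch()-style request for a selector-filtered
// changes feed. The base URL and the startDate field are illustrative.
function changesRequest(baseUrl, dbName, startDate) {
  return {
    url: baseUrl + '/' + encodeURIComponent(dbName) + '/_changes?filter=_selector',
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // The value that used to be a request parameter goes into the selector.
      body: JSON.stringify({
        selector: { startDate: { $gt: startDate } }
      })
    }
  };
}

const req = changesRequest('http://localhost:5984', 'mydb', '2018-07-01');
console.log(req.url);
console.log(req.options.body);
// To send it: fetch(req.url, req.options).then(function (res) { return res.json(); })
```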

janl commented 6 years ago

Closing as there isn’t anything actionable in here.

redgeoff commented 6 years ago

@janl I wouldn’t recommend closing this as I think this is a vital feature. I’d even argue that it is a design flaw in CouchDB. Using mango selectors for this is slow and would probably be much too slow for a large dataset. If this remains closed is there a list that this will be added to?

I understand that this would take time to implement and that someone has to do the work :). I just don’t want to see something like this get forgotten.

wohali commented 6 years ago

@redgeoff wouldn't the fix just be "make Mango faster" here?

redgeoff commented 6 years ago

Hehe... well, that may be a nice enhancement, but what I really think is:

1. #592 needs to be fixed so that indexed views can be used for filtered replication
2. I think some sort of new indexing construct is needed so that you can create an index based on key so that you can specify a key when using a view during filtered replication. I'm not entirely sure of the inner workings of CouchDB so I can't be much more specific here. Sorry.

Currently the only way to avoid the latency with this slow filtered replication is to manually shard the database, e.g. in my example above, create a database per class type, but this can get very unmanageable when the scenario gets more complicated. (Creating a new database allows you to partition the data and makes the replication faster).

igorski89 commented 6 years ago

@wohali let's take an example: you have an app that uses CouchDB as the main data storage. It is intended to be used with offline-first clients (PouchDB/Couchbase Lite 1.x). Let’s assume that overall there are about 1k docs to replicate.

Generally speaking you have two options:

1. on the server create a new “proxy” database, replicate these 1k docs there and finally use it as a source for offline clients to replicate from
2. use filtered replication

Analyzing the consequences of each solution:

1. added complexity, data duplication, additional server resource usage. In my understanding per-user db was at least partially made to address this problem, but I think the shortcomings are known by now
2. since the _changes feed is not indexed, what is the CPU+IO cost of filtering 1k docs over the db of 1M docs? what if you have 1k users, 10k, 100k?
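The cost question above can be roughed out with simple arithmetic. This is a back-of-envelope sketch only: it assumes every client does an initial replication from seq 0 and that the server evaluates the filter once per doc in the database for each client, ignoring batching, caching, and incremental syncs. `filterEvaluations` is a hypothetical name.

```javascript
// Back-of-envelope estimate: one filter evaluation per doc, per client.
function filterEvaluations(totalDocs, clients) {
  return totalDocs * clients;
}

const totalDocs = 1e6; // 1M docs in the database, as in the example above
[1e3, 1e4, 1e5].forEach(function (clients) {
  console.log(clients + ' clients: ' +
    filterEvaluations(totalDocs, clients) + ' filter evaluations');
});
```

Under these assumptions the work grows with the size of the whole database rather than with the 1k docs each client actually needs.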

@redgeoff we have a patch of indexed changes implementation based on views which works well for us, including filtering by key, etc. We can share the details if you wish.

I do however agree this approach has a few shortcomings: view and _changes outputs are not exactly idempotent. For example: if the view no longer emits the doc, how should this be reflected in the _changes feed?

@janl @wohali I’m sorry for my ignorance, but would you mind pointing me to the page explaining the process of submitting proposals for couchdb?

wohali commented 6 years ago

@evsukov89 Sign up to and then email dev@couchdb.apache.org, instructions are on https://couchdb.apache.org/

I haven't read it closely, but I believe you may have missed some specific failure scenarios in your approach. PRs with approaches to filter a _changes feed by a view's secondary index in the past have failed due to invalid logical assumptions that I'm failing to remember just now - beyond just deletes not surfacing.

igorski89 commented 6 years ago

@wohali thank you.

Yes, as I said, it works for us because we have a very narrow, controlled use case. I do not believe it should go into master like this, but I would still like to start a discussion in this direction.

apache / couchdb