Closed redgeoff closed 6 years ago
Maybe another option is to use a Mango based selector as described here:
https://blog.couchdb.org/2016/08/15/feature-replication/comment-page-1/ (see A New Way to Filter section).
Then each replication document would specify a selector to pick from a specific class only.
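As a rough sketch of that idea, a per-class replication document could be built like this. The `selector` field on `_replicator` documents is the feature the linked post describes; the helper name, the `class` field, and the database URLs are assumptions made up for illustration.

```javascript
// Hypothetical helper: builds a _replicator document that filters with a Mango selector.
// The `class` field and the database URLs are illustrative assumptions, not from the thread.
function buildReplicationDoc(source, target, className) {
  return {
    _id: `students-${className.toLowerCase()}`,
    source,
    target,
    continuous: true,
    // Only documents matching this selector are replicated.
    selector: { class: className },
  };
}

const replicationDoc = buildReplicationDoc(
  'http://localhost:5984/people',
  'http://localhost:5984/people_calculus',
  'Calculus'
);
console.log(JSON.stringify(replicationDoc, null, 2));
```

One such document would be saved to the `_replicator` database per class.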
By default the basic view based filtering doesn't end up using the view data and just re-uses the map function in the most naive way.
There is a feature which allows using the view. It was implemented here:
https://github.com/apache/couchdb-couch-mrview/pull/2
I think it involves adding these options to the view: "options": {"seq_indexed": true, "keyseq_indexed": true}. But I have never used it and am not sure how it works.
@nickva great point on the selector and that is exactly what we have already done as an interim solution and it appears to be a little faster. Ultimately though, using a view with a cached index would be faster, no?
So, in the case of the view endpoint people/_design/views/_view/students_by_class?key=Calculus, isn't CouchDB using the view data? I haven't dug into the code, but from a high-level testing perspective, this type of query appears to run quite fast.
For what it's worth, seq and kseq indexing in mrview is broken right now (#592), so in Couch 2.1 the so-called fast_view for replication is short-circuited to regular view filtering, which evaluates each doc one at a time and does not actually use the built index.
@redgeoff the trick is that replication needs to be incrementally resumable, so if you want to build an index to drive filtered replication the server also needs to maintain a second internal index to get a list of all the changed and deleted keys in that index since your last request. Hopefully that makes sense. That's the PR @nickva pointed you to, although as @eiri mentioned it's not currently functional.
You're certainly right that the normal _view API uses a dedicated index and is quite fast, but in the default configuration a view has no way to efficiently deliver you a subset of recently changed rows.
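To make the "second internal index" point concrete, here is a toy in-memory model. It is a sketch of the concept only, not CouchDB's storage format or the code in the linked PR, and every name in it is hypothetical. Alongside the usual docId-to-key rows it records, per key, the sequence at which each doc last entered or left that key, so a resumed request can ask for what changed under one key since a given sequence.

```javascript
// Toy in-memory model; CouchDB's on-disk B-trees work differently.
class ViewWithSeqIndex {
  constructor(mapFn) {
    this.mapFn = mapFn;      // doc -> emitted key, or undefined for "no row"
    this.seq = 0;            // update sequence, bumped on every write
    this.rows = new Map();   // primary view index: docId -> current key
    this.keySeq = new Map(); // second index: key -> Map(docId -> { seq, removed })
  }

  // Record that `id` entered (removed = false) or left (removed = true) `key` at this.seq.
  mark(key, id, removed) {
    if (!this.keySeq.has(key)) this.keySeq.set(key, new Map());
    this.keySeq.get(key).set(id, { seq: this.seq, removed });
  }

  update(doc) {
    this.seq += 1;
    const oldKey = this.rows.get(doc._id);
    const newKey = doc._deleted ? undefined : this.mapFn(doc);
    if (oldKey !== undefined && oldKey !== newKey) {
      // The doc no longer emits its old key: followers of that key must hear about it.
      this.mark(oldKey, doc._id, true);
    }
    if (newKey !== undefined) {
      this.rows.set(doc._id, newKey);
      this.mark(newKey, doc._id, false);
    } else {
      this.rows.delete(doc._id);
    }
    return this.seq;
  }

  // Changes under one key after `since`, ordered by sequence.
  changesSince(key, since) {
    const entries = this.keySeq.get(key) || new Map();
    return [...entries]
      .filter(([, entry]) => entry.seq > since)
      .map(([id, entry]) => ({ seq: entry.seq, id, removed: entry.removed }))
      .sort((a, b) => a.seq - b.seq);
  }
}

const index = new ViewWithSeqIndex((doc) =>
  doc.type === 'student' ? doc.class : undefined
);
index.update({ _id: 'alice', type: 'student', class: 'Calculus' });
index.update({ _id: 'bob', type: 'student', class: 'Physics' });
index.update({ _id: 'alice', type: 'student', class: 'Physics' });
console.log(index.changesSince('Calculus', 1));
```

The last call reports that alice left the Calculus key, which the primary index alone cannot tell you without rescanning.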
Ah interesting, so this is more complicated than I had hoped :(. I guess we'll have to wait until #592 is fixed before a key parameter can be added and actually used.
@nickva can you confirm that, for the Mango case, a pre-built Mango index is similarly not leveraged to speed up filtered replication for the same reasons as raised in #592? I would expect this is the case, since Mango uses mrview indices, correct?
That's true. Currently the Mango selector passed to the changes feed doesn't use any Mango indices. It is a pure filter passed in with the request body. You could think of it, for example, as a more optimized JS filter function that doesn't live in a source _design document and doesn't need to do the extra work of going to and from an external JavaScript engine process.
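A minimal sketch of what "pure filter" means in practice: every change row is tested against the selector one by one, with no index consulted. The matcher below is a toy that handles only plain equality, `$eq`, and `$gt`; it is not CouchDB's Mango implementation, and the function names and sample docs are hypothetical.

```javascript
// Toy subset of Mango matching: field equality plus $eq and $gt on top-level fields.
function matches(selector, doc) {
  return Object.entries(selector).every(([field, condition]) => {
    const value = doc[field];
    if (condition !== null && typeof condition === 'object' && !Array.isArray(condition)) {
      return Object.entries(condition).every(([operator, argument]) => {
        switch (operator) {
          case '$eq': return value === argument;
          case '$gt': return value > argument;
          default: throw new Error(`unsupported operator: ${operator}`);
        }
      });
    }
    return value === condition;
  });
}

// Applies the selector to every row: cost grows with the number of changes, not matches.
function filterChanges(changes, selector) {
  return changes.filter((row) => matches(selector, row.doc));
}

const sampleChanges = [
  { id: 'alice', doc: { _id: 'alice', type: 'student', class: 'Calculus' } },
  { id: 'bob', doc: { _id: 'bob', type: 'student', class: 'Physics' } },
];
console.log(filterChanges(sampleChanges, { class: 'Calculus' }).map((row) => row.id));
```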
@nickva is it theoretically possible to speed up selectors passed to the replicator by using indexes? I have the same issue as @redgeoff posted and am rewriting my map functions as Mango selectors. Since map functions are often written something like:
function (doc) {
  if (doc.type == 'foo' && doc.safe && ...) {
    emit(...)
  }
}
using partial indexes would make a lot of sense overall.
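For illustration, the two conditions visible in that map function could be approximated by a selector, and a Mango partial index defined over it. This is a sketch under assumptions: `safe: true` is stricter than the truthiness check `doc.safe`, the elided conditions are left out, and the `ddoc`, `name`, and indexed field are made up. The object is shaped as a request body for `POST /{db}/_index`.

```javascript
// Selector covering only the two conditions visible in the map function above.
// Note: `safe: true` matches the boolean true only, unlike the truthy check `doc.safe`.
const fooSelector = { type: 'foo', safe: true };

// Hypothetical body for POST /{db}/_index that defines a Mango partial index.
function partialIndexBody(selector, fields) {
  return {
    index: { partial_filter_selector: selector, fields },
    ddoc: 'foo-partial', // hypothetical design document name
    name: 'safe-foo',    // hypothetical index name
    type: 'json',
  };
}

console.log(JSON.stringify(partialIndexBody(fooSelector, ['type']), null, 2));
```

Such an index speeds up `_find` queries only; a selector passed to `_changes` is still evaluated as a plain filter.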
@nickva I guess I know the answer: to build the _changes feed, the changes themselves have to be indexed somewhere, meaning an option similar to seq_indexed must be implemented for selectors, which pretty much brings us to point 1.
@evsukov89 Yes, you are exactly right!
I believe we're seeing this too. Our JS filter takes a parameter from the request.
We'd love to try a Mango filter, but I can't find how to access a request parameter from Mango.
e.g. "$gt": request.startDate
@oliverjanik when using a Mango filter for changes, the entire filter is specified in the request body, so you can just adjust the filter as you need instead of passing custom parameters. For example:
POST /mydb/_changes?filter=_selector
{
  "selector": { "startDate": { "$gt": "2018-07-01" } }
}
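In client code, the value that used to be a request parameter simply becomes part of the selector you send. A minimal sketch, assuming a hypothetical helper name and a `since` default of `'0'`; it only builds the request and leaves sending it to whatever HTTP client you use.

```javascript
// Hypothetical helper: turns a start date into a _changes request with a Mango selector.
function buildChangesRequest(db, startDate, since = '0') {
  const query = new URLSearchParams({ filter: '_selector', since });
  return {
    method: 'POST',
    path: `/${encodeURIComponent(db)}/_changes?${query}`,
    headers: { 'Content-Type': 'application/json' },
    // The former request parameter is inlined into the selector itself.
    body: JSON.stringify({ selector: { startDate: { $gt: startDate } } }),
  };
}

const changesRequest = buildChangesRequest('mydb', '2018-07-01');
console.log(changesRequest.path);
console.log(changesRequest.body);
```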
Closing as there isn’t anything actionable in here.
@janl I wouldn’t recommend closing this as I think this is a vital feature. I’d even argue that it is a design flaw in CouchDB. Using mango selectors for this is slow and would probably be much too slow for a large dataset. If this remains closed is there a list that this will be added to?
I understand that this would take time to implement and that someone has to do the work :). I just don’t want to see something like this get forgotten.
@redgeoff wouldn't the fix just be "make Mango faster" here?
Hehe... well, that may be a nice enhancement, but what I really think is:
Currently the only way to avoid the latency with this slow filtered replication is to manually shard the database, e.g. in my example above, create a database per class type, but this can get very unmanageable when the scenario gets more complicated. (Creating a new database allows you to partition the data and makes the replication faster).
@wohali let's take an example: you have an app that uses CouchDB as the main data storage. It is intended to be used with offline-first clients (PouchDB/Couchbase Lite 1.x). Let's assume that overall there are about 1k docs to replicate.
Generally speaking you have two options:
Analyzing the consequences of each solution:
@redgeoff we have a patch of indexed changes implementation based on views which works well for us, including filtering by key, etc. We can share the details if you wish.
I do, however, agree this approach has a few shortcomings: view and _changes outputs are not exactly idempotent. For example, if the view no longer emits the doc, how should this be reflected in the _changes feed?
@janl @wohali I’m sorry for my ignorance, but would you mind pointing me to the page explaining the process of submitting proposals for couchdb?
@evsukov89 Sign up to and then email dev@couchdb.apache.org, instructions are on https://couchdb.apache.org/
I haven't read it closely, but I believe you may have missed some specific failure scenarios in your approach. PRs with approaches to filter a _changes feed by a view's secondary index in the past have failed due to invalid logical assumptions that I'm failing to remember just now - beyond just deletes not surfacing.
@wohali thank you.
Yes, as I said, it works for us because we have a very narrow, controlled use case. I do not believe it should go into master like this, but I would still like to start a discussion in this direction.
Expected Behavior
From what I understand, using a view for filtered replication is a lot faster than a filter function as after a view is built it can be reused. I believe that filter functions on the other hand need to be executed each time there is a replication and are not "cached." So, I've found that as a general rule, you should try to use a view whenever you need to filter replication.
The issue is that in order for me to swap out some of my filter functions with views for filtered replication, I'd need to be able to query the _changes endpoint using a key parameter. Is there a reason that a key parameter cannot be added to the _changes API? (Of course, if this is added then the key parameter could also be added to the PouchDB Replication API. This enhancement would lead to significant speed increases in CouchDB/PouchDB initial replication for many data sets.)
Current Behavior
The _view API has a key parameter and it functions great when you want to get a list of all the latest docs. No such key parameter exists for the _changes endpoint.
Steps to Reproduce
The following is a simple example that illustrates the need. Assume a database named people with the design doc
And assume the following docs:
Currently, I can only query for all students and not for those in a specific class: people/_changes?filter=_view&view=views/students_by_class. Instead it would be great to be able to issue people/_changes?filter=_view&view=views/students_by_class&key=Calculus because then I could use a view instead of a filter function in my replication.
Context
Not having the key parameter makes replication a lot slower.
Other
I'd be willing to take a stab at modifying the CouchDB source to add this feature, but I just want to confirm that it is a good idea and that I'm not oversimplifying the enhancement.