Additional Mango-based update handler / VDU functionality

wohali commented 6 years ago

This ticket is intended to gather ideas for what other sorts of update handler / VDU functionality we could add to Mango to provide a declarative replacement for JS-based update handlers and VDUs. This is intended as a "blue-sky" ticket to gather ideas that can be implemented simply in a declarative fashion.

From #1534, I wrote:

There are other proposals to provide similar/useful functionality (#1532, #1533, #1498) if the operator decides to run without a JS engine (#1513), plus I suspect there are thoughts about doing Mango-based update functionality to filter fields, as well as apply database schema enforcement (type and format checking) in a declarative fashion for a NoJS mode. Could your embedded RasPi update function be written declaratively? I assume you're adding info such as incoming IP address, central timestamp, etc. which I could see being done automatically quite easily. This is a use case we can accelerate without the use of JS readily; we just need to agree on an API and how to write it down in a JSON blob.

Extracting:

Add datetime stamp (#1533)
Add IP address from incoming host, or other fields from the req object
Add additional content based on fields in the req object
- Perhaps a lookup table that maps a req field value to a key:value pair (or JSON Patch) to insert?
Filter out (blacklist) or only allow (whitelist) certain fields (lists) - this is partial JSON schema support
Type-checking (field type is string, field date is ISO datetime|UNIX epoch, etc.)
Field validation (format-checking, i.e. string must match a regex like "$###.##" or an integer must be within a certain range)
Consider using output of another Mango index/query for validating a field? For instance, a Mango query of documents that are of type source and extracting the field ip_address, then only allowing documents to come through that have a matching req.ip_address (for example).
- Same as above, but to implement the lookup table. Example: look up source's IP address, auto-add name of sensor which is in the first document that matches in the Mango index

wohali commented 6 years ago

Got to thinking more about this one in the shower. Here's a sample Mango-like document structure for a combined VDU/update handler that would capture all of the above:

{
  "queries": [
    {
      "name": "sensor_name",
      "selector": {
        "type": "sensor",
        "ip_address": "$doc.ip_address"
      },
      "fields": ["name"],
      ...
    }
  ],
  "fields": {
    "whitelist": ["type", "datetime", "ip_address", "station_name", "temperature", "RH%", "rain_gauge", "wind.speed", "wind.direction", "light", "battery.voltage", "battery.current", "line.voltage", "line.current", "solar.voltage", "solar.current", "register_total"],
    "blacklist": ["battery.wattage", "rain_gauge.*"],
    "formats": {
      "type": ["sensor_reading"],
      "datetime": {
        "$or": {
          "$format": "$iso8601",
          "$format": "$unixepoch",
          "$regex": "(\\d{2})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})"
        }
      },
      "battery.voltage": "$float",
      "battery.current": "$float",
      "register_total": { "$regex": "^\\$\\d{1,9}\\.\\d{2}$" },
      "wind.direction": {"$or": ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]}
      ...
    }
  },
  "patch": [
      { "op": "add", "path": "/", "value": { "name": "$query:sensor_name.name" } },
      { "op": "move", "from": "/ip-address", "path": "/ip_address" },
      { "op": "add", "path": "/", "value": { "server_datetime": "$server_time.iso8601" } },
      ...
  ]
}

Comments:

This runs a Mango query up front to get any documents with "type": "sensor" that match the IP address of the incoming request. (Assumes that the $req object has that value, not sure we pass that thru to VDUs right now...)
There's then a field presence check. In a real situation, you would probably only have one of fields.whitelist and fields.blacklist.
Field values/formats are then checked, some just for general structure, one by regex, one by list of acceptable values. The output of a Mango query could be used as input to this, too. (An aside: Mango aggregations could be really useful here, for instance to reduce a query to a single list of acceptable values. See #1254.)
Documents that didn't match "type": "sensor_reading" or any of the other format checks would fall through to the next whatever-we-call-this-thing. If all failed, the document would be rejected.
Maybe we add a "$format": "$datetime" that includes all the common datetime formats?

Just some food for discussion.

ermouth commented 6 years ago

IMO the patch list better be named transform and look like

transform:[
  {
    selector:{  // Mango selector for list of actions
      type: { $neq: "whatever"}
    },
    actions: [  // List of actions
      ["stop"]
    ]
  }, 
  // Next item, used if we didn‘t stop processing
  {selector:{ stamp: { $format:'Int', $gt: 1234567890123}}},
  actions: [
    ["set",  ".ip",  ":peer"   ],
    ["swap", ".a.0", ["a", "1"]],
    ["del",  ".x.y.z"          ]
  ]}
  //...
]

Actions are pretty self-explanatory, so no need to use obj keys. Objects as .actions members may be used later for syntax extensions or nesting.

This also replaces .fields.format typecheck making it more flexible. Typechecks just move to actions list member .selector prop.

Also syntax "battery.voltage": "$float" in fields seems somewhat ambiguous. Using proposed syntax it should be "battery.voltage": {"$format": "Float"}. $ marker before formats/types seems to be excessive, omitting it in favor of uppercasing type tag reads better.

It all seems unexpectedly reasonable. I’ll try to write down some of our _updates in this way, to compare readability and to understand other json-induced benefits/penalties.

And I couldn‘t comprehend what .queries field supposed to do. Can you please give a hint?

wohali commented 6 years ago

@ermouth The patch stuff is coming from JSON Patch, which is an RFC and has widespread support. We're looking at it for partial update support, see #1498. If we build it for that, then we can reuse that support here rather than inventing our own crazy DSL. ;)

wohali commented 6 years ago

Sorry, I hit submit too quickly. The $ before operators is mimicing the Mango syntax, to distinguish operators from operands. You're right that in some places, where the value is expected to be a format (like the float check) we can probably drop the $. As for the format, it's already inside the list for the top-level formats key, so it wouldn't need an additional {"$format":"float"} in my approach, since my approach is intentionally keeping the patch section JSON Patch compatible.

As for queries, the idea is you could run a Mango query before applying the VDU/update handler, then use that information later on. See the first line in the patch entry for use of the lookup to convert an incoming sensor's IP address to a station name. The assumption is that the query will only return the first document, or a reduced set of values (see the reference to Mango aggregation functions being a potential prerequisite for this functionality). Optionally you could cram in there a .0. to reference just the first doc in the result set, I guess. The returned result could also be used in the validator section, for instance, to look up the current list of allowed wind direction values or something.

The pre-update-Mango-query stuff is speculative; there may well be pushback from the db core team on making update functions a little too clever and slow since they're now performing a read before a write. We'll see. I think the proposal is valuable even without that specific functionality.

janl commented 6 years ago

I really like where this is going, great first draft!

Quick initial notes up top:

I like the unification of VDU and _update into one thing that is logically “stuff that happens before a doc hits the db.”
I’d leave out queries in this example, this would open up having to handle cluster network error on the doc write path. We can revisit this at a later point, the current design here allows for an easy extension later on. I think it’ll simplify the already complex discussions required for a baseline feature (which I’m going to extend shortly).

One further aspect that we should discuss the other use of VDUs. To recap, VDUs are used to enforce document schema (this is handled in this draft) and authorisation, where check doc contents against the context of ctx.

http://docs.couchdb.org/en/stable/ddocs/ddocs.html#vdufun

For Example:

function (newDoc, oldDoc, userCtx) {
  // in an authenticated request, userCtx.user is the authenticated username
  // so we can:

  if (oldDoc.author != userCtx.name) {
    throw({ forbidden: 'you can’t update other user’s docs.' })
  }
}

userCtx Definition

In addition, there is a fourth parameter, the secObj, which is the database’s _security object

function (newDoc, oldDoc, userCtx, secObj) {
  if (secObj.members.names.indexOf(userCtx.name) === -1) {
    throw({ forbidden: 'you can’t update if you’re not a member.' }) // contrived example, you wouldn’t get to this part in the first place, because the _security check comes first, but you should get what is meant
  }
}

We could argue that this could co-incide with the security overhaul, which would take care of any authentication, and I’m happy to discuss this separately if needed, I just wanted to bring this up.

Maybe the draft can be amended, so fields becomes schema, and we introduce authorisation as a new top level field:

{
"fields": {
    "whitelist": ["type", "datetime", "ip_address", "station_name", "temperature", "RH%", "rain_gauge", "wind.speed", "wind.direction", "light", "battery.voltage", "battery.current", "line.voltage", "line.current", "solar.voltage", "solar.current", "register_total"],
    "blacklist": ["battery.wattage", "rain_gauge.*"],
    "formats": {
      "type": ["sensor_reading"],
      "datetime": {
        "$or": {
          "$format": "$iso8601",
          "$format": "$unixepoch",
          "$regex": "(\\d{2})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})"
        }
      },
      "battery.voltage": "$float",
      "battery.current": "$float",
      "register_total": { "$regex": "^\\$\\d{1,9}\\.\\d{2}$" },
      "wind.direction": {"$or": ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]}
      ...
    }
  },
  "authorisation": [
       {
         "author": { "$eq" : "$userCtx.name" },
         "throw": "you can’t update other user’s docs." 
    ]}
  }

where the line "throw": "you can’t update other user’s docs." is equivalent to the throw({ forbidden: 'you can’t update other user’s docs.' })-line from the first example above. $secObj would work similarly.

wohali commented 6 years ago

I like this idea, and especially the ability to use multiple docs of this format to enforce rules such as "only users with role sensor are able to add/modify documents of type sensor. It's not fully-fledged per-doc auth since this only is traversed on the write path, but it might be a way forward for the read-path too (if people are willing to accept the massive performance penalty it'll invoke).

The queries idea was blue sky, happy to break it into a separate issue so it doesn't get lost when we implement this one.

I'm flexible on what lives where in the actual doc, would like to see some other opinions as well.

I would say that for authorisation, since you didn't break things out into separate selector and action sections, we use $throw to ensure we don't overlap any tests against a real document key named throw. We'll need to work out escaping just in case someone DOES have a key named $throw, same for the other sections as well.

The alternative is to prefix all document fields with $doc but I think this makes things needlessly verbose, don't you?

janl commented 6 years ago

yeah not a fan of $doc prefix, but $throw works fine.

That might dovetail into my next proposal (that I’ll open a new ticket for), to add string manipulation funs to Mango, those then would follow the same lead, say $strlen(field.sub), but let’s keep that definitely separate from this proposal ;D

mikerhodes commented 6 years ago

@janl should your JSON snippet be (i..e, with the schema and authorization top level fields):

{
    "schema": {
        "whitelist": [
            "type",
            "datetime",
            "others....."
        ],
        ...
    },
    "authorization": [
        {
            "author": { "$eq": "$userCtx.name" },
            "throw": "you can’t update other user’s docs."
        }
    ]
}

Also, I used American spelling for "authorization".

Suggestion:

I think for clarity, splitting out the selector and action in the authorization field as @wohali suggests means you can avoid the ambiguity of field name vs. processing instruction. Collapsing them together feels confusing to me.

Questions:

What would happen when replicating to older db versions that don't understand some part of the validation (e.g., didn't understand formats part of schema)? Be lenient (ignore bits not understood) or strict (fail validation when can't process the whole set of instructions)?
There's an information leak possible: write-only users can discern document shape by observing error messages for schema.
- Perhaps an optional throw clause which would allow the user to override default error to prevent this leakage.

janl commented 6 years ago

@mikerhodes good catch, that's a typo!

wohali commented 6 years ago

@mikerhodes asks:

What would happen when replicating to older db versions that don't understand some part of the validation (e.g., didn't understand formats part of schema)? Be lenient (ignore bits not understood) or strict (fail validation when can't process the whole set of instructions)?

Idea 1: Maybe we need a version field, regardless of what we do with this, if not specified we assume it's the version of whatever the current runtime is Idea 2: Maybe we need a "fail": "closed|open" setting per-VDU Idea 3: Maybe we need a server setting in local.ini to toggle the same thing

The latter 2 come from electronic door locks that generally come with a user-configurable setting for this.

There's an information leak possible: write-only users can discern document shape by observing error messages for schema. Perhaps an optional throw clause which would allow the user to override default error to prevent this leakage.

Well, in Apache CouchDB, there are no such things as write-only users to my knowledge. Are you talking about a Cloudant-specific thing, or something that might come in with #1504 ?

I was imagining that if any of the tests failed (pre-@janl's change) a generic response would be provided, along the lines of document failed validation, maybe including the name of the whatever-we-call-this to point a finger, but not explictly including the precise reason for failure. I like the idea of a $throw.default to override that message if you want, but I'm not sure it does much?

The other option would be to provide a "debug": true/false boolean toggle that would include precise info about the reason for a document's failure. We've done a poor job of helping users debug their JS functions in the past. Putting the control over debugging a Mango-VDU in the hands of the developer would be a big step up in my opinion.

Thanks for the discussion!

mikerhodes commented 6 years ago

Cloudant has write-only, but I did think that it'd likely come in with any access rewrite as it's definitely a useful thing.

ermouth commented 6 years ago

I‘ve tried to re-write several real simple _update fns with proposed syntax. My observations:

schema, whatever implementation is, better have conditionals/guards
update requires conditionals/guards even more: VDU results commute, so you can have many VDUs in different ddocs; this is not true for updates – without conditionals we need as many updates (and endpoints) as number of schemas, which is bit messy and unsafe from integrity POV
transform/patch lists of actions also require guards: schema match doesn‘t automatically mean plain list of actions is applicable
JSON patch is definitely hard to read, too verbose; for long lists repeated key names look as an excessive markup noise
constantly switching between dot notation of Mango paths and slash notation of JSON patch is a direct way to errors
it all needs sandboxed testing tools, entire syntax is not easy to read and, unlike Mango, errors are immediately destructive for incoming data.

making update functions a little too clever and slow

AFAIK more risk is to make them dumb and cryptic as old-style rewrites )

Here is a set of existing tools of the sort, hope it can be a source for further inspiration https://stackoverflow.com/questions/1618038/xslt-equivalent-for-json/49011455#49011455

janl commented 6 years ago

schema, whatever implementation is, better have conditionals/guards

@ermouth do you have an example?

ermouth commented 6 years ago

@janl latest json schema has if/then/else

http://json-schema.org/draft-07/json-schema-release-notes.html

mikerhodes commented 6 years ago

I'll suggest the obvious: that the guards be declarative and mango selectors.

I think "condition" may be the right term for the authorization's actual business logic (might be a better term from the authorization literature, but it escapes me right now).

Combining the suggestion to split out the authz section into three fields:

{
    "authorization": [
        {
            "guard": {"type": { "$eq": "post" }},
            "condition": {"author": { "$eq": "$userCtx.name" }},
            "throw": "you can’t update other user’s docs."
        }
    ]
}

ermouth commented 6 years ago

@mikerhodes I already suggested exactly same approach (using Mango selectors as guards/conditions), see above, however it was assumed too clever. Nice to see this obvious and practical way came not only in my head.

What is the difference between guard and condition in you example? And why special authorization field needed; I mean why special field if a list of guard/actionList pairs may be used without pinpointing each pair semantic meaning?

wohali commented 6 years ago

@ermouth I think I misunderstood what you were proposing...probably a language barrier. It's clearer to me now.

I think the guard is checked first, to make sure that the document would even have the field in question, and then the condition is checked. This would translate to JS in such a way:

if (doc.type === "post") {
  if (doc.author != userCtx.name) {
    throw("you can't update other users' docs.")
  }
}

Am I right @mikerhodes ? And if so, perhaps your condition has the wrong logic sense? Not sure...

mikerhodes commented 6 years ago

@wohali Yes, that's the basic logic. I think my condition logic is the wrong way around, because I would say that the throw happens if condition is false rather than if condition is true.

if (doc.type === "post") {   // i.e., "if <guard condition>"
  if (! (doc.author == userCtx.name)) {  // i.e., "if not <condition>"
    throw("you can't update other users' docs.")
  }
}

@ermouth I can see that you could combine the guard and the condition into a single clause quite easily given they are both Mango selectors. I just find that the logical distinction of "the types of thing this authorisation decision applies to" (guard) and "the authorisation condition itself" (condition) makes things more clear.

As to whether schema and authorization are worthy of separate top-level concepts when I could see that you're right again that they could be combined in a kind of guard-condition-action type framework, I'm unsure. Again, it perhaps helps to separate them from a readability point of view, "here's my schema check, here's my authorisation check". Unsure.

garbados commented 5 years ago

I wrote up a gist describing a possible /{db}/_update endpoint for facilitating Mango-based VDUs, based on the discussion in this thread. I based the transform syntax on this secret update syntax in the Mango source. The transform syntax could also be dropped from the proposal entirely, as it adds a great deal of extra work over simply validating an update.

What do y'all think?

garbados commented 5 years ago

After digging into Mango's source and discussing this feature with folks, I think only one new method+endpoint is necessary to facilitate Mango VDUs: POST /{db}/_update, which acts much like POST /{db} but it applies the new doc to the first matching VDU. The ?use_index=... query parameter can be used to force the use of a particular VDU, much like in POST /{db}/_find.

VDUs in this case are Mango indexes with the type field set to vdu. Their form differs from a "type": "json" index in the following ways:

No partial_filter_selector since applying something at "index time" doesn't make sense for this context.
No index.fields, since there is no indexing involved in a VDU.
The field index.guard contains a selector which checks if the VDU applies to a given document. Documents which do not pass this selector are not affected by this VDU.
The field index.condition contains a selector which validates the document against the VDU. The VDU will reject documents which pass its guard but not its condition.
If an update selector syntax is ever developed or implemented, an index.action or index.transform field could be used to apply them to documents which pass both the guard and the condition.

I suggest this approach because mango_idx.erl already contains facilities for supporting new types of indexes, making it much easier to develop a new type of index than to reimplement all the /{db}/_index handlers for a new endpoint.

VDU selectors in this formulation can use string substitution to access $userCtx and $secObj. Accessing $oldDoc requires a fabric call, which extends the window for conflicts to appear, so I would be inclined to permit only index.condition to access it so as to minimize which updates can create lengthy conflict windows.

For example:

{
  "index": {
    "guard": { "type": "article" },
    "condition": {
      "name": { "$in": "$secObj.members.names" },
      "$or": {
        "$not": { "$exists": "$oldDoc.author" },
        "author": { "$eq": "$oldDoc.author" }
      }
    }
  },
  "type": "vdu"
}

Does that make sense? Seem reasonable?

mikerhodes commented 5 years ago

Good to see some movement here, and nice write up.

It feels a little odd to overload the term index in this way -- in particular because, as you say, "there is no indexing involved" 😄 . I'm not sure this UX is going to work out, it feels like putting a VDU as a sibling to a map function. In many ways, this is unrelated to Mango querying so I'd say that, like /_update is new, a new /_vdus or whatever endpoint that's a parallel to /_index is more appropriate.

Another reason to avoid overloading the index term is that the VDU index section has a very different schema to a normal Mango index section too, which makes writing client libraries pretty tough (especially those libraries that map JSON to concrete types, which is a primary utility of a client library over raw HTTP calls IMO).

garbados commented 5 years ago

That makes sense, that overloading /_index with a process that doesn't involve indexing does indeed not make sense. Thanks for that reality check :)

I still like the guard-condition syntax for the VDU, where the guard and condition are selectors with different expectations and abilities. I'll go for another few dives into the source and see if I can't figure out a PR.

garrensmith commented 5 years ago

Diana, this is really cool. If you need some help understanding the code, let me know. I’m happy to help.

From: Diana Thayer notifications@github.com Sent: Wednesday, January 23, 2019 6:11 PM To: apache/couchdb Cc: Subscribed Subject: Re: [apache/couchdb] Additional Mango-based update handler / VDU functionality (#1554)

That makes sense, that overloading /_index with a process that doesn't involve indexing does indeed not make sense. Thanks for that reality check :)

I still like the guard-condition syntax for the VDU, where the guard and condition are selectors with different expectations and abilities. I'll go for another few dives into the source and see if I can't figure out a PR.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/apache/couchdb/issues/1554#issuecomment-456862512, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAK9AsNusxhLYYBUvmFubc5WpW1MulBUks5vGIm_gaJpZM4V5aai.

garbados commented 5 years ago

@garrensmith That's very much appreciated. I'd love for some pointers on which files will contain pertinent code, functions I might have to edit or copy, etc. Otherwise I'm just stumbling through the source and hitting my local couch with requests, hoping the responses change in response to my actions.

garrensmith commented 5 years ago

I'm going to try and leave some ideas on how to implement this. The one thing I've noticed is that this is quite a large amount of work, so I think try and break it up into smaller usable PR's. I also think it is worth creating a RFC for this with all the information on this and add in the gist so that its in 1 central place.

I would start on the transform section first. That section could possibly be used in the mango query section. We could take a document we want to index and then transform it before adding it to the index. This has been a requested feature.

For $transform, $schema and $authorization section All of the mango syntax manipulation lives here https://github.com/apache/couchdb/blob/master/src/mango/src/mango_selector.erl

I think it might be best to create a new mango_vdu_selector or something file that uses parts of mango_selector and does all the validation and transformation of a document based on the syntax.

Saving/creating/getching of VDU's: The CRUD for a mango index is here https://github.com/apache/couchdb/blob/master/src/mango/src/mango_idx_view.erl Again I would probably create another file e.g mango_vdu and leverage mango_idx_view to do all the retrieving of the vdu's.

I hope that helps a little.

janl commented 5 years ago

I swear I thought I had mentioned this already, but apparently not on this thread.

I’d leave the transform bits out of this proposal and focus on the VDU part: validate that a doc looks like what you want it to be and comes from someone you want it to be.

I’d leave mango-like syntax for doc transformation to #1559.

garrensmith commented 5 years ago

+1 to reducing the scope of this work.

From: Jan Lehnardt notifications@github.com Sent: Wednesday, February 20, 2019 12:57 PM To: apache/couchdb Cc: garren smith; Mention Subject: Re: [apache/couchdb] Additional Mango-based update handler / VDU functionality (#1554)

I swear I thought I had mentioned this already, but apparently not on this thread.

I’d leave the transform bits out of this proposals on focus on the VDU part: validate that a doc looks like what you want it to be and comes from someone you want it to be.

I’d leave mango-like syntax for doc transformation to #1559https://github.com/apache/couchdb/issues/1559.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/apache/couchdb/issues/1554#issuecomment-465526805, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAK9Alq3mMGDv4fIXGZ8FmeiK12tIDM3ks5vPSo3gaJpZM4V5aai.

apache / couchdb

Additional Mango-based update handler / VDU functionality #1554