Database Triggers - Githubissues

ChristianGruen commented 9 years ago

Allow definition of database triggers to react on changes in a database (creations, updates, deletions).

Open questions:

What is the smallest observable unit? A Resource? Arbitrary XML nodes?
Is it sufficient to work with affected resource paths, or do we also want to cache and return updated nodes?
How are triggers defined? As independent XQuery expressions?
How are errors treated that occur during trigger execution?
- Will the original transaction return an error?
- Will another transaction be triggered?
- Will it only be remembered in the database log files?
If we remove the current database event architecture, do we still want to inform clients on changes, or do we use something else like web sockets?

Database Commands:

OPEN db
TRIGGER INSTALL trigger.xq delete
TRIGGER DELETE trigger.xq
TRIGGER LIST

XQuery:

let $db := 'db'
let $code := 'push-it.xq'
let $operation := 'delete'
return trigger:install($db, $code, $operation)

Trigger Query (trigger.xq):

declare variable $db as xs:string external;
declare variable $resources as xs:string* external;
declare variable $operation as xs:string external;
admin:write-log(
  "Operation: " || $operation ||
  ", Database: " || $db
  ", Resources: " || count($resources)
)

Feedback is welcome!

mgaerber commented 9 years ago

Since resource/document is the unit of storage, this would be a natural fit for triggers
See above, path of document
XQuery expressions or functions would be really nice
Maybe this could be set while registering the trigger?

from SQL

CREATE TRIGGER { NAME }  [ON { RESOURCE }]  [WITH { OPTION }]
FOR | AFTER | INSTEAD OF
INSERT | UPDATE | DELETE

Where RESOURCE would be a single doc or if omitted for every doc (maybe support glob syntax) And OPTION could be used to state such things as how to handle errors

Example:

TRIGGER INSTALL "trigger.xq" ON "foo.xml" AFTER INSERT

dirkk commented 9 years ago

:+1: for database triggers and another :+1: for Max' suggestions

Regarding the trigger code I would be interested how you planned to store the XQuery trigger functions, e.g. in trigger.xq? Are they supposed to reside on the file system or are they parsed and stored within the database itself? I guess the first one is easier, however I like the second one more. I don't care about the performance that much (imho the functionality could also be stored non-parsed), but I guess the user experience is much easier - You don't have to take care where to store the xq-Files and you can delete the files after you created them. For me, a simple copy to some internal directory would be good enough. If not, I would at least opt for a standard directory where all triggers are located.

Also, I would prefer to use function items instead of single xq-files. This would allow - at least in my opinion - more flexibility, shorten the code and with function items being first-level citizens it feels like a natural choice to me. So the example would look like

let $db := 'db'
let $code := function($db, $resources, $operation) {
admin:write-log(
  "Operation: " || $operation ||
  ", Database: " || $db
  ", Resources: " || count($resources)
)
let $operation := 'delete'
return trigger:install($db, $code, $operation)

Using database commands we could allow XQuery expressions, e.g. both would be valid:

TRIGGER INSTALL "Q{http://my-namespace}trigger#3" ON "foo.xml" AFTER INSERT 
TRIGGER INSTALL "function($d, $r, $o) {admin:write-log("Operation: " || $operation || ", Database: " || $db ", Resources: " || count($resources))" ON "foo.xml" AFTER INSERT

This would allow to store all or some trigger functions in one module. Having to write 100 xq files for 100 triggers seems cumbersome to me. It would entail that we also have to create a name, quite likely the structure as in SQL would be sufficient. As we would not implement INSTEAD OF or FOR (at least in the beginning and AFTER could simply be the default) I think the following database command structure should work:

TRIGGER INSTALL <trigger-function> AS <name> ON <glob-resource> [INSERT | UPDATE | DELETE]

Regarding errors in the trigger I would opt for simply logging them. Everything else seems quite complicated to me and I don't think trigger functions should be written in a way that they can fail (instead, error handling should be done within the trigger functions).

Regarding the current event architecture I am all for removing this feature. Having a trigger should enable one to write a connection to some message broker like ZeroMQ, which could notify a client application.

ChristianGruen commented 9 years ago

Thanks for the inspiring feedback… I'm trying to summarize our first ideas:

Smallest Entities

We'll probably settle down on having database resources (XML, binary) as smallest observable entity. Once we choose this path, this may also have effects on other features like backups, replication, distribution, etc. On the other hand, we'll need to gain more experience with large collections and get to know new bottlenecks.

SQL Syntax

The alignment with existing SQL statement makes sense to me. Fortunately, there are many similarities with our first ideas.

Storing Triggers

I like the idea of simply storing trigger query files in the database (i.e., as binary resource):

We generally try to provide self-contained file structures (databases, repository files)
Triggers could also be managed without a TRIGGER command.
We will probably need to reserve a path in each database such that we can automatically locate triggers. .triggers would be an obvious choice.
To make it completely self-contained, we should move remaining semantics (operations, before/after/...) into the query files as well

Trigger Expressions

Using function items for triggers was my first idea as well, but one problem (among others) is that function items can reference other expressions from the original query. See the following example:

  declare variable $global := db:open('haha')

  declare function local:evil() {
    "I'm belonging to the original query"
  };
  let $code := function($db, $resources, $operation) {
    local:evil(), $global
  }
  return trigger:install('db', $code, 'delete')

This would mean that the original query cannot be finalized as long as there still may be another function to be executed. – We have similar problems with xquery:eval or client:query, which are not allowed to return function items for exactly that reason.

But I agree it's ugly to work with things like external variable declarations. A way out could be to represent operations via function names or annotations:

(: 1st approach :)
declare function trigger:delete($db, $res) { ... };

(: 2nd approach :)
declare %trigger:delete function f($db, $res) { ... };

Error Handling

Good point again: As we are using XQuery, it's indeed easy to use try/catch. If things still fail, we can restrict ourselves to logging them.

Trigger Execution

Do you have some preference towards synchronous or asynchronous execution?

If triggers are executed synchronously, an updating query will only be finished after all triggers have been executed. This means that the execution of trigger queries will potentially block other database requests. It also means that we need to extend our locking analysis: It can e.g. happen that the trigger query writes to other databases. As a result, we would need to lock those databases as well before executing the original query.
If we choose asynchronous execution, other queries may be executed while triggers are still running. As a result, a trigger query may not be able to access resources anymore that have been updated by the original query.

dirkk commented 9 years ago

I see the problem with using function items now, thanks for the explanation.

I prefer the second approach using annotations, simply because I think it is more flexible and also a nicer syntax. If I understood you correctly, you want to move all functionality into the functions itself, so that they are self-contained. So wouldn't it make sense to move all functionality there? So e.g. we could have our file trigger.xq within the .trigger directory:

declare 
  %trigger:delete
  %trigger:on("resource.xml")
function f($db, $res) { ... };

This way, we could have a simple TRIGGER INSTALL command, which parses these files when called. Of course, we still could have convenience operations for installing and updating triggers in the way already proposed here.

Regarding the trigger execution I would definitely go for synchronous execution. Three reasons:

Asynchronous execution is very surprising and might be useless if your updated nodes have been changed by another query in the meantime. It is hard to figure this out from an application perspective.
Using synchronous execution, you can once again call some message broker or some other mechanism to have an asynchronous execution. But if we implement it in an asynchronous way, the user can't make it synchronous.
It's what all relational databases I am aware of are doing. It might be wise not to ignore 40+ years of research

ChristianGruen commented 9 years ago

So wouldn't it make sense to move all functionality there?

Right, that's what I would propose. Your example makes sense.

This way, we could have a simple TRIGGER INSTALL command, which parses these files when called.

Exactly. It would primarily be used to validate the trigger queries. Apart from that, it would do nothing else than copying the files to the database directory.

Regarding the trigger execution I would definitely go for synchronous execution.

In that case, it could make sense not to attach trigger queries to specific databases, but instead provide a single, global .trigger directory:

If we attach trigger queries to databases, our existing locking mechanism will fail, because we would first need to look into the database in order to find out if the trigger function will eventually lock other databases.
Beside that, you can access any other database from a trigger function.

mgaerber commented 9 years ago

+1 for the current outcome of the discussion.

While re-reading some SQL trigger spec, I think we need to clarify what the "ON " part really means.

In SQL terms it reacts on row level (insert, delete) operations. While there are other kinds of triggers for table level (drop, alter) operations.

So "ON " would currently mean: fire the trigger if something happens (e.g. insert nodes) with these resources

Question: Would the trigger "ON foo*.xml" also fire if a document named foo123.xml was added to database?

This could be solved by just adding another kind: ADD (resource) : [ADD | INSERT | UPDATE | DELETE]

Still, DELETE remains ambiguous (delete some node or delete resource?).

ChristianGruen commented 9 years ago

First of all, if we decide not to restrict triggers to databases, we could add two database-specific operations:

Fire CREATE if a database is created
Fire DROP if a database is dropped

Next, I propose that all update operations should be summarized on resource level:

Fire ADD if a document is added
Fire UPDATE if anything in a document is changed
Fire DELETE if a document is deleted
Potentially, fire REPLACE if a document is replaced

Otherwise, multiple updates on a documents would usually lead to numerous trigger calls:

db:open('db', 'doc.xml') update (
  delete node .//text(),  (: → one DELETE call per text node? :)
  //a ! (replace value of node . with 'new text') (: → one UPDATE call per element? :)
)
(: → I think a single UPDATE call should suffice :)

Here are some examples how the resulting annotations could look like (I decided to merge the discussed annotations into one):

(: Called when any new database is created :)
declare %trigger:create function create-any($db) { ... };
(: Called when a db is dropped whose name starts with "db" :)
declare %trigger:drop("db*") function create($db) { ... };
(: Is called when an XML resource is updated in the database "books" :)
declare %trigger:update("books", "*.xml") function update($db, $res) { ... };
(: Is called when an image is deleted from any database :)
declare %trigger:delete("*", "*.jpg,*.gif,*.bmp") function delete($db, $res) { ... };

Those would be the remaining trigger commands and functions:

Commands:

# Checks and installs the specified trigger module
TRIGGER INSTALL /local/path/to/trigger.xqm
# Removes a trigger module
TRIGGER DELETE trigger.xqm
# Lists the triggers defined in the modules
TRIGGER LIST

Functions:

trigger:install('/local/path/to/trigger.xq'),
trigger:delete('trigger.xq')
trigger:list()
trigger:list-details()

ChristianGruen commented 9 years ago

One more thought: We may need to sort out if the fired events relate to the original query or to the optimized update operations. Some examples:

The deletion and insertion of a document could possibly be rewritten to a replace operation. As a result, the following two queries might be equivalent:

(: 1 :) (db:delete('db', 'doc.xml'), db:add('db', 'doc.xml')
(: 2 :) db:replace('db', 'doc.xml', <new-doc/>)

The replace operation, or the delete and add operations, may even be optimized away if it turns out that the source and the target documents are identical. This can e.g. happen if a user clicks on a SAVE button without changing any contents.

This leads to questions like: Do we want to fire events if the database contents have not really changed?

apb2006 commented 9 years ago

So this is triggerXQ; annotated based, but unlike RESTXQ and like the REPO, requires installation? In the case when multiple triggers "trigger" - in what order do they run?

On 20 February 2015 at 21:32, Christian Gruen notifications@github.com wrote:

One more thought: We may need to sort out if the fired events relate to the original query or to the optimized update operations. Some examples:

The deletion and insertion of a document could possibly be rewritten to a replace operation. As a result, the following two queries might be equivalent:

(: 1 :) (db:delete('db', 'doc.xml'), db:add('db', 'doc.xml')(: 2 :) db:replace('db', 'doc.xml', )

The replace operation, or the delete and add operations, may even be optimized away if it turns out that the source and the target documents are identical. This can e.g. happen if a user clicks on a SAVE button without changing any contents.

This leads to questions like: Do we want to fire events if the database contents have not really changed?

— Reply to this email directly or view it on GitHub https://github.com/BaseXdb/basex/issues/1082#issuecomment-75322862.

ChristianGruen commented 9 years ago

So this is triggerXQ; annotated based, but unlike RESTXQ and like the REPO, requires installation?

Nice name ;) I think we can automatically scan the .trigger directory, and parse new modules or discard deleted ones, after the successful execution of an updating query.

In the case when multiple triggers "trigger" - in what order do they run?

Good question. I just looked up the PostgreSQL documentation. It states that "SQL specifies that multiple triggers should be fired in time-of-creation order. PostgreSQL uses name order, which was judged to be more convenient.". We could order all triggers by:

the names of the trigger modules (triggers1.xq, triggers2.xq, …) and
their line position in that file (or, alternatively, their function name)

dirkk commented 9 years ago

I would advise against using the line position and instead use the function name. I would find it highly surprising if the order of functions in a file (or line spaces or indentations) in XQuery should be relevant, i.e. reordering my code would change my program. Using the function names as a default order is less surprising to me.

Additionally, one could think of adding an optional %trigger:priority(xs:integer) annotation, with functions not having this annotation having the lowest possible priority.

dcore94 commented 9 years ago

Nice to hear about triggers. Especially if they can be catched from inside XQuery itself (unlike former db:event). Btw, annotation was my first idea too so I definitely prefer it over filenames. One good thing of db:event was the fact that it could be fired explicitly without depending necessarily on an update operation of a resource. This enabled loosely coupled integration of aspects like auditing, building access stats, hot/cold database management.... Maybe you could evaluate the possibility of extending the trigger API adding something like a generic event trigger (one that has a name but is not bound to specific resources or dbs)?

trigger:event($name as xs:string, $data as item()*)

declare %trigger:event("*") %trigger:event-data("{$eventdata}") function e($db, $eventdata) { ... };

dirkk commented 9 years ago

One of the reasons why I don't particularly liked the old event architecture is that such a loosely coupled integration is in my opinion not a job for a database system or query processor. What you can already do is to use some 3rd party tool like a message queue (ZeroMQ, ...), which could trigger such events and could get notified. Given that such tools already exist, are highly performant and can be used to achieve the same effect (and much more) I personally would not re-add such an event trigger.

ghost commented 9 years ago

5 . If we remove the current database event architecture, do we still want to inform clients on changes, or do we use something else like web sockets?

Ideally, clients should be informed on changes. Personally, I kind of like the web sockets approach, where the client would register to a BaseX web socket server and listen for the database events. But anything that is handy and easy to setup on the end is welcome.

xokomola commented 9 years ago

I think it's fair to say that a database such as BaseX is both a database engine but also a platform for programming general applications (due to the nature of XQuery and web facilities such as RESTXQ).

DBA's probably tend to the database engine side and would find most value in the lower-level database triggers as tools to strictly guard the data.

Web application developers would probably tend to see most value in a more loosely coupled, flexible, event system.

The question in my mind is if both concerns should be reconciled into one set of eventing/trigger facilities or if they are so different that they require different facilities.

When it comes to database triggers (the limited scope) I would want

Enable triggers to cancel transactions
Run before and/or after the operation
Run synchronously (is already implicit in the first item)

Most of these points were already raised above but there were also points made in the direction of a more general event system. I also see these triggers, apart from the similarity with SQL triggers, similar to Git hooks. General advice is to keep them very short and simple (because they kind of run synchronously and inside a transaction). In an XML database context I could imagine they can be used to handle some XML validation.

Maybe, maybe there should also be a way of notifying others of these events (but only after the operation and asynchronously). But I think this would already lead to the second and wider scope of a general event system which has the added complication of running code asynchronously.

xokomola commented 9 years ago

Ah, one thing about "smallest entities".

I believe the current consensus is on not addressing individual nodes within a database resource and that's probably sufficient. However there's one level besides database resources and individual XML nodes that hasn't been addressed I think. Suppose I have order XML files stored in a database with a path "acme/orders/". Although the path is "just" a string it does express virtual collections. With db:open or fn:collection() I can get all order document nodes by specifying the path to it's directory or virtual collection. Would I be able to put a database trigger only on such a directory so that each new order or update on an existing order causes an event? I ask this because currently it seems that the path is just a string and the fact that db:open returns all document nodes under a specific path is just a simple convenience but otherwise has no explicit representation in the database.

ghost commented 9 years ago

I think it's fair to say that a database such as BaseX is both a database engine but also a platform for programming general applications (due to the nature of XQuery and web facilities such as RESTXQ).

Yes, in the case of web applications, there needs ways not only to trigger events but also to react to database events on the client-side. That's why I think a web socket implementation for triggers inside BaseX would be a nice solution.

Would I be able to put a database trigger only on such a directory so that each new order or update on an existing order causes an event?

To my understanding, according to the draft implementation, you could filter the events through a module and therefore trigger actions on specific documents updates, like on a specific path or directory. It would need to support the glob syntax I guess.

ChristianGruen commented 1 year ago

Discarded (discussion has stopped long time ago).

BaseXdb / basex

Database Triggers #1082