Unified Log - Githubissues

We want to get data from Google App Engine (GAE) into a Unified Log that lives outside GAE. Other systems that need the data for further processing or reporting can then read it from this log.

As an initial prototype, we will try to keep a CartoDB Postgres database up to date with the log as new data continuously arrives.

Initial approach

To get data out of GAE, simply hook into the various save methods in BaseDAO and post each change to the log.
Use a table in a Postgres database as the "unified log". The table could be as simple as a single column of JSON data.
Other processes can either read the transaction log (for example with wal2json, https://github.com/eulerto/wal2json) or periodically query the table for new data.
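
The "table as a log" idea and the periodic-query option can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere, and the table name event_log, the column names and the helper functions are hypothetical. It adds an auto-incrementing id next to the JSON column so that a consumer can ask for "everything after the last id I saw".

```python
import json
import sqlite3


def create_log(conn):
    # Hypothetical schema: an increasing id plus one JSON payload column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL)"
    )


def append_event(conn, event):
    # Serialise the change as JSON and append it; returns the new row id.
    cur = conn.execute(
        "INSERT INTO event_log (payload) VALUES (?)", (json.dumps(event),)
    )
    conn.commit()
    return cur.lastrowid


def read_since(conn, last_seen_id, limit=100):
    # Periodic polling: fetch rows with an id greater than the last one seen.
    rows = conn.execute(
        "SELECT id, payload FROM event_log WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, limit),
    ).fetchall()
    return [(row_id, json.loads(payload)) for row_id, payload in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
create_log(conn)
append_event(conn, {"op": "save", "kind": "ExampleEntity", "key": 1})
append_event(conn, {"op": "save", "kind": "ExampleEntity", "key": 2})
new_rows = read_since(conn, last_seen_id=0)
```

A consumer would remember the highest id it has processed and pass it as last_seen_id on the next poll. In Postgres with concurrent writers, rows can become visible out of id order, so a real consumer would have to account for that.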

Useful links

Building a FIFO queue in Google App Engine

There are two types of queues in GAE: push queues (https://cloud.google.com/appengine/docs/java/taskqueue/overview-push) and pull queues (https://cloud.google.com/appengine/docs/java/taskqueue/overview-pull).

At the moment we use push queues: a task is published and, at some point, executed by Google App Engine. Push queues give no FIFO guarantee: a task can fail and be rescheduled, in which case other tasks are executed first, and when a queue is congested, newer tasks might be served first to reduce latency.
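
The retry problem can be illustrated with a toy simulation. This is not how GAE schedules tasks internally; it is a deliberately simplified model (an assumption made for illustration) in which a failed task is simply put back at the end of the queue, which is enough to show that completion order can differ from publish order.

```python
from collections import deque


def run_push_queue(tasks, fails_once):
    """Toy model: run tasks in order, re-queueing each failing task once."""
    queue = deque(tasks)
    already_failed = set()
    completed = []
    while queue:
        task = queue.popleft()
        if task in fails_once and task not in already_failed:
            # First attempt fails: the task is rescheduled behind newer tasks.
            already_failed.add(task)
            queue.append(task)
            continue
        completed.append(task)
    return completed


order = run_push_queue(["e1", "e2", "e3"], fails_once={"e1"})
print(order)  # ['e2', 'e3', 'e1']
```

Here e1 was published first but completes last, so a log written from inside the task handlers would not reflect the original order of the changes.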

Pull queues work differently: in GAE, you publish tasks to a queue, and they just sit there until an external process (in our case, flow services) uses an API to 'claim' tasks (up to 1000 at a time), which are then delivered to it. The external service can then execute them and, if successful, must use the API to delete them from the GAE queue. Whether the API delivers tasks in FIFO order seems to be a matter of some debate: see http://stackoverflow.com/questions/12422094/is-the-pull-queue-in-gae-exhibit-consistent-fifo-behavior on the one hand, and the statement "The API returns the specified number of tasks in order of the oldest task ETA." in the GAE documentation on the other.

It seems that push queues are out of the question, and that pull queues could work, depending on how the FIFO question is resolved.
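
The claim-then-delete cycle can be sketched with a small in-memory model. This is not the GAE task queue API; the class ToyPullQueue and its method names are hypothetical, time is passed in explicitly to keep the example deterministic, and the model assumes that a claimed task becomes claimable again once its lease expires and that tasks are handed out in order of their original ETA.

```python
import itertools


class ToyPullQueue:
    """In-memory stand-in for a pull queue with lease/delete semantics."""

    def __init__(self):
        self._tasks = {}
        self._ids = itertools.count(1)

    def add(self, payload, now):
        # Publish a task; it just sits in the queue until someone leases it.
        task_id = next(self._ids)
        self._tasks[task_id] = {"eta": now, "payload": payload, "leased_until": 0}
        return task_id

    def lease(self, max_tasks, lease_seconds, now):
        # Claim up to max_tasks unleased tasks, oldest ETA first.
        available = sorted(
            (task["eta"], task_id)
            for task_id, task in self._tasks.items()
            if task["leased_until"] <= now
        )
        leased = []
        for _, task_id in available[:max_tasks]:
            self._tasks[task_id]["leased_until"] = now + lease_seconds
            leased.append((task_id, self._tasks[task_id]["payload"]))
        return leased

    def delete(self, task_id):
        # Called by the consumer only after the task was handled successfully.
        self._tasks.pop(task_id, None)


queue = ToyPullQueue()
first = queue.add("event-1", now=0)
second = queue.add("event-2", now=1)
batch = queue.lease(max_tasks=1000, lease_seconds=60, now=10)
queue.delete(first)  # event-1 handled; event-2 stays leased until time 70
```

In this model, a task that is claimed but never deleted (for example because the consumer crashed) shows up again in a later lease call once its lease has run out, so it is not silently lost.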

If we were to use a pull queue, flow services would need to know when to query each FLOW instance. Continuous polling would not work, as it would keep all FLOW instances awake 24 hours a day, which is unnecessary (and costs money). One option might be to notify flow services when new events are available, so that it starts reading them. The good thing about a pull queue is that flow services must explicitly read each event and then delete it before it counts as processed, which should make the system more robust.
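
The "notify, then read until empty" idea can be sketched as a drain loop that runs once per notification instead of polling continuously. The function drain and its lease, delete and handle parameters are hypothetical placeholders for whatever flow services would actually call; the sketch assumes that lease returns a list of (task_id, payload) pairs and an empty list when nothing is left.

```python
def drain(lease, delete, handle, batch_size=1000):
    """Read events in batches until the queue is empty; return the count handled."""
    processed = 0
    while True:
        batch = lease(batch_size)
        if not batch:
            # Nothing left: stop and wait for the next notification.
            return processed
        for task_id, payload in batch:
            try:
                handle(payload)
            except Exception:
                # Leave this and later events in the queue so they can be
                # retried in order on the next run.
                return processed
            # Delete only after the event was handled successfully.
            delete(task_id)
            processed += 1


# Tiny in-memory stand-in for the queue, for illustration only.
pending = [(1, "event-1"), (2, "event-2"), (3, "event-3")]
handled = []


def fake_lease(n):
    return pending[:n]


def fake_delete(task_id):
    pending[:] = [t for t in pending if t[0] != task_id]


count = drain(fake_lease, fake_delete, handled.append, batch_size=2)
```

Stopping at the first failure is a design choice made here to keep events in order; whether that is the right trade-off depends on how the FIFO question above is resolved.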

akvo / akvo-product-design

Unified Log #67
