akvo / akvo-product-design

Products Design Documents
GNU Affero General Public License v3.0

Unified Log #67

Closed jonase closed 8 years ago

jonase commented 9 years ago

We want to get data from Google App Engine into a Unified Log (external to GAE). This Log can then be used by other systems that need the data for further processing or reporting.

As an initial prototype we will try to keep a CartoDB Postgres database up to date with the Log as new data continuously arrives.

Initial approach

Useful links

mtwestra commented 9 years ago

Building a FIFO queue in Google App Engine

There are two types of queues in GAE: push queues (https://cloud.google.com/appengine/docs/java/taskqueue/overview-push) and pull queues (https://cloud.google.com/appengine/docs/java/taskqueue/overview-pull).

At the moment, we use push queues: a task is published and at some point executed by Google App Engine. With push queues, there is no FIFO guarantee. Tasks can fail and be rescheduled, in which case other tasks are executed first, and when a queue is congested, newer tasks might be served first to reduce latency.
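To make the reordering concrete, here is a minimal sketch in plain Python. It does not use the GAE API. The function `run_push_queue` and its `fails_once` parameter are made up for illustration, and the retry policy (a failed task goes to the back of the queue) is an assumption, not how GAE actually schedules retries.

```python
from collections import deque


def run_push_queue(tasks, fails_once):
    """Simulate a queue that retries a failed task by re-enqueueing it at the back.

    tasks: task names in publication order.
    fails_once: names of tasks whose first attempt fails (hypothetical).
    Returns the order in which tasks completed successfully.
    """
    queue = deque(tasks)
    already_failed = set()
    completed = []
    while queue:
        task = queue.popleft()
        if task in fails_once and task not in already_failed:
            already_failed.add(task)
            queue.append(task)  # rescheduled, so later tasks now run first
            continue
        completed.append(task)
    return completed


order = run_push_queue(["e1", "e2", "e3"], fails_once={"e1"})
print(order)  # ['e2', 'e3', 'e1']
```

A single failed attempt on `e1` is enough for a consumer to see `e2` and `e3` before it, which is the ordering problem for a log.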

Pull queues work differently: in GAE, you publish tasks to a queue. They just sit there until an external process (in our case, flow services) uses an API to 'claim' tasks (up to 1000 at a time), which are then delivered to it. The external service can then execute them, and if successful, it needs to use the API to delete them from the GAE queue. Whether the API delivers tasks in FIFO order seems to be a matter of some debate: see http://stackoverflow.com/questions/12422094/is-the-pull-queue-in-gae-exhibit-consistent-fifo-behavior on the one hand, and "The API returns the specified number of tasks in order of the oldest task ETA." from the GAE documentation on the other.

It seems that push queues are out of the question, and pull queues could work, depending on the FIFO issue.
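The claim, execute, delete cycle could look roughly like the sketch below. It is an in-memory toy, not the GAE API: `InMemoryPullQueue`, `drain`, and the `now` / `lease_seconds` parameters are all hypothetical names, and the oldest-ETA-first ordering is simply assumed to hold, which is exactly the open question.

```python
import itertools


class InMemoryPullQueue:
    """Toy stand-in for a pull queue; not the GAE API."""

    def __init__(self):
        self._tasks = {}         # task id -> (eta, payload)
        self._leased_until = {}  # task id -> time at which the lease expires
        self._ids = itertools.count(1)

    def add(self, payload, eta):
        task_id = next(self._ids)
        self._tasks[task_id] = (eta, payload)
        return task_id

    def lease(self, max_tasks, lease_seconds, now):
        """Return up to max_tasks tasks that are not currently leased, oldest ETA first."""
        available = sorted(
            (eta, task_id, payload)
            for task_id, (eta, payload) in self._tasks.items()
            if self._leased_until.get(task_id, 0) <= now
        )
        leased = available[:max_tasks]
        for _, task_id, _ in leased:
            self._leased_until[task_id] = now + lease_seconds
        return [(task_id, payload) for _, task_id, payload in leased]

    def delete(self, task_id):
        self._tasks.pop(task_id, None)
        self._leased_until.pop(task_id, None)

    def __len__(self):
        return len(self._tasks)


def drain(queue, handler, now, batch_size=1000, lease_seconds=60):
    """Lease one batch, run handler on each task, and delete only on success."""
    processed = []
    for task_id, payload in queue.lease(batch_size, lease_seconds, now):
        try:
            handler(payload)
        except Exception:
            continue  # not deleted: the task can be leased again once the lease expires
        queue.delete(task_id)
        processed.append(payload)
    return processed


def handler(payload):
    if payload == "bad":
        raise ValueError("simulated failure")


queue = InMemoryPullQueue()
queue.add("second", eta=20)
queue.add("first", eta=10)
queue.add("bad", eta=30)

print(drain(queue, handler, now=100))  # ['first', 'second']
print(len(queue))                      # 1
```

The failed task stays in the queue and becomes claimable again after its lease runs out, so nothing is lost if the consumer crashes halfway through a batch.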

If we were to use a pull queue, flow services would need to know when it needs to query each FLOW instance. Continuous polling would not work, as it would keep all FLOW instances awake 24 hours a day, which is not necessary (and will cost money). One option might be to notify flow services when new events are available, so it starts reading them. The good thing about a pull queue is that flow services has to explicitly read each event and delete it before it counts as processed, which makes the system more robust, I guess.
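A rough sketch of the notification idea, again in plain Python with made-up names (`FlowServicesPoller`, `notify`, `run_once`, `fetch_events`, and the instance ids are all assumptions). The point is only that instances that did not announce new events are never contacted, so they are not kept awake.

```python
class FlowServicesPoller:
    """Hypothetical consumer that only contacts instances that announced new events."""

    def __init__(self, fetch_events):
        # fetch_events(instance_id) -> list of events; stands in for claim + delete.
        self._fetch_events = fetch_events
        self._pending = set()
        self.requests_made = 0

    def notify(self, instance_id):
        """Called when a FLOW instance reports that new events are available."""
        self._pending.add(instance_id)

    def run_once(self):
        """Read from every notified instance; idle instances are not touched."""
        results = {}
        for instance_id in sorted(self._pending):
            self.requests_made += 1
            results[instance_id] = self._fetch_events(instance_id)
        self._pending.clear()
        return results


# Fake per-instance queues, for illustration only.
queues = {"instance-a": ["e1", "e2"], "instance-b": [], "instance-c": ["e3"]}


def fetch_events(instance_id):
    events, queues[instance_id] = queues[instance_id], []
    return events


poller = FlowServicesPoller(fetch_events)
poller.notify("instance-a")
poller.notify("instance-c")
print(poller.run_once())     # {'instance-a': ['e1', 'e2'], 'instance-c': ['e3']}
print(poller.requests_made)  # 2
```

With three instances and two notifications, only two requests go out; `instance-b` is left alone.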