IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Operator to receive live changes from databases through InfoSphere CDC #56

Closed fketelaars closed 9 years ago

fketelaars commented 9 years ago

Jerome Chailloux and I wrote a custom operator to handle database changes replicated through InfoSphere Data Replication CDC. The operator accepts the record changes (inserts, updates, deletes) replicated by the CDC product and makes them available as tuples in a Streams application, for example to keep a reference table up to date or to generate events based on database changes. An article describing the functionality can be found here: http://www.ibm.com/developerworks/data/library/techarticle/dm-1408streams-livedata/index.html.

The operator we wrote has been used in a POC for a customer in Europe and is currently being deployed in their application. We're also working on additional customer situations requiring this operator.

mikespicer commented 9 years ago

I recall that Jerome had proposed some CDC-specific operators in the past which overlapped in functionality with the messaging toolkit operators. In that case we extended the existing messaging toolkit operators to handle what CDC needed. Can you please summarise what this toolkit contains (with spldoc if you have it) and how it differs from existing toolkits?

fketelaars commented 9 years ago

You are correct that the extended messaging toolkit provides integration between CDC and a Streams application. It works from a functional perspective and when transaction volume is relatively low. When using message queues (JMS), CDC converts the database changes into XML messages and then places them on the queue. The maximum throughput we have been able to achieve this way was ~1,500 operations per second.

There are a couple of other approaches we have validated:

The first option delivers high throughput and low latency. The main drawback is that if the TCP/IP communication between the CDC engine and the Streams application is interrupted, transactions will get lost.

The second option also provides sufficient throughput but latency will be higher. Latency can be lowered by switching the flat files more often (1 second is the minimum) but throughput will then be significantly lower.

When designing the CDC toolkit we have looked at a number of aspects:

We have defined two operators: CDCSource and CDCParse.

CDCSource listens on a TCP/IP port for incoming "CDC" connections. The operator receives all data changes as well as commit operations and "handshakes". The main purpose of the handshake is to confirm to CDC that the operations have been received, so that the bookmark CDC keeps can be advanced and committed. If the handshake is not successful, the CDC side does not advance the bookmark and replication stops. This ensures that no transactions get lost in case of a communications issue between CDC and the Streams application. In a future release, we are also thinking about using the handshake to provide end-to-end resilience, but this requires more investigation.
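To illustrate the handshake idea described above, here is a minimal in-memory sketch in Python. It is not the operator's code and does not reflect the actual CDC wire protocol. The names Sender, replicate and flaky_receive are hypothetical, and the batch size and log contents are made up. It only shows the principle that the bookmark moves forward after a successful handshake and stays put otherwise, so a later run resumes from the last confirmed position.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Hypothetical stand-in for the CDC side: a change log plus a bookmark."""
    log: list
    bookmark: int = 0  # index of the first operation not yet confirmed

    def next_batch(self, size):
        # Operations are always read starting at the bookmark.
        return self.log[self.bookmark:self.bookmark + size]

    def handshake(self, batch, acknowledged):
        if not acknowledged:
            return False  # bookmark is not advanced; replication stops
        self.bookmark += len(batch)
        return True


def replicate(sender, receive, batch_size=2):
    """Send batches until the log is drained or a handshake fails."""
    while sender.bookmark < len(sender.log):
        batch = sender.next_batch(batch_size)
        acknowledged = receive(batch)
        if not sender.handshake(batch, acknowledged):
            return False
    return True


# Hypothetical receiver that fails on its second call, then recovers.
received = []
calls = {"n": 0}


def flaky_receive(batch):
    calls["n"] += 1
    if calls["n"] == 2:
        return False  # simulated communications failure; batch is dropped
    received.extend(batch)
    return True


sender = Sender(log=["insert A", "update A", "delete B", "insert C", "commit"])
first = replicate(sender, flaky_receive)   # stops at the failed handshake
second = replicate(sender, flaky_receive)  # resumes from the bookmark
```

In this sketch the unconfirmed batch is sent again on the second run because the bookmark never moved past it, so the receiver ends up with every operation exactly once and in order.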

CDCParse parses the tuples coming from the CDCSource operator and extracts the columns specified in the output port. From a development/usability perspective, CDCParse reduces the chance of errors. Source tables can sometimes be very wide (300+ columns, resulting in 600+ tuple fields). CDCParse reads the CDC configuration and extracts only those columns which have been defined in the output tuple. Additionally, if the structure of the replicated table changes and the columns that Streams uses are not affected, the Streams application may not have to be updated or even stopped.
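The column extraction can be pictured with a small Python sketch. This is an illustration only, not CDCParse itself: the function name project, the column names and the dictionary representation of a record are all assumptions made for the example. It shows why a consumer that names only the columns it needs is unaffected when unrelated columns are added to a wide source table.

```python
def project(record, output_columns):
    """Keep only the columns named in the output schema, in schema order."""
    missing = [c for c in output_columns if c not in record]
    if missing:
        # A column the application depends on has disappeared from the source.
        raise KeyError(f"columns missing from source record: {missing}")
    return {c: record[c] for c in output_columns}


# Hypothetical wide source record: 300 filler columns plus two of interest.
wide = {f"COL{i}": i for i in range(300)}
wide.update({"CUST_ID": 42, "STATUS": "ACTIVE"})

output_schema = ["CUST_ID", "STATUS"]  # assumed output tuple definition
before = project(wide, output_schema)

# The source table gains a column the application does not use.
wide["NEW_COL"] = "x"
after = project(wide, output_schema)
```

Here before and after are identical, which mirrors the point that a structural change outside the selected columns need not touch the consuming application. Removing a selected column, by contrast, raises an error in this sketch.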

In a customer test, we have seen throughput rates of 28k operations per second when using the above operators (most CDC implementations require a much lower throughput rate).

mikespicer commented 9 years ago

+1 for adding this toolkit

rrea commented 9 years ago

+1 for adding this toolkit - a great additional way to acquire 'events' based on database transactions. Thank you, Frank!

chanskw commented 9 years ago

+1

chanskw commented 9 years ago

Created this repository: https://github.com/IBMStreams/streamsx.cdc. fketelaars and Jerome are the initial committers.