IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposal for a new project: streamsx.rstm #69

Closed rzbhatti closed 7 years ago

rzbhatti commented 9 years ago

Introduction

This proposal is for a lightweight text search and match toolkit for high-performance streaming applications. The toolkit uses a reduced state transition matrix (rstm) technique (described below) for scalable, dynamic runtime operation. It is relevant to most common enterprise-scale, real-time Big Data streaming applications, such as cyber security, telco, digital content delivery, and log analytics.

Motivation

In a typical scenario, a stream containing a text (rstring) field is filtered, searched, or matched against a set of target patterns. The set may contain one or a mixture of the following:

  1. Partial keywords
  2. Begin patterns
  3. End patterns
  4. General regular expressions
  5. Binary data patterns

A simple loop implementation of partial string or regular expression matching over a set of target patterns is clearly not a scalable solution: the cost of each match grows with the size of the target set. A Streams operator implemented this way and working with a large set of target patterns could easily become a bottleneck. If, instead, the set of target patterns is pre-compiled into a custom primitive Streams operator, then adding a pattern to or removing one from the operator may not be possible at run time.

Proposed Streams Toolkit based on “rstm”

In the proposed solution, the given set of target patterns is compiled into a reduced state transition matrix (rstm) binary. This rstm binary is dynamically loadable by the rstm runtime scanner. The performance of the run time remains independent of the size of the set of target patterns. The toolkit provides the following:

  1. Compiler script to generate an rstm binary file from a given set of target patterns
    • uses ragel, an open source finite state machine compiler, to compile the rstm tables
  2. A C++ primitive RSTM operator: the core rstm runtime implementation
  3. SPL composites that use the core rstm operator in different configurations:
    • RstmKeywords
    • RstmRegex
    • RstmBeginPattern
    • RstmEndPattern
    • RstmBinaryPattern

Understanding “rstm”

A given set of target patterns is compiled into a deterministic finite automaton (DFA) using a finite state machine (FSM) compiler (like re2c or ragel). The FSM is then converted into an rstm binary for dynamic loading and high-performance runtime data scanning.
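To make the idea concrete, here is a minimal Python sketch of table-driven scanning. It is not the toolkit's code, and it does not use ragel or the "reduced" matrix form, neither of which is described in this proposal. As an assumption for illustration, it builds a dense DFA for plain keywords with the standard Aho-Corasick construction; the names build_dfa and scan are hypothetical. The point it shows is that the scanner does one table lookup per input byte, no matter how many patterns were compiled into the table.

```python
from collections import deque


def build_dfa(patterns):
    """Compile keyword patterns into a dense byte-level transition table.

    Returns (table, outputs), where table[state][byte] is the next state and
    outputs[state] lists the indices of the patterns that end in that state.
    """
    # Step 1: build a trie of all patterns.
    goto = [{}]
    outputs = [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for b in pat.encode():
            if b not in goto[state]:
                goto.append({})
                outputs.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        outputs[state].append(idx)

    # Step 2: breadth-first pass that fills in every (state, byte) entry,
    # so the scanner never has to backtrack.
    table = [[0] * 256 for _ in goto]
    fail = [0] * len(goto)
    queue = deque()
    for b, nxt in goto[0].items():
        table[0][b] = nxt
        queue.append(nxt)
    while queue:
        state = queue.popleft()
        for b in range(256):
            if b in goto[state]:
                nxt = goto[state][b]
                # Longest proper suffix of nxt that is also a trie prefix.
                fail[nxt] = table[fail[state]][b]
                outputs[nxt] = outputs[nxt] + outputs[fail[nxt]]
                table[state][b] = nxt
                queue.append(nxt)
            else:
                table[state][b] = table[fail[state]][b]
    return table, outputs


def scan(table, outputs, text):
    """Scan text in a single pass; return (pattern index, end offset) pairs."""
    state = 0
    hits = []
    for offset, b in enumerate(text.encode()):
        state = table[state][b]  # one lookup per byte
        for idx in outputs[state]:
            hits.append((idx, offset))
    return hits


patterns = ["google.com", "ibm.com", "com"]
table, outputs = build_dfa(patterns)
hits = scan(table, outputs, "www.ibm.com")
```

Here hits contains one entry for "ibm.com" and one for "com", both ending at byte offset 10. Adding more patterns makes the table larger but leaves the per-byte work in scan unchanged, apart from reporting the matches themselves.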

Example

Whitelisting DNS requests in a malicious URL detection application. The top 500 domain names from Alexa.com are used to create a set of whitelisted domains. The domain field (host) of the incoming TDR tuples is matched against this set.

namespace sample ;
use com.ibm.streamsx.rstm::RstmRegexMatch ;

type testDataType = 
    rstring protocol,
    rstring ip,
    rstring dest_port,
    rstring source_port,
    rstring user_agent,
    rstring host,
    rstring uri;

type testMatchOutputType = 
    rstring protocol,
    rstring ip,
    rstring dest_port,
    rstring source_port,
    rstring user_agent,
    rstring host,
    rstring uri,
    boolean matchFound,
    list<int32> matchIndices,
    list<int32> matchOffsets,
    list<rstring> matchTargets;

composite regexAlexaWhiteList
{
    graph
        // Read the HTTP metadata tuples from a CSV file.
        (stream<testDataType> httpMetaData) as FSource = FileSource()
        {
            param
                format : csv ;
                file : "testData.csv" ;
        }

        // Match the host attribute against the compiled whitelist patterns.
        (stream<testMatchOutputType> httpMetaDataOut) as RSTM = RstmRegexMatch(httpMetaData)
        {
            param
                targetString : host ;
                rstmFileName : "alexaWhiteList.rstm.bin" ;
                regexFileName : "alexaWhiteList.pcre" ;
                rstmUpdateCheckInterval : 5.0 ;
            output
                httpMetaDataOut : matchFound = matchFound(), matchIndices = matchIndices(),
                    matchTargets = matchTargets(), matchOffsets = matchOffsets() ;
        }

        // Write the annotated tuples to a CSV file.
        () as Sink = FileSink(httpMetaDataOut)
        {
            param
                file : "outPutFile.csv" ;
                format : csv ;
        }
}

hildrum commented 9 years ago

+1

leongor commented 9 years ago

+1

ddebrunner commented 9 years ago

ragel is GPL code, though it seems the generated output is not covered by GPL.

Will the toolkit include any GPL code or depend on any libraries from ragel?

rzbhatti commented 9 years ago

No. The toolkit will be using our own code for dynamic loading of the rstm tables and run-time scanning.

ddebrunner commented 9 years ago

It's possible that the compiler script could be seen as GPL. Is there a need for a script to run ragel, can't the user just do that?

rzbhatti commented 9 years ago

The set of target patterns is provided by the user in a file, e.g. user_pattern.set. The script does the following: (1) produces pattern.rl, a ragel input file, from user_pattern.set; (2) runs ragel over pattern.rl to produce state transition tables; (3) packs the transition tables into a dynamically loadable binary format, rstm.bin.

The run time loads rstm.bin and scans the incoming string using our own code.
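As an illustration of step (3) and of the load at run time, the Python sketch below packs a transition table into a flat binary blob and reads it back. The layout is purely hypothetical (a 4-byte signature, two 32-bit counts, then one row of 32-bit entries per state); the actual rstm.bin format is not described in this thread, and pack_table and unpack_table are made-up names.

```python
import struct

MAGIC = b"RSTM"        # hypothetical signature, not the toolkit's real format
HEADER_FMT = "<4sII"   # signature, number of states, symbols per state


def pack_table(table):
    """Serialize a list of equal-length rows of next-state numbers."""
    n_states = len(table)
    n_symbols = len(table[0])
    header = struct.pack(HEADER_FMT, MAGIC, n_states, n_symbols)
    row_fmt = "<%dI" % n_symbols
    body = b"".join(struct.pack(row_fmt, *row) for row in table)
    return header + body


def unpack_table(blob):
    """Rebuild the transition table from a blob written by pack_table."""
    magic, n_states, n_symbols = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != MAGIC:
        raise ValueError("not a packed transition table")
    row_fmt = "<%dI" % n_symbols
    row_size = struct.calcsize(row_fmt)
    start = struct.calcsize(HEADER_FMT)
    return [
        list(struct.unpack_from(row_fmt, blob, start + i * row_size))
        for i in range(n_states)
    ]


# Round trip on a tiny two-state table over 256 byte values.
demo_table = [[0] * 256, [1] * 256]
demo_table[0][ord("a")] = 1
blob = pack_table(demo_table)
restored = unpack_table(blob)
```

Because the blob is plain data, a scanner could in principle swap in a newly compiled table without being recompiled, which is the property the proposal relies on for adding and removing patterns at run time.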

mikespicer commented 9 years ago

I certainly like the idea of the toolkit but want to confirm that you have appropriate approvals from IBM to release the code and are handling interactions with GPL code appropriately.

rzbhatti commented 9 years ago

Kris suggested that I clarify here that the toolkit has a library containing the rstm runtime scanner core, for which the source will not be provided.

engebret commented 8 years ago

+1

ddebrunner commented 8 years ago

@rzbhatti Can you expand on the comment that the toolkit "has a library that contains the rstm runtime scanner core for which the source will not be provided"?

Who produces that library, and is it GPL? If it's GPL, the source has to be made available.

chanskw commented 8 years ago

@rzbhatti - the last we discussed, we are checking on the licensing terms to make sure that we can contribute this code.

Can you pls give us an update on what the status is with this proposal and if you still want to move forward?

Thanks!

chanskw commented 8 years ago

@rzbhatti wondering if you have an update here?

Are you still interested in contributing this project?

Thanks

chanskw commented 7 years ago

Closing due to lack of activity. Please reopen if you are still interested. Thanks.