Closed rzbhatti closed 7 years ago
+1
+1
ragel is GPL code, though it seems the generated output is not covered by GPL.
Will the toolkit include any GPL code or depend on any libraries from ragel?
No, the toolkit will use our own code for dynamic loading of the rstm tables and run-time scanning.
It's possible that the compiler script could be seen as GPL. Is there a need for a script to run ragel? Can't the user just do that?
The set of target patterns is provided by the user in a file, e.g. user_pattern.set. The script does the following: (1) produces "pattern.rl", a ragel input file, from user_pattern.set; (2) runs ragel over pattern.rl to produce state transition tables; (3) packs the transition tables into a dynamically loadable binary format, rstm.bin.
The runtime loads rstm.bin and scans the incoming string using our own code.
I certainly like the idea of the toolkit but want to confirm that you have appropriate approvals from IBM to release the code and are handling interactions with GPL code appropriately.
Kris suggested that I clarify here that the toolkit has a library containing the rstm runtime scanner core, for which the source will not be provided.
+1
@rzbhatti Can you expand on the comment that the "toolkit has a library that contains the rstm runtime scanner core for which the source will not be provided"?
Who produces that library, and is it GPL? If it's GPL, the source has to be made available.
@rzbhatti - the last we discussed, we are checking on the licensing terms to make sure that we can contribute this code.
Can you pls give us an update on what the status is with this proposal and if you still want to move forward?
Thanks!
@rzbhatti wondering if you have an update here?
Are you still interested in contributing this project?
Thanks
Closing due to lack of activity. Please reopen if you are still interested. Thanks.
Introduction
A lightweight text search and match toolkit for high-performance streaming applications is proposed here. The toolkit uses a reduced state transition matrix (rstm) technique (described below) for scalable and dynamic runtime operation. It is pertinent to most common enterprise-scale, real-time Big Data streaming applications, such as cyber security, telco, digital content provider, and log analytics applications.
Motivation
In a typical scenario, a stream containing a text (rstring) field is filtered, searched, or matched against a set of target patterns. The set of patterns may be one or a mixture of the following:
A simple loop implementation of partial string or regular expression matching over a set of target patterns is clearly not a scalable solution. The performance of this sort of operation becomes highly dependent on the size of the target set of patterns. A Streams operator implemented like this, working with a large set of target patterns, could easily become a bottleneck. If, instead, the given set of target patterns is pre-compiled into a custom primitive Streams operator, then adding a pattern to or removing one from the pre-compiled operator may not be possible at run time.
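To make the scaling concern concrete, here is a minimal Python sketch of the simple loop approach described above. It is only an illustration: the function name `naive_match` and the three sample patterns are hypothetical and are not part of the proposed toolkit.

```python
import re

def naive_match(text, compiled_patterns):
    """Return the indices of all patterns that match, trying each one in turn."""
    hits = []
    for i, pat in enumerate(compiled_patterns):
        # One regex search per pattern: the work done per tuple grows
        # with len(compiled_patterns).
        if pat.search(text):
            hits.append(i)
    return hits

# Hypothetical pattern set, for illustration only.
patterns = [re.compile(p) for p in (r"example\.com$", r"^mail\.", r"\d{3}")]
print(naive_match("mail.example.com", patterns))  # [0, 1]
```

Every incoming tuple pays for one search per pattern, so a set of thousands of patterns means thousands of searches per tuple.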
Proposed Streams Toolkit based on “rstm”
In the proposed solution, the given set of target patterns is compiled into a reduced state transition matrix (rstm) binary. This rstm binary is dynamically loadable by the rstm runtime scanner. The performance of the runtime remains independent of the size of the set of target patterns. The toolkit provides the following:
Understanding “rstm”
A given set of target patterns is compiled into a deterministic finite automaton (DFA) using a finite state machine (FSM) compiler (like re2c or ragel). The FSM is converted into an rstm binary for dynamic loading and high-performance runtime data scanning.
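The general idea of a table-driven scanner with a loadable binary can be sketched in Python as follows. This is not the rstm format or the toolkit's code: the binary layout, the dense (unreduced) table, and the names `build_dfa`, `pack`, `load`, and `scan` are all assumptions made for illustration, and the DFA here handles only literal whole-string patterns rather than the output of re2c or ragel.

```python
import struct

ALPHABET = 256  # one column per byte value
DEAD = 0        # state 0 is a sink: no pattern can match any more
START = 1       # scanning always begins in state 1

def build_dfa(patterns):
    """Build a dense transition table accepting exactly the given byte strings."""
    table = [[DEAD] * ALPHABET, [DEAD] * ALPHABET]  # rows for DEAD and START
    accept = [0, 0]
    for pat in patterns:
        state = START
        for byte in pat:
            if table[state][byte] == DEAD:
                # Allocate a new state for this unseen prefix.
                table.append([DEAD] * ALPHABET)
                accept.append(0)
                table[state][byte] = len(table) - 1
            state = table[state][byte]
        accept[state] = 1
    return table, accept

def pack(table, accept):
    """Serialize to a hypothetical layout: uint32 state count, uint16 rows, accept flags."""
    blob = struct.pack("<I", len(table))
    for row in table:
        blob += struct.pack("<%dH" % ALPHABET, *row)  # limits the DFA to 65535 states
    return blob + bytes(accept)

def load(blob):
    """Inverse of pack(): return the flat transition table and the accept flags."""
    (n,) = struct.unpack_from("<I", blob, 0)
    flat = struct.unpack_from("<%dH" % (n * ALPHABET), blob, 4)
    return flat, blob[4 + 2 * n * ALPHABET:]

def scan(flat, accept, data):
    """One table lookup per input byte, regardless of how many patterns were compiled."""
    state = START
    for byte in data:
        state = flat[state * ALPHABET + byte]
        if state == DEAD:
            return False
    return bool(accept[state])

table, accept = build_dfa([b"example.com", b"example.org"])
flat, acc = load(pack(table, accept))
print(scan(flat, acc, b"example.org"))  # True
print(scan(flat, acc, b"example.net"))  # False
```

Because the scanner only reads the loaded table, swapping in a newly packed blob changes the pattern set without recompiling the scanner, which is the property the proposal relies on.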
Example
Whitelisting DNS requests in a malicious URL detection application. The top 500 domain names from Alexa.com are used to create a set of whitelisted domain URLs. The domain field of the incoming TDR tuples is matched against this set.