SmartParser toolkit

leongor commented 9 years ago

One recurring task in every PoC/project I've done with Streams is data ingestion. Yes, CSV is very popular and is a good choice for small PoTs, but in the field many customers and products have their own proprietary formats, and integrating Streams with them can take 20-30% of the time of an entire PoC. Another issue: CSV produces a flat tuple format, so if the data consists of 100 fields, the tuple has to be defined with 100 attributes. The SmartParser toolkit is meant to ease parsing, tuple structure definition, and the tuple mapping development steps. The toolkit allows parsing custom formats into the desired hierarchical tuple (including lists, maps and even sets), saving the common step of mapping the flat format to the required Streams tuple.

Example 1: Custom data format - fatherName, motherName, childName* (0 or more children). Streams tuple format - tuple< rstring fatherName, rstring motherName, list<rstring> childs >

Example 2: Custom data format - key1 : value1, key2 : value2, key3 : value3. Streams tuple format - tuple< map<rstring,rstring> keyValue >

Example 3: Custom data format - personal data: \n business data: Streams tuple format - tuple< tuple personalData, tuple businessData >
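To make the target shapes of Examples 1 and 2 concrete, here is a minimal sketch in plain Python rather than SPL. It is not the toolkit's implementation: the names `Family`, `parse_family` and `parse_key_values` are hypothetical, and the comma and colon separators are simply taken from the examples above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Family:
    # Mirrors the Example 1 tuple: fatherName, motherName and a list of children.
    fatherName: str
    motherName: str
    childs: List[str] = field(default_factory=list)


def parse_family(line: str) -> Family:
    """Example 1: two fixed fields followed by zero or more child names."""
    parts = [p.strip() for p in line.split(",") if p.strip()]
    if len(parts) < 2:
        raise ValueError("expected at least fatherName and motherName")
    # Everything after the first two fields is collected into the list.
    return Family(parts[0], parts[1], parts[2:])


def parse_key_values(line: str) -> Dict[str, str]:
    """Example 2: comma-separated 'key : value' pairs collected into a map."""
    result: Dict[str, str] = {}
    for pair in line.split(","):
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {pair!r}")
        result[key.strip()] = value.strip()
    return result
```

The point of the sketch is that the variable-length part of the input ends up in one list or map attribute instead of being spread over many flat attributes.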

hildrum commented 9 years ago

+1

leongor commented 9 years ago

I have an alpha version ready. There are a lot of use cases to cover, so I'll be glad to get feedback and proposals.

hildrum commented 9 years ago

Do you have something describing the interface?

leongor commented 9 years ago

SmartParser expects one of the attributes on its input port to be a blob or an rstring. The output port schema defines how the input is parsed.

These are the parameters:

batch param: boolean - whether one tuple or multiple tuples are expected in the input
delimiter param: rstring - the delimiter between the fields
skipper param: enum (none, blank, endl, whitespace) - which characters (such as spaces or newlines) to skip
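As a rough illustration of how these three parameters might interact, here is a plain Python sketch. It is an interpretation, not the operator's code: the character sets chosen for each skipper value, the assumption that a batch holds one record per line, and the names `Skipper`, `split_fields` and `parse` are all hypothetical. Skipping is approximated by trimming the skipped characters around each field.

```python
from enum import Enum
from typing import List


class Skipper(Enum):
    # Assumed character sets for each skipper value.
    NONE = ""
    BLANK = " \t"
    ENDL = "\r\n"
    WHITESPACE = " \t\r\n"


def split_fields(record: str, delimiter: str, skipper: Skipper) -> List[str]:
    """Split one record on the delimiter and trim the skipped characters."""
    raw_fields = record.split(delimiter)
    if not skipper.value:
        return raw_fields
    return [f.strip(skipper.value) for f in raw_fields]


def parse(
    data: str,
    delimiter: str = ",",
    skipper: Skipper = Skipper.NONE,
    batch: bool = False,
) -> List[List[str]]:
    """Parse one record, or several when batch is True (assumed one per line)."""
    records = data.splitlines() if batch else [data]
    return [split_fields(r, delimiter, skipper) for r in records if r]
```

For instance, `parse("a ; b;c", delimiter=";", skipper=Skipper.BLANK)` yields a single record with the blanks around each field removed, while `batch=True` turns each input line into its own record.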

Additionally, there are a couple of custom output functions, such as CustomParser and CustomMapParser, that help customize the parsing of a specific field.

ddebrunner commented 9 years ago

+1, though I'm not keen on the name SmartParser; "Smart" doesn't really tell me what it does. Note also issue #30, which indicates a direction of putting parsers in specific toolkits, but I'm not sure how to describe the formats it is handling: "flexible", "arbitrary", "proprietary"?

Will there be a matching Format function as well?

leongor commented 9 years ago

I was thinking of something like smart pointers: the parser deduces the parsing logic from the tuple schema itself. It's not final, of course, and is open for discussion.
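The idea of deducing the parser from the schema can be sketched in plain Python by walking type annotations. This is only an analogy to the SPL tuple schema, not how the toolkit works internally: `Person`, `parse_value` and `parse_record` are hypothetical names, and the separators (`,` between attributes, `;` between collection items, `=` between map key and value) are assumptions. Nested tuples are left out to keep it short.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, get_args, get_origin, get_type_hints


def parse_value(text: str, schema: Any) -> Any:
    """Pick the parsing rule from the declared type of the attribute."""
    origin = get_origin(schema)
    if origin is list:
        (item_type,) = get_args(schema)
        return [parse_value(t, item_type) for t in text.split(";") if t.strip()]
    if origin is dict:
        key_type, value_type = get_args(schema)
        out = {}
        for pair in text.split(";"):
            if pair.strip():
                key, _, value = pair.partition("=")
                out[parse_value(key, key_type)] = parse_value(value, value_type)
        return out
    if schema is str:
        return text.strip()
    # Primitive types such as int or float convert the trimmed text directly.
    return schema(text.strip())


def parse_record(line: str, schema: type) -> Any:
    """Build an instance of the schema class from one delimited line."""
    hints = get_type_hints(schema)
    names = [f.name for f in fields(schema)]
    parts = line.split(",", len(names) - 1)
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(parts)}")
    return schema(**{n: parse_value(p, hints[n]) for n, p in zip(names, parts)})


@dataclass
class Person:
    # The schema alone determines how each part of the line is parsed.
    name: str
    age: int
    tags: List[str]
    attrs: Dict[str, str]
```

With this sketch, changing the field types of `Person` changes how the same input line is interpreted, with no separate mapping step.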

leongor commented 9 years ago

About Format - it depends. I'm not sure how valuable it would be, because the output can usually be saved in one of the standard formats.

ddebrunner commented 9 years ago

I guess Format would be needed if there's a requirement to write back to the systems producing the "smart" formats. Can be added when someone has that itch.

leongor commented 9 years ago

Agree.

petenicholls commented 9 years ago

I do not want to create a generic streamsx.parser and I agree with Dan that streamsx.smartparser is not really good either.

Some ideas:

streamsx.blobParser
streamsx.stringParser
streamsx.CSVParser
streamsx.LineParser

We could start the generic toolkit streamsx.parser but this will simply force some sort of namespace attempt to categorize this anyway.

If we fail to come up with anything good for a name I will select a default one in order to proceed with getting this function. We can always rename later if someone has a good idea.

leongor commented 9 years ago

Some variants:

AutoParser
GenericParser
UniversalParser

hildrum commented 9 years ago

None of these cases are about parsing blobs, so I vote against "blobParser", and it's not exactly CSV, so I vote against "CSVParser". Likewise, "LineParser" implies the data arrives as one line, which it does not in at least one example. I don't like "UniversalParser" because it's not universal, and I don't like "GenericParser" because it's not generic.

How about something like StructuredTextParser? Or maybe ComplexTupleReader or ComplexTextParser?

leongor commented 9 years ago

I'm not sure - it's mainly for text parsing, but it can be used to parse binaries too. BTW, what about AutoParser?

hildrum commented 9 years ago

Well, we need a name, and I don't have any strong objection to "autoparser", but why is it more automatic than the standard toolkit Parse? The key difference I see is that it's more flexible and supports a wider range of formats (nested tuples, maps, lists) than the standard parser, so an ideal name would reflect that.

leongor commented 9 years ago

Right.

AgileParser
FlexieParser

ddebrunner commented 9 years ago

FlexieParser or some variant of "flexible" seems good to me.

hildrum commented 9 years ago

I agree with @ddebrunner.

leongor commented 9 years ago

Regarding the new ElasticLoadBalance operator - maybe ElasticParser then?

ddebrunner commented 9 years ago

No to elastic. For the load balancer, elastic has a specific meaning from cloud computing; that's not what a parser is doing.

leongor commented 9 years ago

I've found a good one - TemplateParser. Reason 1: The output tuple is used as a template for building the parser grammar. Reason 2: I'm going to create samples (later we could even build a repository) for existing formats (a kind of ready-made format templates), wrapping the parser in an SPL composite operator. I have already created the first one: an ArcSight (CEF) parser.

What do you think?

Yifat-Yulevich commented 9 years ago

TupleAdaptiveParser or AdaptiveParser

petenicholls commented 9 years ago

I'd like to close on this. I am voting for TemplateParser; the other option, I think, is AdaptiveParser.

leongor commented 9 years ago

I'm already using it in a project as AdaptiveParser, but TemplateParser is OK with me too.

petenicholls commented 9 years ago

Created streamsx.adaptiveParser.

IBMStreams / administration

SmartParser toolkit #50