Closed leongor closed 9 years ago
+1
I have an alpha version ready. There are a lot of use cases to cover, so I'll be glad to get feedback and proposals.
Do you have something describing the interface?
SmartParser expects one of the attributes on its input port to be a blob or rstring. The output port schema defines how the input is parsed.
These are the parameters:
- batch (boolean): whether to expect a single tuple to parse or multiple ones
- delimiter (rstring): the delimiter between the fields
- skipper (enum: none, blank, endl, whitespace): which characters (like spaces or newlines) to skip
Additionally, there are a couple of custom output functions, such as CustomParser and CustomMapParser, that help customize the parsing of a specific field.
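To make the schema-driven idea concrete, here is a minimal Python sketch (purely illustrative - the toolkit itself is an SPL operator, and `parse_record`, its parameters, and the sample data are hypothetical names of mine): the output schema effectively acts as the parsing grammar.

```python
# Hypothetical sketch of schema-driven parsing; none of these names are real toolkit API.
def parse_record(record, schema, delimiter=",", skipper=" \t"):
    """Split `record` on `delimiter`, strip `skipper` characters,
    and name the resulting fields after the attributes in `schema`."""
    fields = [f.strip(skipper) for f in record.split(delimiter)]
    return dict(zip(schema, fields))

# The (output) schema drives the parse:
print(parse_record("John, Mary, 42", ["fatherName", "motherName", "age"]))
# {'fatherName': 'John', 'motherName': 'Mary', 'age': '42'}
```

Changing the schema, delimiter, or skipper changes the parse without touching any parsing code, which is the appeal of deriving the grammar from the tuple type.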
+1, though I'm not keen on the name SmartParser; "Smart" doesn't really tell me what it does. Note also issue #30, indicating a direction of parsers in specific toolkits, but I'm not sure how to describe the formats it handles: "flexible", "arbitrary", "proprietary"?
Will there be a matching Format function as well?
I was thinking of something like smart pointers - the parser deduces the parsing logic from the tuple schema itself. It's not final, of course, and open for discussion.
About Format - it depends. I'm not sure how valuable it is, because the output can usually be saved in one of the standard formats.
I guess Format would be needed if there's a requirement to write back to the systems producing the "smart" formats. Can be added when someone has that itch.
Agree.
I do not want to create a generic streamsx.parser and I agree with Dan that streamsx.smartparser is not really good either.
Some ideas: streamsx.blobParser, streamsx.stringParser, streamsx.CSVParser, streamsx.LineParser
We could start a generic toolkit streamsx.parser, but that would simply force some sort of namespace scheme to categorize things anyway.
If we fail to come up with a good name, I will select a default one in order to proceed with getting this functionality in. We can always rename later if someone has a good idea.
Some variants:
None of these cases are about parsing blobs, so I vote against "blobParser", and it's not exactly CSV, so I vote against "CSVParser". Likewise, "LineParser" implies the data arrives as one line, which it does not, in at least one example. I don't like "universalParser" because it's not universal, and I don't like "GenericParser" because it's not generic.
How about something like StructuredTextParser? Or maybe ComplexTupleReader or ComplexTextParser?
I'm not sure - it's mainly for text parsing, but it can be used to parse binaries too. BTW, what about AutoParser?
Well, we need a name, and I don't have any strong objection to "autoparser", but why is it more automatic than the standard toolkit Parse? The key difference I see is that it's more flexible and supports a wider range of formats (nested tuples, maps, lists) than the standard parser, so an ideal name would reflect that.
Right. AgileParser? FlexieParser?
FlexieParser or some variant of "flexible" seems good to me.
I agree with @ddebrunner.
Regarding the new ElasticLoadBalance operator - maybe ElasticParser then?
No to elastic: for the load balancer, "elastic" has a specific meaning in cloud computing, and that's not what a parser is doing.
I've found a good one - TemplateParser. Reason 1: the output tuple is used as a template for building the parser grammar. Reason 2: I'm going to create samples (later we could even build a repository) for existing formats (a kind of ready-made format templates), wrapping the parser in an SPL composite operator. The first one I have created already - an Arcsight (CEF) parser.
What do you think?
TupleAdaptiveParser or AdaptiveParser
I'd like to close on this... I am voting for TemplateParser; the other option, I think, is AdaptiveParser.
I'm already using it in a project as AdaptiveParser, but TemplateParser is OK for me too.
Created streamsx.adaptiveParser.
One of the recurring tasks in every PoC/project I've worked on with Streams is data ingestion. Yes, the CSV format is very popular, and for small PoTs it's a good choice, but in the field many customers/products have their own proprietary formats, and integrating Streams with them can take 20-30% of an entire PoC. Another issue: CSV produces a flat tuple format, so if the data consists of 100 fields, then the tuple must be defined with 100 attributes.

The SmartParser toolkit eases the parsing, tuple-structure definition, and tuple-mapping development steps. It parses custom formats into the desired hierarchical tuple (including lists, maps, and even sets), saving the common step of mapping a flat format onto the required Streams tuple.
Example 1: Custom data format - fatherName, motherName, childName* (0 or more children). Streams tuple format - tuple< rstring fatherName, rstring motherName, list<rstring> childs >
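A rough Python analogue of Example 1 (a hypothetical helper of mine, not toolkit code) shows the benefit of a hierarchical output: the variable-length tail lands in a list instead of forcing a flat, fixed-width schema.

```python
def parse_family(record, delimiter=","):
    # The first two fields are fixed; any remaining fields form the childs list.
    parts = [p.strip() for p in record.split(delimiter)]
    return {"fatherName": parts[0], "motherName": parts[1], "childs": parts[2:]}

print(parse_family("John, Mary, Ann, Bob"))
# {'fatherName': 'John', 'motherName': 'Mary', 'childs': ['Ann', 'Bob']}
```

A record with no children simply yields an empty childs list, matching the "0 or more" semantics.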
Example 2: Custom data format - key1 : value1, key2 : value2, key3 : value3. Streams tuple format - tuple< map<rstring,rstring> keyValue >
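And a similar sketch for Example 2 (again hypothetical Python of mine, not the SPL API): each key/value pair becomes one entry of the map attribute.

```python
def parse_key_values(record, pair_delim=",", kv_delim=":"):
    # Each "key : value" pair becomes an entry in the keyValue map attribute.
    pairs = (p.split(kv_delim, 1) for p in record.split(pair_delim))
    return {"keyValue": {k.strip(): v.strip() for k, v in pairs}}

print(parse_key_values("key1 : value1, key2 : value2, key3 : value3"))
# {'keyValue': {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}}
```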
Example 3: Custom data format - personal data: \n business data:
Streams tuple format - tuple< tuple personalData, tuple businessData >