IBMStreams / streamsx.json

Toolkit for working with JSON in SPL applications.
http://ibmstreams.github.io/streamsx.json/

Two new native functions to parse JSON #81

Closed leongor closed 7 years ago

leongor commented 7 years ago

Two new native functions to parse JSON, in addition to the existing native functions that format JSON. Updated rapidJSON to the latest version.

chanskw commented 7 years ago

Does this introduce a new dependency on rapidJSON here: https://github.com/miloyip/rapidjson

Can you please give us an example of how to use these functions? Also, are any build changes required to pull in the new dependency?

leongor commented 7 years ago

@chanskw Excellent questions! They show that no one is really aware that three native functions already exist in the toolkit (to format a tuple as JSON), and that the rapidJSON library is already part of the toolkit. There is no description of them in the wiki; only by going directly to the SPLDoc Functions page can you actually discover them. filereadstream.h showed up because I updated rapidJSON to the latest version. This PR was opened to start a discussion about the new functions. SPLDoc will be added once it is decided to accept them.
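
For reference, a minimal sketch of calling one of those existing formatting functions (tupleToJSON is listed in the toolkit's SPLDoc; the exact invocation below, and the input stream SomeStream, are illustrative assumptions):

    use com.ibm.streamsx.json::*;

    // serialize each input tuple to a JSON string
    stream<rstring jsonStr> AsJson = Functor(SomeStream) {
        output AsJson : jsonStr = tupleToJSON(SomeStream);
    }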

chanskw commented 7 years ago

Interesting! I truly had no idea. Need to think about this.

ddebrunner commented 7 years ago

Maybe the PR could be split into one for the rapidJSON update and one for the new functions.

It's hard to review a PR with 38 files changed.

leongor commented 7 years ago

@ddebrunner Good idea. I'll split it.

ddebrunner commented 7 years ago

only by going directly to the SPLDoc Functions page can you actually discover them.

Dumb question, but what else would be expected? A toolkit is documented by its SPLDoc.

rnostream commented 7 years ago

Two points. First, documenting the version of rapidjson being inserted would be good; 'latest version' changes over time. It is 1.1.0.

The second point is more important. As the new functions are meant to generate a flattened tuple output, and as I understand it all of the output attributes will be assigned using these new functions, from my perspective it is more appropriate to have a new operator here. Especially because the JSON parsing should be done only once for performance reasons, as already stated, and an additional native function does not solve this parse-only-once problem. A C++ operator providing single JSON parsing and attribute assignment, either with output functions (JSON path as parameter) or with compile-time code generation of the path-to-attribute assignment from some configuration (file, XML, CSV, ...), is a better way. Isn't it?

leongor commented 7 years ago

@rnostream Agreed about the rapidJSON version. Regarding operator vs. native functions: single parsing/multiple query can be implemented either way. I did both in the streamsx.regex toolkit (the native functions are implemented using lazy init plus multiple calls). I think each approach has its own benefits. If there is a request for such an operator, we can add it later.

leongor commented 7 years ago

I've created a new PR only for the rapidJSON update.

rnostream commented 7 years ago

@leongor, yes, it is correct that it is possible, but it is a question of "what is the Streams way". When using native functions we are at the SPL programming level. Native functions do extend SPL's possibilities by implementing some sort of logic in C++, but at the SPL level one should be able to read the code at that level. Also, having cross-dependencies that are not visible at the SPL level is something I would accept within a single project, but not at a general level. So from my point of view we should be careful about what we provide as a toolkit.

The Streams way is using and configuring operators. I have already seen too much code which tends to be structured programming under the hood of huge Custom operators, which is not the original way. JSON value extraction and attribute assignment is a typical use of an operator transforming an input into an output; there is no further logic here than extracting and assigning. An operator (e.g. ExtractJSONPathToTuple) would be directly usable by others, and from a toolkit perspective it is the generic solution we want to have. A primitive operator may require more effort, but even with respect to automated SPL application generation a primitive operator is the better way (OK, you may have a solution for generating the appropriate SPL code, but not everyone does). A generic operator which gets a configuration input at compile time for the attribute-to-JSON-path mapping would also be more ideal for you, wouldn't it?

Don't get me wrong: I do see the use of extracting values from a JSON string by a given path at the SPL level, and I would like to use it, but not for pure input-to-output tuple transformation.
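
For illustration, a quick sketch of how such an operator could look when invoked (ExtractJSONPathToTuple, its parameter, and the JsonPath output function are hypothetical names, as suggested above, not an existing API):

    stream<TestTuple> Parsed = ExtractJSONPathToTuple(BStream) {
        param inputAttribute : jsonStr;       // attribute holding the JSON string
        output Parsed :
            str = JsonPath("/root/str"),      // path-to-attribute assignment,
            i = JsonPath("/root/i");          // all resolved in a single parse
    }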

leongor commented 7 years ago

@rnostream I agree that a primitive operator has many benefits, but there are some tasks that cannot be done with one and require native functions, for example calling a function multiple times in a loop, or calling it inside some SPL expression. That's why I implemented both approaches in streamsx.regex, and looking at the projects that use it for some statistics: the regexMatch native functions are used everywhere, while the RegexMatch operator is not used at all!
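
For example, a hypothetical sketch (using the proposed functions as they appear in the sample later in this thread) that probes alternative paths for the same logical field, something a fixed operator output clause cannot express:

    uint32 rc = parseJSON(jsonStr, JsonIndex._1);
    mutable rstring name = "";
    // try several candidate paths and keep the first non-empty hit
    for (rstring path in ["/root/name", "/root/fullName", "/root/user/name"]) {
        name = queryJSON(path, "", JsonIndex._1);
        if (name != "")
            break;
    }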

brandtol commented 7 years ago

A few suggestions/questions :

1) How do you signal parse errors to the user? I don't see return codes for this. I would suggest adding an error argument to the queryJSON functions, and a list<rstring> parameter to extractFromJSON() to capture error messages coming up during parsing.

2) It seems you use ValueHandle::getMetaType() to determine attribute types at runtime. I guess that uses C++ RTTI and may hurt performance. This could be avoided with a C++ primitive operator, by examining the Streams schema at compile time, I think?

3) When using the approach of two primitive functions for single parse/multiple query: where do you store the result of the parsing operation? As far as I know, SPL does not allow application- or even PE-scoped variables (apart from DPS or shared memory). One option would be to have the parse function return the preparsed DOM and use it in the query function?

    blob dom1 = parseJSON(jsonString1);           // jsonString1 : rstring
    rstring name = queryJSON(dom1, "/root/name");
    int32 age = queryJSON(dom1, "/root/age");
    ...
leongor commented 7 years ago

@brandtol

  1. extractFromJSON() always succeeds, trying to find as many attributes as possible in the JSON. There is a new function I've added to be used with queryJSON() - parseJSON() - which returns an error code.

  2. Yes, you're right, and that's what I have always done until now (streamsx.adaptiveParser, streamsx.mongoDB, streamsx.thrift). But here the goal is different. If you create an operator and generate the code at compile time, you iterate over the tuple schema and search for the tuple attributes inside the JSON, so you must first parse the whole JSON with a DOM parser. extractFromJSON(), on the other hand, uses a SAX parser, so each JSON attribute comes in as an event and is then looked up in the tuple. If you have a huge JSON but need only 5 attributes, the parsing is over as soon as the tuple is filled! (See the extractFromJSON sketch after the sample below.)

  3. Here is a sample using the parseJSON() function:

    type TestTuple = boolean t, rstring str, int32 i, int32 i2, uint32 n, float64 pi;

    stream<rstring jsonStr> BStream = Beacon() {
        param iterations : 1;
        output BStream : jsonStr = "{\"root\" : { \"str\" : \"Hello world\", \"t2\" : true, \"i\" : 123, \"pi\" : 3.1416 }}";
    }

    stream<TestTuple> Parsed = Functor(BStream) {
        logic
            onTuple BStream : {
                uint32 rc = parseJSON(jsonStr, JsonIndex._1);
                println(rc);
            }
        output Parsed :
            t = queryJSON("/root/t2", false, JsonIndex._1),
            str = queryJSON("/root/str", "", JsonIndex._1),
            i = queryJSON("/root/i", 0, JsonIndex._1),
            i2 = queryJSON("/root/i2", 0, JsonIndex._1),
            n = queryJSON("/root/n", 42u, JsonIndex._1),
            pi = queryJSON("/root/pi", 3.14, JsonIndex._1);
    }
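
For comparison, a minimal sketch of the SAX-based extractFromJSON() path described in point 2 above (the exact signature is an assumption modeled on that description, not taken from the toolkit's SPLDoc):

    // fill a tuple directly from the JSON string; attributes missing
    // from the JSON keep their default values
    stream<TestTuple> Extracted = Custom(BStream) {
        logic
            onTuple BStream : {
                mutable TestTuple out;          // default-initialized
                extractFromJSON(jsonStr, out);  // SAX parsing stops once the tuple is filled
                submit(out, Extracted);
            }
    }
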
brandtol commented 7 years ago

Regarding point 3) above

So the DOMs are stored in PE heap memory, and each PE thread can have up to 20 pointers to different DOMs via the JsonIndex enum, which serves as a sort of handle to the parsed JSON string?

I think with that approach careful documentation is needed, stating that the parse/query functions must not reside in different PEs or different threads. Also, the queryJSON() functions should return an error if the JsonIndex cannot be found, instead of creating a new empty DOM and returning default values.

brandtol commented 7 years ago

Regarding point 1) above

a) I don't see how extractFromJSON() can always succeed. Say you have a JSON attribute named age that holds the string "hello" and a Streams attribute named age of type int32. Of course the conversion will fail. As far as I can see, the age attribute in the returned tuple will get its default value, maybe 0. In that case the user cannot distinguish between a valid value of 0 coming from the JSON input and a parse error.

b) There is a similar problem for the queryJSON() functions. Assigning the default value in case of parse/conversion errors is OK, but the user will not know what happened. Especially if queryJSON could not find a preparsed JSON string/DOM for the given index, the user will not notice that they have a serious problem somewhere...?

brandtol commented 7 years ago

Regarding point 2) above

With a SAX-style parser you could also generate the "parseJsonEvent" function (or whatever it is called) at compile time and maintain, in that function/class, a list of the needed attribute names plus flags indicating whether they have already been filled. That might save a few reflections and casts.

But on the other hand, I agree that this should be considered an improvement only if measurements and customer requirements indicate that the current solution is not sufficient.

leongor commented 7 years ago

@brandtol Regarding point 1): good point about 'parseJSON' not being called before a query - I've added a check for an empty document. About the "default values" approach in both functions: there are only two main choices I see, throw an exception (which in most cases takes down the whole PE) or apply defaults. Neither is ideal, but customers usually prefer the optimistic approach when they have millions of JSON events with different schemas, or events that are only partially filled. Of course there are other cases where such an approach is not acceptable.

brandtol commented 7 years ago

@leongor Re 1)

Throwing an exception if the document is empty is fine, I think, so the user will definitely notice :-) For signaling other errors from queryJSON, couldn't you just pass in another int/bool/enum parameter that will contain the outcome of the operation?

queryJSON(rstring jsonPath, T defaultVal, E jsonIndex, mutable uint32 errorCode)

errorCode could be 0 if the value was taken from the JSON, 1 if the JSON did not contain the attribute, and 2 if the value could not be converted. Or something similar... You could also have two versions of the functions, one with an error code and one without.

leongor commented 7 years ago

@brandtol Good idea. I have added an overloaded 'queryJSON' with 3 parameters (as before) and with 4 parameters. It should be called like this:

uint32 rc = parseJSON(jsonString, JsonIndex._1);
if(rc == 0) {
   mutable JsonStatus.status status = JsonStatus.FOUND;
   otuple.str = queryJSON("/root/str", "", status, JsonIndex._1);
   println(status);
}

'JsonStatus.status' is a predefined enum type:

public composite JsonStatus {
    type
        static status = enum{FOUND, FOUND_WRONG_TYPE, FOUND_NULL, NOT_FOUND};
}
rnostream commented 7 years ago

@leongor Please add appropriate documentation about this solution so that anyone who wants to use it is aware of all preconditions and side effects. Additionally, please add test cases (functional, e.g. accessing JSON values, as well as non-functional). I'm looking at the code and trying to understand it, but I can't prove everything via code review. As this solution is sensitive with respect to threading, I would especially expect tests that exercise the new functions with different input threads.

bmel commented 7 years ago

Use case: Numeric or Boolean data that are enclosed in double quotes, in the incoming JSON messages.

Problem: I noticed that when a JSON attribute value is numeric but quoted (enclosed in double quotes), I’m getting the default value back (in my case: 0.0).

Change request: Please return the numeric values even when they are quoted (when the default value is numeric). Likewise for Boolean values, when the default value is true or false.

schubon commented 7 years ago

@bmel: Is your comment aimed at the new functions discussed in this issue? If you don't mean the functions discussed here, please open up a new issue for that.

rnostream commented 7 years ago

@bmel -1. As JSON defines it, any value in quotes is a string, and string content is transparent except for the characters that must be escaped. In contrast, a JSON number has a defined encoding. So what encoding will the numeric string inside a string value have? If it is the one JSON defines for numbers, why doesn't the sender provide correct JSON with a number value? Casting from a string value is not part of decoding JSON. It has to be done outside, because the encoding of the numeric string is not necessarily common; it is specific to sender and receiver. I don't see any need to do this in a common JSON decoding function.

bmel commented 7 years ago

@schulz2

@bmel: Is your comment aimed at the new functions discussed in this issue? If you don't mean the functions discussed here, please open up a new issue for that.

I meant the new functions.

leongor commented 7 years ago

@rnostream The JSONToTuple operator supports implicit conversion from a JSON string value to a numeric tuple attribute, so I added it to queryJSON too.

rnostream commented 7 years ago

@leongor I'm still not convinced about an implicit type conversion from string, even if it is part of JSONToTuple. @bmel What is your use case where a JSON-sending entity is not able to write a number as a number? In any case, only catching the exception and writing just the default will be a source of confusion when the cast fails. Getting the exception is an error case that needs a) setting an error indication and b) logging the event.

leongor commented 7 years ago

@rnostream It's not only about catching the exception and returning a default: status 'FOUND_WRONG_TYPE' is also returned in that case. If the cast succeeds, status 'FOUND_CAST' is returned.
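
To illustrate at the SPL level (a hypothetical sketch; the quoted value and the FOUND_CAST status follow the description above):

    // "pi" arrives quoted in the JSON
    uint32 rc = parseJSON("{\"root\" : { \"pi\" : \"3.1416\" }}", JsonIndex._1);
    mutable JsonStatus.status status = JsonStatus.FOUND;
    float64 pi = queryJSON("/root/pi", 0.0, status, JsonIndex._1);
    // status == FOUND_CAST: the quoted string was cast to float64 (pi == 3.1416)
    // status == FOUND_WRONG_TYPE: the cast failed and pi keeps the default 0.0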

rnostream commented 7 years ago

@leongor Yes, you are right; I missed that line because Git's diff is not very compact here, due to formatting changes in the code.

bmel commented 7 years ago

@rnostream

What is your use case where a JSON-sending entity is not able to write a number as a number?

In general, the use case is when we don't have control over the source data set, but ahead of SPL creation we can somehow determine the nature of the data (numeric etc.) despite it being quoted.

Our case in more detail:

  1. The data are provided or gathered by a third party or authority. Our users use them for data mining, analytics, machine learning, etc. Real-life data formats are often sloppy or format-unaware and simply quote everything, even numbers and Booleans.
  2. In a guided process, before building a streaming pipeline, our code and/or the user inspect some sample data from the source data set and determine the actual data type (string, number, Boolean).
  3. When building a pipeline, the model created on the canvas is translated into SPL code (code generation). At this point the codegen code 'knows' the actual data type as determined in step 2. The call to queryJSON() therefore explicitly requests e.g. a numeric value, by virtue of supplying 0.0 as the default value. Thus we are not asking to change the defaults or the pure JSON parsing logic; we are merely asking that non-string values be returned even when quoted, when we know their nature.

I can point you to demos of our product, on request. Hope this helps.

rnostream commented 7 years ago

I understand the use case of trying to process as many different data format variations as possible, but converting a string holding a numeric value representation to a numeric value has some implications. JSON has a strong definition of the numeric value representation; as such it is independent of languages and locales. Anyone can generate and understand JSON, anywhere. If you want to convert a string holding a numeric value representation, that is no longer the case. Depending on the generating and reading tools/libraries, the locale comes into play, as well as application-specific formatting on the source side (e.g. white space). So either you set a strict definition of what the numeric value representation in the string has to look like, and thus need utilities that are independent of the locale on both sides; or you define it like the classic locale, being unable to process data from other locales; or you need the locale of the source in order to convert. In any case, boost lexical_cast is locale-based, so the suggested conversion is not the generally usable one you would need to process "any" input data. I see two ways here:

  1. The implicit cast is based on the JSON-defined numeric value representation, which corresponds to the classic locale. In this case a JSON string containing a numeric value representation that is only valid in other locales can't be converted, so such clients' data can't be processed. This clearly needs to be documented! Further, the implementation needs to be changed to force boost::lexical_cast to use the classic locale, by setting that locale explicitly. Otherwise the locale may be set differently by a Streams application using this toolkit function on-prem, with the result that the JSON representation and the locale-based representation used for the cast differ and no longer meet the requirement from the first sentence. This option has restrictions but needs only slight changes.

  2. Define a set of functions based strictly on JSON alone, and as such providing valid results (a string is a string). These functions are generic and low-level. Put the cast at a higher level by adding functions that wrap the first ones. That gives a clean separation between pure/clean JSON and a possibly more complex stringToX cast. From a logical view this is in fact a different level, on top of JSON, and some more steps may be necessary for a successful cast. This option is more complex and needs to know the locale of the source, but on the other hand it could also support clients' data containing strings with a numeric value representation different from JSON/classic locale.

leongor commented 7 years ago

@rnostream I think your suggestion to force the classic locale for the implicit string conversion is the right decision, because this conversion is intended for "dirty" JSONs and not for locale-specific number representations. I've added it to the code.

leongor commented 7 years ago

Added samples/JSONToSPLConvert with extractFromJSON and queryJSON samples.

brandtol commented 7 years ago

+1 for merging this pull request

schubon commented 7 years ago

@leongor: Going to merge a.s.a.p. ... anything more you wanted to add before the merge?

leongor commented 7 years ago

@schulz2 I can copy a few lines of SPLDoc into function.xml, as @brandtol asked. Do you want to wait, or should I add it later?

schubon commented 7 years ago

@leongor I'll wait and pull it all in at once then.

schubon commented 7 years ago

@leongor Any idea when you'll be able to deliver the few lines?

leongor commented 7 years ago

Sorry for the delay - added some small fixes in addition to the SPLDoc.