IBMStreams / streamsx.json

Toolkit for working with JSON in SPL applications.
http://ibmstreams.github.io/streamsx.json/

Request: Optional dropping of '.0' for round floats/decimals in tupleToJSON #107

Closed: bmel closed this issue 4 years ago

bmel commented 6 years ago

Summary

Requesting approval to add an option for dropping the trailing .0 in round floats/decimals, when formatting tuples to JSON. Example: 77.0 vs 77.

Details

Background

In Streams Designer (SD), the user type 'Number' is used for integer and decimal numbers. This simplifies the product usage, and users don't need to know or decide which exact number type to use for numeric data. Under the hood, SD uses the SPL type float64 (and possibly, at some point, decimal128), in order to accommodate even the largest integers or decimals.

Until now, SD has been turning tuples into JSON by string concatenation and value formatting in straight SPL. SD could not use the JSON toolkit because it did not offer the required key-naming flexibility. With SPL's number formatting, round numbers come out without a trailing .0, even though they are always of type float64 in the SPL code generated by SD.

Now, for reasons of performance and "DRY", SD is going to use the JSON toolkit for converting tuples to JSON. To this end, the optional mapping parameter 'keysToReplace' is currently being added to the 'tupleToJSON' function in a GitHub fork.

Problem

When using the JSON toolkit, even round decimals come out in dot notation. Example: 7 -> 7.0. This is confusing and looks wrong when the numbers are conceptually integers.

Request

Asking for a green light to present round float/decimal numbers in JSON formatting without a trailing .0. SD-requested enhancements in this toolkit have so far been implemented by @leongor. I'm now taking over this role, and so am seeking consensus for me to go ahead, implement this option, and create a PR.

ddebrunner commented 6 years ago

I'd be concerned about performance of conversion for the typical case where one doesn't care about the JSON representation.

It seems this is more of an SD issue: if an integer representation is required, then use SPL int64, not a floating-point type?

ddebrunner commented 6 years ago

Didn't mean to close the issue.

leongor commented 6 years ago

This issue is about the native functions formatting to JSON, not the operator. An additional parameter is proposed to allow numbers to be formatted depending not only on their type, but also on their value.

bmel commented 6 years ago
  1. SD does not require the users to specify, for simplicity, whether the numbers are ints or decimals. Therefore, at codegen/compile time, SD does not 'know' whether the numbers are ints or decimals. Requiring users to distinguish between ints and decimals would require a significant change to the SD spec and would break backwards compatibility. If I saw a way to address the runtime distinction in SD-generated SPL, I gladly would. As it stands, seemingly the only way to keep the round presentation is to stay with the inefficient concatenation in straight SPL and forgo the much more efficient native C++ function of the JSON toolkit. (Or spawn an SD-specific JSON toolkit by copy-and-modify?) This is blocking an awaited performance effort: https://github.ibm.com/wdp-streaming-pipelines/streaming-pipelines-planning/issues/1772

  2. To add to @leongor's comment: the overhead for current use cases could be, e.g., a check whether the optional parameter (e.g. "dropTrailingDotForRoundDecimals") evaluates to true, per float/decimal tuple attribute, or once per tuple.

  3. Rendering round decimals without a trailing .0 is what SPL's string concatenation does, too. So it's not an exotic format, and it might be useful to other users of the JSON toolkit.

ddebrunner commented 6 years ago

SD does not require the users to specify, for simplicity,

Not sure how useful that will be for real applications. You need to know the accuracy of a numeric data type so you can understand if your calculations will be correct.

ddebrunner commented 6 years ago

I would disagree with additional parameters; it's just adding complexity.

It is true that floating-point string representations typically round from the true value to the fewest digits needed to represent it, though Java's specification includes this statement: "There must be at least one digit to represent the fractional part."

As I said, my concern is performance, so getting benchmark numbers on producing the existing and the proposed format would be helpful.

bmel commented 6 years ago

Proposal with reduced complexity

I discussed this with @leongor. Our suggestion, without introducing additional arguments: the current API can be used as before, and by default it will behave exactly as before. We would introduce an optional 'options' state for customizing the output. Each option will have its own setter. Examples:

tupleToJSON_setDropTrailingDotZeroForRoundDecimals(true); // default: false
tupleToJSON_setTimestampFormat("%FT%T"); // example: 2017-07-10T12:55:16
tupleToJSON_setKeyNamingMap({"___long":"long"});

Example usage in a Custom operator:

logic
    state :
        tupleToJSON_setDropTrailingDotZeroForRoundDecimals(true);
    onTuple sourcePort : {
        submit({json = tupleToJSON(sourcePort)}, targetPort);
    }

The stateful options would be local per operator.

Advantages

No new function arguments are introduced, and by default the current API behaves exactly as before.

Benchmark numbers will be provided.

bmel commented 6 years ago

SD does not require the users to specify, for simplicity,

Not sure how useful that will be for real applications. You need to know the accuracy of a numeric data type so you can understand if your calculations will be correct.

SD parses all incoming numbers into the widest numeric type (float64). When operators output numbers of other types, they also get converted to float64 by code-generated SPL. Therefore, calculations should be correct. To avoid creeping rounding errors, moving to decimal128 is being discussed, pending benchmark comparisons.

ddebrunner commented 6 years ago

Therefore, calculations should be correct.

Correct according to float64, not correct if the application was expecting 64-bit integral arithmetic.

ddebrunner commented 6 years ago

The stateful options would be local per operator.

Curious how you get the operator state from a C++ function call?

bmel commented 6 years ago

Auto-generated type conversions

Therefore, calculations should be correct.

Correct according to float64, not correct if the application was expecting 64-bit integral arithmetic.

Application? If a processing operator expects a non-float64 number type as input, then SD codegen will inject the appropriate type conversion as needed. Likewise, when an operator outputs a non-float64 number type, it will be converted to float64 for further processing. Obviously, if at some point this leads to measurable performance cost, SD codegen should be able to optimize unnecessary type conversions away. But in the opinion of the SD product designers, users should not need to get involved at this level of machinery.

State access in C++ functions

The stateful options would be local per operator.

Curious how you get the operator state from a C++ function call?

Using a static object and an anonymous namespace (to isolate from other operators in the same PE). @leongor, please correct me if I got this wrong.

leongor commented 6 years ago

@ddebrunner Actually it doesn't access the operator state. The function encapsulates the options as a static object with lazy initialization (run only once), so it can be placed anywhere. A similar approach is used when the parseJSON function shares a parsed object with the queryJSON function (and also in the streamsx.regex toolkit, to compile a regex once). The anonymous namespace ensures that each compilation unit (e.g. operator) gets its own copy of the function.
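
To illustrate @leongor's description, here is a minimal C++ sketch of that pattern, assuming hypothetical names throughout (this is not the toolkit's actual code):

namespace {

// Options shared by the tupleToJSON native functions within one
// compilation unit (i.e. one operator).
struct JsonFormatOptions {
    bool dropTrailingDotZero;
    JsonFormatOptions() : dropTrailingDotZero(false) {}
};

// Lazily initialized on first call, then reused for the life of the PE.
inline JsonFormatOptions & jsonFormatOptions() {
    static JsonFormatOptions options;
    return options;
}

// Setter that would be exposed to SPL as a native function.
inline bool tupleToJSON_setDropTrailingDotZeroForRoundDecimals(bool value) {
    jsonFormatOptions().dropTrailingDotZero = value;
    return value;
}

} // anonymous namespace: each compilation unit gets its own copy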

ddebrunner commented 6 years ago

@bmel It doesn't really matter what SD does with numbers; an application developer needs to understand the behaviour of numeric types. If a financial application were being developed, I think it's extremely unlikely a floating-point type would be selected, as it can lead to rounding errors.

bmel commented 6 years ago

Right. As you can see here, this has been thought of: https://github.ibm.com/wdp-streaming-pipelines/streaming-pipelines-planning/issues/1069

That's why at some point, SD shall move to using decimal128, which should be large and exact enough for most applications.

Should some scientific applications require computations with numbers that are closer to 0 or further from 0 than decimal128 allows, then there are some possible approaches. For example:

  1. Users 'normalize' their numbers by multiplying or dividing by the power of ten that is common to those numbers.

  2. Have a 'scientific' switch/flag (per operator, or per flow), and SD will generate float64 types instead of decimal128.

  3. Have an 'advanced user' mode, where users can choose between decimal128 and float64, per numeric tuple attribute, when they define their schemas. (Or maybe even all the other numeric types, if this can squeeze out better performance.)

rnostream commented 6 years ago

Some thoughts on this topic. I'm not sure I fully understand the SD architecture/design, but I have some points to share.

Regarding JSON

JSON is an M2M data interchange format, not a UI! Even if it is human-readable, who should be confused by .0? JSON relies on a common understanding of the JSON content between sender and receiver, and I would expect them to agree on "number" rather than on a special representation. A trailing .0 is a valid JSON number representation. The JSONToTuple operator will convert the JSON number representation 10.1 to the int32 attribute value 10 if the tuple schema defines the attribute as int32. It is the JSON reader's decision how a number representation is converted to a local type, so a .0 is no problem; information loss through casting is the receiver's decision. Do you already have problems on the receiving side with JSON strings containing .0?

Regarding the suggestion of 'cutting the .0'

You complain that a trailing .0 may be wrong/confusing. But a straight cut of every round float/decimal may also be confusing: if a real float/decimal value is accidentally round, it will be shown without .0, and you may then worry that it could be interpreted as an integer value. Circumventing this with a "list of attributes to be handled" would be an implementation hitting only your very specific requirement, not a common use case for Streams applications. Do you know the exact type of the data you want to send via JSON? If yes, why not generate a correct tuple which can be converted to JSON (apart from the copy effort)? This is the Streams-like way other applications go.

Regarding floating-point arithmetic

Calculation would be OK, but comparison using literals is a problem anyway, isn't it?

bmel commented 6 years ago

I hear essentially the following objections to this request:

1. It's not important b/c JSON is not for human consumption.

A: But in order to understand their data, humans inspect it either directly in JSON messages or in a UI that pulls the values from the JSON as-is, because it lacks the metadata to know how to format the values. The output of SD-generated applications should not be confusing to humans.

2. SD should enforce typing knowledge, just as other streams apps do.

A: But SD is not strictly a Streams application; it's an abstraction layer. It's supposed to make Streams authoring easier, and abstracting away number types is one of its usability features. Thus, it does not know at compilation time which numerical attributes are limited to ints.

3. It's not possible to obtain correct computations when mixing ints and decimals.

A: But other languages do it, even those that compile to strongly typed ones, such as JavaScript on V8.

4. The requirement is SD-specific, since regular streams applications will use strong typing.

A. If this is indeed so, then I can still plead that the API change won't be intrusive, as explained in https://github.com/IBMStreams/streamsx.json/issues/107#issuecomment-398395125

If all this does not succeed in making a case for having this option in the JSON toolkit, then I'll drop this request for now. Should SD users complain about the weirdness of .0, we can come back here, or I can simply copy the C++ functions from the JSON toolkit into SD-private space and implement the formatting options there. It would obviously be unfortunate for SD to lose the benefit of future JSON toolkit improvements by the community.

Details

JSON formats

JSON is an M2M data interchange format, not a UI! Even if it is human-readable, who should be confused by .0?

JSON has taken over as a human-readable format for data in general, often displacing CSV, for example. And humans, including SD users, eyeball such data in order to better understand what it represents and how they can use it. (Not whether they can store a number in uint16 or float32.)

This looks weird: { "number_of_children" : 2.0 }

But this, although it represents a real number, does not: { "capacity_in_liters" : 550 }

Number typing

Picture this abstraction as:

  1. SD 'compiles' (code generation) the flow graph into SPL + Python
  2. Streams then builds a sab file from that, where SPL is compiled to C++.

Conceptually, this can be compared to how JavaScript is run in V8, which is written in C++.

SD users write any computation (business) logic in Python, where mixing ints and decimals in an expression is not a problem, even when they are literals: https://stackoverflow.com/questions/10037115/comparing-a-float-and-an-int-in-python

The Topology layer converts all float/decimal output values at the output of Python code operators.

Other languages also absolve users from worrying about how numbers are stored, through weak or dynamic typing, or by auto-converting where needed, e.g. JavaScript and R. Actually, in JavaScript there is no distinction at all; it's just 'number'. Possibly this was the very inspiration for the parallel design decision in SD.

Once SD uses the SPL type decimal128 for numbers, exact comparisons and accumulating rounding errors won't be an issue.

ddebrunner commented 6 years ago

Actually, in JavaScript, there is no distinction at all. It's just 'number'.

Technically, JavaScript has a single type for numeric values, Number, which is a 64-bit floating-point type. That doesn't absolve users from worrying about how numbers are represented.

https://www.w3schools.com/js/js_numbers.asp http://www.ecma-international.org/ecma-262/6.0/index.html#sec-terms-and-definitions-number-value

ddebrunner commented 6 years ago
It's not possible to obtain correct computations when mixing ints and decimals. A: But other languages do it, even those that compile to strongly typed ones, such as JavaScript

JavaScript performs computations correctly according to 64-bit IEEE 754 floating point; that doesn't mean it supports mathematically correct computations. For example, its integer arithmetic breaks down with values bigger than 2^53 (about 9e15).

In the following, e will be true; that is, x == x + 1.

var x = 123e54;   // far above 2^53, where the gap between adjacent doubles exceeds 1
var y = 1;
var z = x + y;    // adding 1 is lost to rounding, so z has the same double value as x
var e = z == x;   // true

rnostream commented 6 years ago

RAPIDJSON, the base library for the native functions of the JSON toolkit, supports only double as a floating-point type. Any SPL floating-point value (incl. decimal128) is cast to double in the end. Please keep this in mind.
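
For illustration only, a minimal sketch of what the requested option could look like on top of RapidJSON's Writer; the helper name and the flag are hypothetical, not the toolkit's actual code:

#include <cmath>
#include <cstdint>
#include <rapidjson/writer.h>

// Hypothetical helper: emit a float64 as an integer literal when the option
// is enabled and the value is integral and safely representable as int64.
template<typename Writer>
void writeFloat64(Writer & w, double v, bool dropTrailingDotZero) {
    if (dropTrailingDotZero && std::floor(v) == v && std::fabs(v) < 9.0e15) {
        w.Int64(static_cast<int64_t>(v));   // emits e.g. 7
    } else {
        w.Double(v);                        // emits e.g. 7.0
    }
}

With the flag enabled, 7.0 would come out as 7 while 7.5 still comes out as 7.5; with the flag disabled (the default), behavior is unchanged.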

leongor commented 6 years ago

@rnostream This is the current, not ideal, implementation of handling decimals. I've fixed it in my fork, casting the decimal to a string and formatting it as a raw value. I didn't open a PR on it yet because of the other changes in the code. I've added a new parameter called keysToReplace that is mutually exclusive with prefixToIgnore. It is supposed to provide a mapping to create keys that cannot be represented by a tuple attribute name: {"some_key" : "some-key"}. Now waiting for the decision about rounding floats and about moving all the formatting options to a dedicated function.

ddebrunner commented 6 years ago

@leongor Why not separate keysToReplace from truncating .0? They are independent features, right?

leongor commented 6 years ago

@ddebrunner I added it before the new settings-options suggestion, so I'm wondering whether it's worth adding it now or implementing it in the new way right away.

bmel commented 6 years ago

JavaScript performs computations correctly according to 64-bit IEEE 754 floating point; that doesn't mean it supports mathematically correct computations. For example, its integer arithmetic breaks down with values bigger than 2^53 (about 9e15).

Yes, number storage is finite in any representation, and users actually coding computations need to know the respective limitations. I agree that SD should document the limitations of its under-the-hood number type regarding precision-loss characteristics and overflow thresholds.

bmel commented 6 years ago

Back to the JSON formatting issue: the request is for the toolkit to optionally allow JSON formatting in line with what one gets from this JS code, even though 42 is neither declared nor stored as an int in JS:

const data = { productCode: 42, price: 5.12 };
const json = JSON.stringify(data);

Here, json comes out to: '{"productCode":42,"price":5.12}'

leongor commented 6 years ago

I'm going to open a new PR adding a setFormatOptions function that expects the following tuple argument:

public composite JsonFormatOptions {
  type
    static options = tuple< map<rstring,rstring> keysToReplace, rstring prefixToIgnore, rstring tsFormat, boolean tsUTC>;
}

What is the final decision about cutting the fractional part of round floating-point numbers? I can add it as an option whose default behavior is false.

rnostream commented 6 years ago

JSON directly reflects what JavaScript provides: several representations of numbers, all of which define a float value at runtime.

I still think that this is not the Streams-like way, but if you build on top of Streams a layer which handles only one number type (as JS does)... From this point of view, it is understandable that you would like to have '.0'-less representations.

I still wonder whether SD has no sources and sinks which are strongly typed and need user definition of schemas. It seems to limit what is directly usable from Streams.

Please let me have a final internal discussion on this topic. I'll come back on Monday.

bmel commented 6 years ago

Hello @rnostream ,

Thank you for taking the time to see the SD point of view, regarding the need for '.0'-less representations.

I'd like to now address the remaining concerns. You write:

I still wonder whether SD has no sources and sinks which are strongly typed and need user definition of schemas. It seems to limit what is directly usable from Streams.

This is a concern regarding possible limitations due to exposing just one number type in SD despite the Streams Engine's strongly typed system.

Yet automatically reconciling value-type collisions is already an essential part of SD's simplification layer as it stands.

I'd like to address your concern by describing the current type/value adaptation in SD. Please alert me with concrete examples if you think that the SD approach does not cover them.

So here we go...

Conceptual typing in SD

SD exposes four "high-level" types to the user: Text, Number, Boolean, Date. Details here: https://dataplatform.ibm.com/docs/content/streaming-pipelines/data_types.html SD prevents users from using e.g. Text where an operator requires a number, be it int or float. So there is typing in SD; it's just conceptual rather than storage-defining (physical). Mapping between this conceptual typing and the physical typing of Streams is one of the aspects of SD's generation of SPL code from user-designed flows.

Ubiquity of value conversions

Even if SD did separate 'Number' into 'Integer' and 'Float', the challenge would remain that Streams has int8/16/32/64 and float32/64. Values would still need to be adapted/converted.

And it's not just sources and sinks that can be strongly typed, it's also e.g.:

How SD does it

In any case, all in/out schemas of operators, sources, and sinks are known to SD by the time it generates the SPL code. So, during SPL code generation, SD adapts the values between the output and the input port of a 'link', as necessary.

This is needed not just for numbers, which is easily done by e.g. trgPort.quantity = (float64) srcPort.quantity. It is also needed when, for example, populating DB2 sinks with dates, times, and datetimes: DB2 expects them as strings in certain formats, and SD needs to inject into the generated SPL code the formatting of spl::Timestamp values to such strings. I mention this only to illustrate that SD is already heavily in the business of automatically reconciling type or format collisions by converting/adapting values, and int/float conversions are certainly a part of it.

Generated code example

The user flow is: MH -> SPSS -> MH. The SPSS model in this case expects int64 for the attribute salary, but it arrives as float64 after parsing a Message Hub JSON message (SD parses all 'numbers' from JSON into float64). Here is the code that SD generates at the input of the SPSS operator to convert salary from float64 to int64:

@catch(exception=all)
(stream<InputTypeAdapter1_gen_output1_schema> InputTypeAdapter1_gen_output1 as target_a) as InputTypeAdapter1_gen = Functor(JsonToTuple1_gen_output1)
{
    output
        target_a :
            salary = (int64) salary;
}

@catch(exception=all)
(stream<schema1> watson_machine_learning_xa5kx2g1qu_output1 as target_a) as watson_machine_learning_xa5kx2g1qu = com.ibm.spss.streams.analytics::SPSSScoring(InputTypeAdapter1_gen_output1)
{

rnostream commented 6 years ago

@bmel and @leongor, sorry for this very late response. In general, I agree on this now. I'm sure you will take care to keep the interface compatible so that others are not impacted ;-) We will see this in review. An additional note: please provide in-depth documentation and, very important, test cases! Next time there will be a merge of new features, and we need test cases for regression testing. As we haven't changed the test approach for this toolkit until now, please add the test cases in /tests. These tests are self-validating SPL applications started by the makefile. Simple, no huge framework.

rnostream commented 4 years ago

Closing this now, as there has been no further conversation or action on this for 2 years.