dmwm / PHEDEX

CMS data-placement suite

numbers in JSON format become strings #599

Open. ericvaandering opened this issue 10 years ago.

ericvaandering commented 10 years ago

Original Savannah ticket 61533 reported by huangch on Mon Jan 18 08:15:50 2010.

Numbers in JSON format become strings in all APIs. As an example, look at the "groupusage" API.

http://cmsweb.cern.ch/phedex/datasvc/json/prod/groupusage

attributes such as "dest_bytes" are formatted as

`"dest_bytes":"2747357325697258"`

instead of

`"dest_bytes":2747357325697258`

This presents a problem in some clients, which end up trying to sum strings and throwing an error.
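To make the client-side failure mode concrete, here is an illustrative snippet (the field name mirrors the groupusage output; the rows themselves are made up):

```javascript
// Rows as a client currently receives them: dest_bytes arrives as a string.
const rows = [
  { group: "groupA", dest_bytes: "2747357325697258" },
  { group: "groupB", dest_bytes: "1000" },
];

// Naive sum: '+' on a string concatenates instead of adding.
const naive = rows.reduce((total, row) => total + row.dest_bytes, 0);
console.log(typeof naive); // "string" -- not a sum at all

// The client has to defensively convert each value first.
const fixed = rows.reduce((total, row) => total + Number(row.dest_bytes), 0);
console.log(fixed); // 2747357325698258
```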

The JSON::XS documentation says that if a scalar was last used as a string, it will be mapped to a JSON string.

http://search.cpan.org/~mlehmann/JSON-XS-2.27/XS.pm#MAPPING
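A minimal Perl illustration of that mapping rule (the exact behavior can vary across JSON::XS versions, so treat this as a sketch):

```perl
use strict;
use warnings;
use JSON::XS;

my $json  = JSON::XS->new->canonical;
my $bytes = 2747357325697258;

print $json->encode( { dest_bytes => $bytes } ), "\n";   # {"dest_bytes":2747357325697258}

# Merely using the scalar in string context (interpolation, logging, ...)
# flags it as a string, and a subsequent encode may emit a JSON string:
my $log = "dest_bytes is $bytes";
print $json->encode( { dest_bytes => $bytes } ), "\n";   # e.g. {"dest_bytes":"2747357325697258"}

# Forcing numeric context ('numification') restores a JSON number:
print $json->encode( { dest_bytes => $bytes + 0 } ), "\n";   # {"dest_bytes":2747357325697258}
```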

As far as I know, nowhere in the data service do we explicitly treat result values as a string.

It is possible that all variables which come out of DBD::Oracle are considered strings internally by Perl... There are ways to set column types in DBI; maybe that would fix the problem.

If the only way to solve the problem is to crawl through the output object and explicitly set numbers to numbers, this bug may be too expensive to fix in the server...
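A brute-force crawl of that kind might look like this (a hypothetical helper, not PhEDEx code, shown only to make the cost visible):

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: recursively walk a nested hash/array structure
# and numify every numeric-looking leaf in place. Simple, but it touches
# every value of every result -- the overhead the ticket worries about.
sub numify_deep {
    my ($ref) = @_;                      # reference to the slot to fix
    my $type = ref $$ref;
    if ( $type eq 'HASH' ) {
        numify_deep( \$$ref->{$_} ) for keys %{$$ref};
    }
    elsif ( $type eq 'ARRAY' ) {
        numify_deep( \$_ ) for @{$$ref};
    }
    elsif ( !$type && defined $$ref && looks_like_number($$ref) ) {
        $$ref += 0;                      # force a numeric representation
    }
}

my $data = { group => 'groupA', dest_bytes => '1000', nodes => [ '42', 'T1_XX' ] };
numify_deep( \$data );
# dest_bytes and the '42' element are now numbers; the names stay strings.
```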

ericvaandering commented 10 years ago

Comment by wildish on Tue Jan 17 09:11:03 2012

Valentin just raised this issue with me, I append my comments to him here, for the record.

I see three possible solutions:

1) Explicitly try to convert all fields to numbers before returning the data (attempt 'numification' of strings). This could be done by simply walking the data structure and trying to numify everything. That's clearly got a large overhead associated with it.

2) Explicitly numify only the fields that should be numeric. That means managing more metadata (which fields of which APIs are to be checked, etc.).

Since we also provide Perl and XML format output, and they don't have this problem, these fixes would be best applied only in the JSON output formatter. Option 1) is trivial to code but hugely expensive; option 2) is non-trivial to code and greatly complicates our APIs.

Option 3 is to simply rely on coding things 'right' to avoid this problem, and to explicitly test for it afterwards. That's actually more complex than option 2), since it means having the same metadata to say which fields should be numbers, and using it in a client tool that examines the results rather than on the server that creates them. It also catches things too late; we may only see issues when we're about to release. So that's not a good idea.

I guess there's an option 4, which is to introduce a new format 'jsonparsed', or something like that. Clients could ask for that if they are willing to wait for the overhead on the server to handle the numification. We may go that way eventually, following the discussion Lassi initiated last year about more compact, faster data structures (CSV-like tables with headers instead of hashes/dictionaries/objects), but that's not going to happen anytime soon.

ericvaandering commented 10 years ago

Comment by huangch on Wed Jan 18 22:42:37 2012

I stared at this for a day, partly trying to recall where we stopped two years ago and partly thinking about a reasonable solution.

The issue is deep in DBD::Oracle, which PhEDEx does not own: it simply returns everything as strings. Perl, by design, doesn't care, since it does conversions on the fly all the time; after all, the driver is a Perl module. To root out the issue, one would have to modify DBD::Oracle. That's where we stopped.

I'd propose a reasonable solution 5), which is a variation of 2) with a concept similar to 4). Since [most] DataService APIs use PHEDEX::Core::Util::flat2tree() to build their output structure, the numification could be done there, since it has to traverse/build the structure anyway.

The implementation of extra metadata (type info) could be in the definition of output structure used by flat2tree(). For example, instead of "bytes => 'BYTES'", we may use "bytes => 'BYTES#'" to indicate (a) the field "bytes" has value from $result{BYTES} and (b) it has to be a number (the trailing '#'). This should work for all formats and it does not punish JSON unfairly. There are a handful of APIs that do not use flat2tree(). In case they need to numify the output, it could be done manually in place.
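The trailing-'#' convention could be handled in a few lines wherever the spec is resolved against a result row (an illustrative fragment, not the real flat2tree() internals):

```perl
use strict;
use warnings;

# Illustrative fragment: resolve output fields from a result row,
# treating a trailing '#' in the spec as "numify this value".
my %result = ( BYTES => '2747357325697258', NAME => 'group name' );

my %out;
for my $spec ( 'BYTES#', 'NAME' ) {
    my $field   = $spec;
    my $numeric = ( $field =~ s/#$// );   # strip the marker, remember it
    my $value   = $result{$field};
    $value += 0 if $numeric;              # BYTES becomes a number, NAME stays a string
    $out{$field} = $value;
}
```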

ericvaandering commented 10 years ago

Comment by wildish on Thu Jan 19 03:53:01 2012

It's not just a problem with the DBD::Oracle module, in that any quantity may be numified or stringified by accident during post-processing. E.g., simply constructing a string for logging or debugging purposes will stringify all the variables it uses!

So the flat2tree idea is good, but we have to be sure that nothing touches the object after it has been numified. That ought to be OK in most cases, since flat2tree appears in the return statement :-)

I'm not sure about the 'BYTES#' idea, since that requires a string comparison and manipulation. It might be better to use a sub-hash, and test the value to see if it's a REF instead of testing it for a trailing #.

That would give us something like `bytes => { field => 'BYTES' }`, with code in flat2tree() looking something like:

`if ( ref($h) ) { $value = numify($h->{field}) } else { $value = $h; }`

Using a hash instead of a string would allow us to add more fields to it. E.g., 'numify' could be the default 'formatter', but other formatters could be specified:

`time => { field => 'time', formatter => 'unixEpochToUTC' }` or `cksum => { field => 'cksum', formatter => 'parseCksumString' }` (to convert "cksum:123,adler:abc" into `{ cksum => 123, adler => 'abc' }`, which is what Valentin would like).
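A minimal sketch of that dispatch (the names `%formatters` and `apply_spec` are invented here for illustration; this is not PhEDEx code):

```perl
use strict;
use warnings;

# Invented names for illustration: a formatter table plus a resolver
# that understands both plain-string specs and the sub-hash form.
my %formatters = (
    numify           => sub { $_[0] + 0 },
    parseCksumString => sub {                       # "cksum:123,adler:abc" -> hash
        return { map { split /:/, $_, 2 } split /,/, $_[0] };
    },
);

sub apply_spec {
    my ( $spec, $row ) = @_;
    return $row->{$spec} unless ref $spec;          # old-style: plain column name
    my $value = $row->{ $spec->{field} };
    my $fmt   = $formatters{ $spec->{formatter} || 'numify' };
    return $fmt->($value);
}

my %row   = ( BYTES => '1000', CKSUM => 'cksum:123,adler:abc' );
my $bytes = apply_spec( { field => 'BYTES' }, \%row );   # 1000, numified by default
my $sums  = apply_spec( { field => 'CKSUM', formatter => 'parseCksumString' }, \%row );
# $sums is { cksum => '123', adler => 'abc' }
```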

Most of our APIs would have no need for extra formatters as we stand today. However, adding formatters here could help save time for specialised APIs for use in clients, e.g. web pages. The data-subscriptions page, for example, has to parse a lot of dates, and convert bytes to MB/GB/TB/PB as appropriate. Adding custom formatters in the server could allow us to reduce the load on the client, which might make a net performance gain.

It's a bit more work, perhaps, but it does open some possibilities.

ericvaandering commented 10 years ago

Comment by huangch on Thu Jan 19 11:13:30 2012

Since flat2tree() already uses a hash for sub-structures, I would suggest using an array to specify `[<field>, <type>]`, such as `'bytes' => ['BYTES', 'int']` and `'checksum' => ['CHECKSUM', 'parseCksumString']`.

The <type> implies the format. In fact, we may allow <type> to be a code reference, so that the processing/conversion function could be supplied by the caller.
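The array form, with a caller-supplied code reference, could look like this (illustrative only, not the committed implementation):

```perl
use strict;
use warnings;

# Illustrative resolver for the [<field>, <type>] form; <type> may be a
# named built-in or a code reference supplied by the caller.
my %builtin = ( int => sub { int( $_[0] + 0 ) } );

sub resolve {
    my ( $spec, $row ) = @_;
    return $row->{$spec} unless ref $spec eq 'ARRAY';
    my ( $field, $type ) = @$spec;
    my $conv = ref $type eq 'CODE' ? $type : $builtin{$type};
    return $conv ? $conv->( $row->{$field} ) : $row->{$field};
}

my %row   = ( BYTES => '1000', NAME => 'groupA' );
my $bytes = resolve( [ 'BYTES', 'int' ], \%row );             # 1000 as a number
my $upper = resolve( [ 'NAME', sub { uc $_[0] } ], \%row );   # caller-supplied conversion
```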

ericvaandering commented 10 years ago

Comment by huangch on Fri Jan 20 00:18:54 2012

I have implemented it in PHEDEX::Core::Util::flat2tree(). I also modified PHEDEX::Web::API::BlockReplicas.pm to specify files, bytes and timestamps as int. Both have been committed to CVS. If it works fine, I'll do the same treatment to other APIs.

One observation: Dumper(), used by the XML formatter, does not print large numbers correctly. A number larger than 2^33 + X (where X < 2^33) will always show up as a string. However, internally it seems to be a number. See the bytes in both perl/prod/blockreplicas and json/prod/blockreplicas: in the Perl output, bytes are shown as strings, while in the JSON output they are correctly shown as numbers.

The principle for choosing which fields to numify needs to be defined. It could be: [1] numify every field that is of a number type in the database, or [2] numify only the fields that may be numerically operated on.

So far, I am leaning toward [2]. For example, in BlockReplicas the block id is a number in the database but will not be numerically operated on, so it was left as a string; files, bytes and timestamps could be operated on, so they are numified. The drawback of [2] is losing the consistency of [1], though.

ericvaandering commented 10 years ago

Comment by huangch on Fri Jan 20 00:23:09 2012

Ahh, in the last comment, one of the double asterisks was eaten by the rich markup interpretation. It should read: 2 to the power of 33.

ericvaandering commented 10 years ago

Comment by wildish on Fri Jan 20 05:04:14 2012

I just checked, and it seems we are using DBD::Oracle version 1.23. The latest version is 1.37, and it does have support for specifying the data-type to return from a query! So that looks like it might be our best solution. Of course, this will need some careful testing, to make sure we don't break anything with the upgrade.
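For reference, newer DBI lets a client request a column type at bind time; something along these lines (an untested sketch with placeholder connection details and table name; exact attribute support depends on the DBI and DBD::Oracle versions in use):

```perl
use strict;
use warnings;
use DBI qw(:sql_types);

# Untested sketch; the DSN, credentials and table name are placeholders.
my $dbh = DBI->connect( 'dbi:Oracle:DBNAME', 'user', 'password', { RaiseError => 1 } );
my $sth = $dbh->prepare('SELECT dest_bytes FROM t_group_usage');
$sth->execute();

my $bytes;
# TYPE asks the driver to return the column as a number; DiscardString
# (where supported) drops the string buffer so serializers like JSON::XS
# see a pure number rather than a stringified one.
$sth->bind_col( 1, \$bytes, { TYPE => SQL_DOUBLE, DiscardString => 1 } );
while ( $sth->fetch ) {
    # $bytes is numeric here
}
```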

We doubtless have a few other modules that are showing their age too, if we only look...

If that doesn't work out, or if we still have string-/num-ification issues (we might, from post-processing side effects), then I think the idea of using an array is a good one.

As for the choice of which fields to process, I'm more inclined to go towards option 1. I agree that we are never going to add block IDs etc, but it's not inconceivable that we might want to order them, or slice out a range between two values. That could happen if we paginate a display of a large number of blocks, for example.

Chih-Hao, could you take a look at the latest DBD::Oracle module, and see if it might solve our problem? I don't know if it's easy enough to use, or if it has too many consequences for the rest of our code, or what...

ericvaandering commented 10 years ago

Comment by huangch on Tue Feb 21 22:35:36 2012

The solution was to create a statement-handle wrapper class which remembers the types of the fields and performs numification in the fetchrow() and fetchall() methods. Once instantiated with a statement handle, it looks and works like a statement-handle object that returns numbers with the correct types.

PHEDEX::Web::THM.pm was created for this purpose, and PHEDEX::Web::SQL.pm has been updated to use this wrapper class.
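The wrapper idea can be sketched like this (a hypothetical class name and a minimal method set; the real PHEDEX::Web::THM surely differs in detail):

```perl
package My::NumifySth;    # hypothetical stand-in for PHEDEX::Web::THM
use strict;
use warnings;

# Wrap a DBI-style statement handle and numify the named columns
# whenever a row is fetched, so callers see correctly typed values.
sub new {
    my ( $class, $sth, @numeric_columns ) = @_;
    my %numeric = map { $_ => 1 } @numeric_columns;
    return bless { sth => $sth, numeric => \%numeric }, $class;
}

sub fetchrow_hashref {
    my ($self) = @_;
    my $row = $self->{sth}->fetchrow_hashref or return;
    for my $col ( keys %$row ) {
        $row->{$col} += 0
            if $self->{numeric}{$col} && defined $row->{$col};
    }
    return $row;
}

1;
```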

The change has been committed to CVS and will be deployed in the next release.