Boemska / h54s

Boemska HTML5 Data Adapter for SAS
https://boemskats.com

Rewrite core parser #26

Closed boomskats closed 4 years ago

boomskats commented 8 years ago

Currently we're loading all macro params into a single-variable WORK dataset and using PRX functions to work through one 32K chunk at a time. DS2 might give us a way of parallelising the processing of multiple frames when datasets exceed the 32K limit.
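For anyone unfamiliar with the approach being described, a rough sketch of chunked regex scanning (in Python rather than SAS; the token pattern and carry-over logic are illustrative assumptions, not the adapter's actual parser):

```python
import re

CHUNK = 32 * 1024  # stand-in for SAS's ~32K character variable limit

def scan_chunks(text, pattern=r'"[^"]*"\s*:\s*[^,}]+'):
    """Scan a large JSON-ish string one fixed-size chunk at a time.

    A simplified stand-in for the macro-parameter + PRX approach: real
    chunking has to carry over tokens that straddle a chunk boundary,
    which is part of what makes this style of parser slow and fragile.
    """
    tokens = []
    carry = ''
    for start in range(0, len(text), CHUNK):
        buf = carry + text[start:start + CHUNK]
        matches = list(re.finditer(pattern, buf))
        if matches:
            # keep the tail after the last full match, in case a token
            # straddles the chunk boundary
            carry = buf[matches[-1].end():]
            tokens.extend(m.group(0) for m in matches)
        else:
            carry = buf
    return tokens

data = '{"a": 1, "b": 2, "c": 3}'
print(scan_chunks(data))  # → ['"a": 1', '"b": 2', '"c": 3']
```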

Also look at the JSON package in 9.4M3.

boomskats commented 7 years ago

We have decided that slightly better prep/preprocessing clientside is a better idea - development branch

FriedEgg commented 7 years ago

I agree not to pursue this. The parser that is part of the DS2 JSON package is terrible! You may consider using the JSON library engine, assuming that the data structures being transmitted do not require custom mapping (which I don't think they would for the intended purposes here) it would provide the simplest method for digesting. When sending data rows to SAS you could even emulate the schema that SAS uses itself when outputting data to JSON via PROC JSON (for example)
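Emulating PROC JSON's own export shape, as suggested above, could look roughly like this (the `SASJSONExport` / `SASTableData+NAME` key names follow PROC JSON's default export layout; verify the exact values against your SAS release):

```python
import json

def to_sas_json(table_name, rows):
    """Serialize rows the way PROC JSON lays out an exported dataset:
    a version marker plus a 'SASTableData+<NAME>' array of row objects.
    Key names are assumed from PROC JSON's default output format."""
    return json.dumps({
        "SASJSONExport": "1.0",
        "SASTableData+" + table_name.upper(): rows,
    })

payload = to_sas_json("class", [
    {"Name": "Alfred", "Age": 14},
    {"Name": "Alice", "Age": 13},
])
print(payload)
```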

vnevlev commented 7 years ago

If you move away from your custom solution to a DS2 or JSON library approach, that would make H54S dependent on the SAS 9.4 release, making it unusable for anyone on a SAS version earlier than 9.4. Is it worth it?


FriedEgg commented 7 years ago

@vnevlev that is a valid point of concern. The DS2 package was not released until 9.4M3 (Jul 2015) and the libname engine was not added until 9.4M4 (Nov 2016), which makes it somewhat 'cutting-edge' as far as Base SAS is concerned.

boomskats commented 7 years ago

@FriedEgg @vnevlev Agreeing with both of you here. I think spinning up DS2 each time we want to deserialise a dataset is overkill, and we don't need the full support of the libname engine either: we don't have to adhere to the strict JSON spec, since we're able to prep our dataset at the frontend.
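A rough illustration of the kind of frontend prep being described: masking characters that SAS macro processing is sensitive to before the data ever reaches the server. Only the quote substitution is mentioned in this thread; the other substitutions here are hypothetical extras for the sake of the sketch:

```python
def mask_for_sas(s):
    """Mask characters that are unsafe to pass through SAS macro
    parameters. Doubling the quote (\\" -> "") is the substitution
    mentioned in this thread; the ampersand/percent masking is a
    hypothetical extra guard against macro-trigger resolution."""
    return (s.replace('\\"', '""')
             .replace('&', '&amp;')   # hypothetical macro-trigger guard
             .replace('%', '%%'))     # hypothetical

def unmask_from_sas(s):
    """Reverse the masking, in the opposite substitution order."""
    return (s.replace('%%', '%')
             .replace('&amp;', '&')
             .replace('""', '\\"'))

raw = 'He said \\"hi\\" & left 100%'
assert unmask_from_sas(mask_for_sas(raw)) == raw
```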

For parsing we're testing a much simpler data step-based solution, suggested by Bruno Müller as part of this SAS Communities thread: https://communities.sas.com/t5/SAS-Enterprise-Guide/Reading-json-file-into-SAS-using-SAS-enterprise-guide/m-p/205050/highlight/true#M15389. So we're changing a couple of fundamental parser bits here:

  • different client-side preparation to mask SAS-sensitive chars (i.e. replace \" with "", some other bits), instead of relying on escape() and unescape() and the urlencode and urldecode SAS functions for masking and unmasking of special characters
  • for the most part, getting rid of parameter/macro-based processing and instead sending data by constructing file upload bodies as part of a multipart POST payload. Parameters are only used for body metadata, not the data itself. Amongst other things, like the potential for full UTF-8 support, this change results in a very decent performance boost:

2 MB of data used to take the old parser a single thread at full blast for up to 15 seconds. Using the ('new') data step method it takes around 150 ms, with very little CPU and less memory.

before ===> image (https://cloud.githubusercontent.com/assets/11962123/24071822/08e85c4a-0bd3-11e7-9391-17fa6835fcf5.png) <=== after ===> image (https://cloud.githubusercontent.com/assets/11962123/24071840/54e6c76c-0bd3-11e7-9d63-c34b30030bb7.png) <===

(cpu is red, res mem is orange)
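The multipart idea being tested here, where the data travels as a file-upload body and parameters carry only metadata, can be sketched with stdlib Python. The field names and metadata keys below are hypothetical, not h54s's actual ones:

```python
import io
import json
import uuid

def build_multipart(metadata, rows):
    """Build a multipart/form-data body where the JSON data rides as a
    file part and only small metadata rides as ordinary form fields."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # small metadata as plain form fields (hypothetical field names)
    for name, value in metadata.items():
        part = (f'--{boundary}\r\n'
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f'{value}\r\n')
        buf.write(part.encode())
    # the actual data rows as a single file part
    head = (f'--{boundary}\r\n'
            'Content-Disposition: form-data; name="data"; filename="data.json"\r\n'
            'Content-Type: application/json\r\n\r\n')
    buf.write(head.encode())
    buf.write(json.dumps(rows).encode())
    buf.write(f'\r\n--{boundary}--\r\n'.encode())
    return f'multipart/form-data; boundary={boundary}', buf.getvalue()

content_type, body = build_multipart({'program': 'myService'}, [{'x': 1}])
```

The win is that the data never passes through URL-encoding or macro-parameter length limits; only the tiny metadata fields do.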

This looks like a parser performance boost of epic proportions, but I want to do some more real-world load testing with JMeter to see what the implications are for JVM and spawner stability. It's looking very, very good at the moment though; if nothing else, it massively reduces the startup cost of the deserialiser, which tends to be spun up with each request to the server.

Anecdotally, parsing the same JSON in Python:

import json

with open('data.json') as data_file:
    data = json.load(data_file)

...takes 3.5x longer than the new DS method.

On the downside, we're replacing our 'masked json' output macro with PROC JSON output, which only came about in 9.3. If anyone requires 9.2 support we can maintain a legacy branch that uses the old output method, it's not a huge deal.

PR to the development branch coming in the next few days. Been playing with this since October but we've just kicked off a project requiring proper UTF8 support, so...

vnevlev commented 7 years ago

That's good news on performance. One thing to add: PROC JSON wasn't in 9.3 proper, it arrived in M1 as experimental. I remember trying to use it on 9.3M1 and it wasn't doing a very good job.


boomskats commented 7 years ago

Interesting. We're running UTF8 integration tests, and while the majority of it works great, certain characters still trip PROC JSON up.

[image]

This was the last couple of characters it returned before it errored:

۝۞ۣ۟۠ۡۢۤ"

Seems like it's doing a better job, but it might just be that we need better tests...

boomskats commented 7 years ago

False alarm. UTF-8 is complicated :8ball:

boomskats commented 6 years ago

False false alarm, it's not that complicated. You just have to specify lengths to SAS in bytes rather than characters.
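The byte-vs-character distinction, in Python terms: SAS character lengths count bytes, so for multi-byte UTF-8 characters you need the length of the encoded string, not the character count.

```python
def sas_length(s):
    """Length to declare to SAS: bytes of the UTF-8 encoding,
    not the character count."""
    return len(s.encode('utf-8'))

name = 'Müller'
assert len(name) == 6         # 6 characters
assert sas_length(name) == 7  # 7 bytes: 'ü' is 2 bytes in UTF-8
```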

cj13579 commented 4 years ago

In the newly released 2.0 version of the adapter, we have moved away from parameters and have introduced a dependency on PROC JSON for serialising output. As per this discussion, we are aware that this introduces a dependency on SAS 9.4.

Developers wishing to use h54s to build applications on older versions of SAS will still be able to use v1.0 or v0.11.3 of the adapter.

Closing this issue.