cerbero90 / json-parser

🧩 Zero-dependencies lazy parser to read JSON of any dimension and from any source in a memory-efficient way.
MIT License

Performance? #5

Open daniel-sc opened 4 months ago

daniel-sc commented 4 months ago

Detailed description

After using this library to iterate over large JSON responses from Guzzle (using the pointer /- on \Psr\Http\Message\StreamInterface), I noticed a significant reduction in memory usage (great!), but also a significant increase in CPU load. Is this expected? What are possible mitigation strategies? Would different options (chunk size, lazy pointers, ...) help?
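For context, the usage pattern being described looks roughly like this (a minimal sketch; the URL and variable names are placeholders, and pointer('/-') lazily selects each element of a top-level array):

```php
<?php

use Cerbero\JsonParser\JsonParser;
use GuzzleHttp\Client;

// Placeholder endpoint returning a large top-level JSON array.
$response = (new Client())->get('https://example.com/big-array.json');

// Iterate the array items lazily instead of decoding the whole body at once:
// only the item currently being processed is materialized in memory.
foreach (JsonParser::parse($response->getBody())->pointer('/-') as $item) {
    // process $item
}
```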

Context

Of course, the general "space-time tradeoff" implies some increase in CPU usage, but I did not expect it to be this significant.

Possible implementation

A section on this in the README would probably help, maybe even a comparison with the alternatives to see whether this library is on the same level.


cerbero90 commented 4 months ago

Hi @daniel-sc and thank you for your donation, it makes me happy that JSON Parser is having a positive impact on your project :)

As you mentioned, there is always a trade-off between memory and computation usage.

Would you be able to quantify your CPU usage before and after leveraging JSON Parser?

daniel-sc commented 4 months ago

I made a quick reproduction/example: https://github.com/daniel-sc/json-parser-performance-evaluation

There I see an almost 100x increase in CPU usage:

C:\manual_programs\php\php-8.1.9-nts-Win32-vs16-x64\php.exe C:\dev\json-parser-performance-evaluation\run.php
input size: 19241
json_decode
clock time sec: 0.0010809898376465
CPU time ms: 0 ms

JsonParser
clock time sec: 0.097460031509399
CPU time ms: 62.5 ms
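For reference, the measurement approach behind these numbers can be sketched like this (a simplified, hypothetical stand-in for run.php, not its actual code: wall-clock time via microtime() and CPU time via getrusage()):

```php
<?php
// Hypothetical measurement sketch, not the actual run.php:
// wall-clock time via microtime(), CPU time via getrusage().

function cpuTimeMs(): float
{
    $usage = getrusage();

    // user + system CPU time, converted to milliseconds
    return ($usage['ru_utime.tv_sec'] + $usage['ru_stime.tv_sec']) * 1000
         + ($usage['ru_utime.tv_usec'] + $usage['ru_stime.tv_usec']) / 1000;
}

$json = json_encode(array_fill(0, 1000, ['name' => 'a', 'value' => 1]));

$clock = microtime(true);
$cpu = cpuTimeMs();

json_decode($json, true);

printf("clock time sec: %s\n", microtime(true) - $clock);
printf("CPU time ms: %s ms\n", cpuTimeMs() - $cpu);
```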

cerbero90 commented 4 months ago

Thanks for the example @daniel-sc 👍

Yes, I can confirm that the increase in CPU usage is expected.

json_decode() is part of the JSON extension included in PHP. It decodes JSON in C internally, so it's much faster, but it needs to load the entire JSON into memory to work and will exhaust the available memory if the JSON to decode is too large.

On the other hand, JSON Parser parses 1 character at a time under the hood using PHP (and not C), so it needs more computation and is expectedly slower.
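To illustrate (this is not JSON Parser's actual code, just a sketch of what character-by-character scanning in userland PHP entails):

```php
<?php
// Illustrative sketch: every single byte of the input flows through
// PHP opcodes, which is where the extra CPU time goes.

$stream = fopen('large.json', 'rb'); // placeholder file name
$depth = 0;

while (($chunk = fread($stream, 8192)) !== false && $chunk !== '') {
    $length = strlen($chunk);

    for ($i = 0; $i < $length; $i++) {
        match ($chunk[$i]) {
            '{', '[' => $depth++, // track nesting one character at a time
            '}', ']' => $depth--,
            default => null,
        };
    }
}

fclose($stream);
```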

The end goal of JSON Parser is not to be fast or computationally cheap, but to be able to read any JSON thrown at it, regardless of its size or source. 👍

daniel-sc commented 4 months ago

@cerbero90 I understand that CPU performance is/was not the main goal. Nonetheless, I think it is relevant, especially when handling large inputs.

It seems the alternatives have optimized this a little more:

json_decode:
memory peak: 1.234375 kB
clock time sec: 0.20915102958679
CPU time ms: 125 ms

JsonParser:
memory peak: 3724.5859375 kB
clock time sec: 11.772974967957
CPU time ms: 5203.125 ms

json-machine:
memory peak: 214.3359375 kB
clock time sec: 2.2307989597321
CPU time ms: 859.375 ms

jsonstreamingparser:
memory peak: 78.609375 kB
clock time sec: 6.5879240036011
CPU time ms: 3046.875 ms

Of course, I'd totally understand, if you have no time to go down this rabbit hole :D

cerbero90 commented 4 months ago

Thanks for looking into this @daniel-sc :)

Looking at your benchmark script makes me wonder whether parsing a large JSON once would be a more realistic use case than parsing a small JSON 1,000 times 👍

When building JSON Parser, I noticed that PHP has a pronounced function call overhead. In normal applications we don't have to worry about it, but in low-level logic - like parsing JSON - it is quite significant.
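To make that overhead concrete, here is a hypothetical micro-benchmark (names and numbers are illustrative, not taken from JSON Parser):

```php
<?php
// Hypothetical micro-benchmark: the same per-character work is
// measurably slower when routed through a function call.

$json = str_repeat('{"a":1},', 100_000);
$length = strlen($json);

$start = hrtime(true);
for ($i = 0; $i < $length; $i++) {
    $char = $json[$i]; // inline byte access
}
printf("inline: %.1f ms\n", (hrtime(true) - $start) / 1e6);

// Illustrative helper doing the identical work behind a call.
function charAt(string $json, int $i): string
{
    return $json[$i];
}

$start = hrtime(true);
for ($i = 0; $i < $length; $i++) {
    $char = charAt($json, $i);
}
printf("function call: %.1f ms\n", (hrtime(true) - $start) / 1e6);
```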

JSON Machine, for example, manages to use fewer resources by avoiding method calls as much as possible. That is an effective way to deal with the overhead, but it comes at the expense of readability, extensibility and maintainability.

In short, we may need to find a way to mitigate the resource usage while preserving a good software architecture.

I appreciate your help and your interest, Daniel! Hopefully I'll find some time to profile JSON Parser and find potential bottlenecks.

Meanwhile, if you could recalibrate your benchmark, that would definitely help to spot unexpected resource usage 👍

daniel-sc commented 4 months ago

I updated the comparison for longer inputs. Run it with php run.php 10 50 to iterate 10 times with the example concatenated 50 times. The duration per byte seems relatively stable.