SWI-Prolog / packages-http

The SWI-Prolog HTTP server and client libraries

json_read_dict high memory usage #125

Closed · yonatan closed 5 years ago

yonatan commented 5 years ago

Hi,

Is it normal for json_read_dict to use 300 MB of RAM when parsing a 17 MB JSON file? Or am I doing something wrong here?

Minimal test file: memtest.pl

:- use_module(library(http/json)).

json(X) :-
    open('d7.64-core.json', read, Stream), %% 17MB file
    json_read_dict(Stream, X),
    close(Stream).

Testing on Ubuntu 18.04 (note: /usr/bin/time is GNU time, not the bash built-in):

$ swipl --version
SWI-Prolog version 8.1.3 for x86_64-linux
$ /usr/bin/time -v -- swipl -g 'json(X)' -t halt memtest.pl
...
Maximum resident set size (kbytes): 296200
...
yonatan commented 5 years ago

If it matters: the JSON file I'm trying to read contains JavaScript ASTs. It looks something like this, but without all the whitespace:

{
  "type": "Program",
  "start": 0,
  "end": 476,
  "body": [
    {
      "type": "VariableDeclaration",
      "start": 179,
      "end": 389,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "start": 183,
          "end": 388,
          "id": {
            "type": "Identifier",
            "start": 183,
            "end": 187,
            "name": "tips"
          },
JanWielemaker commented 5 years ago

I'm not totally surprised. The data structure for a JSON object is fairly big. Notably, value strings are not cheap on a 64-bit machine: they reside on the stack between two guard cells, so a string of up to 7 characters takes 24 bytes, plus 8 bytes for the pointer to it, i.e. 32 bytes in total. Integers take 8 bytes. Keys are shared, so their cost mainly depends on the number of distinct keys.

More importantly though, the dict is (still) created after first parsing into the classical Prolog representation, which is even more expensive, so the creation process temporarily needs even more memory. Finally, the system's choice between garbage collecting and stack expansion can easily cause temporarily rather large stacks.
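
As a rough illustration of that last point, here is a sketch (assuming the memtest.pl above is loaded) that forces a garbage collection after parsing and then inspects the global stack, using the standard built-ins garbage_collect/0 and statistics/2:

?- json(X),
   garbage_collect,                 % reclaim parse-time garbage
   statistics(globalused, Bytes).   % bytes still in use on the global stack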

To know the real size, use term_size/2.
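
For example, a minimal sketch reusing the file from the report (term_size/2 reports cells; a cell is 8 bytes on a 64-bit build):

:- use_module(library(http/json)).

json_size(Bytes) :-
    setup_call_cleanup(
        open('d7.64-core.json', read, Stream),
        json_read_dict(Stream, Dict),
        close(Stream)),
    term_size(Dict, Cells),     % size of the parsed term in cells
    Bytes is Cells * 8.         % 8 bytes per cell on 64-bit builds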

You can use the option value_string_as(atom) to represent values as atoms rather than strings. If there are many duplicate values, this may save memory, but if most values are unique it will cost more. Also, the difference between true and "true" is lost.
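
For example (a sketch, again with the file name from the report):

?- open('d7.64-core.json', read, Stream),
   json_read_dict(Stream, Dict, [value_string_as(atom)]),
   close(Stream).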

P.S. Please use the forum for such questions.