01mf02 / jaq

A jq clone focussed on correctness, speed, and simplicity
MIT License

ER: .[] enhancements #47

Closed pkoppstein closed 1 year ago

pkoppstein commented 1 year ago

As best I can tell, there is no command-line JSON tool that can speedily and losslessly run the equivalent of .[] against arbitrarily large files containing a single JSON entity for which each value in the stream produced by .[] is relatively small. (*)

As a test case, consider 1e9.json generated by:

jq -nr '"[", (range(0;1E9) | "\(.),"), "0]"' > 1e9.json  # 10,888,888,895 bytes

(The final "0]" supplies a last array element, so that the trailing commas yield valid JSON.)

Interestingly:

- jq --stream takes over 2.5 hours to run the equivalent of .[]
- jaq uses much more memory than jq

I realize that jaq currently has no "streaming" ambitions, but it would be fantastic from many points of view if jaq could be enhanced to fill what seems to be the current void amongst command-line tools for JSON.

A closely related issue concerns enormous files with a single JSON object with a large number of keys, each value of which is relatively small. In such cases, one would like to be able to use a small-footprint version of to_entries[].

An alternative would be a built-in for economically producing single-key objects as if by keys_unsorted[] as $k | {($k): .[$k]}. Unfortunately I haven't been able to come up with a wonderful name for such a built-in. Perhaps singletons?
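For concreteness, here is the proposed behavior on a tiny object, using the jq expression above (the input is just an illustration):

echo '{"a":1,"b":[2,3],"c":{"d":4}}' | jq -c 'keys_unsorted[] as $k | {($k): .[$k]}'
{"a":1}
{"b":[2,3]}
{"c":{"d":4}}

A built-in could produce the same stream without first materializing the whole object.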

Thanks.


(*) jstream comes close, but it loses precision for both very large integers and various other numbers.

There are language-specific JSON libraries for programmers that have some support for streaming large files, but as best I can tell, they are either quite difficult to use for those who don't know the specific language, or do not handle numeric literals losslessly.

[CORRECTION:] "JSON Machine" is a PHP library that can be configured to work losslessly; that is what the PHP script I wrote at https://github.com/pkoppstein/jm does.

pkoppstein commented 1 year ago

Since opening this issue, I've managed to use "JSON Machine" to write a script which more or less meets the main requirements mentioned in the initial post in this thread. (See Version 0.0.4 or later of jm at https://github.com/halaxa/json-machine/issues/88.)

jm is faster than jq --stream (e.g. on a large JSON array: jm: 90 mins vs jq --stream: 150 mins) and has two additional advantages: numerical precision is preserved, and it's trivial to use.

01mf02 commented 1 year ago

While it is strange that jaq uses so much more memory (did you test it with jaq 0.8.2?), I do not have any idea how to improve that right now. I also believe it is better to use a tool other than jq here, as you did.

pkoppstein commented 1 year ago

Yes, I've tried jaq 0.8.2.

Please note that for a top-level array, low-level parsing for .[] should actually be very easy: it's basically just a matter of finding the beginning and end of each top-level entity in the array. Perhaps you might get some inspiration from github/json-machine/src/TokensWithDebugging.php.

It's easy to follow since it's just PHP, but that also makes it slower than one would like. And that's one of the reasons why I still have some hope that you might be interested :-)

01mf02 commented 1 year ago

I understand that a special case for .[] might seem easy and comfortable to implement at first sight. However, I am against special cases like that. Why? Because they can easily result in quite unpredictable behavior and bugs: we would have to ensure that the special case for .[] behaves exactly like the general case for .[], only faster and using less RAM. Also, we could have something like 0 as $x | .[]; then what do we do? Does the special case for .[] still apply? That means we would have to detect which filters do not access their input (because you could also write length as $x | .[], which would break the streaming behavior), and furthermore how many outputs a filter produces (because you could write (0, 1) as $x | .[], where it is not clear what to do --- parse the file twice?). The second property is even undecidable! This makes the code base much more complicated, and it certainly makes the behavior less predictable.
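To make the difficulty concrete (an illustrative example, not something jaq does): here the output of .[] depends on length, which requires reading the entire array before the first value can be emitted, so no streaming special case for .[] could apply:

echo '[10,20,30]' | jq -c 'length as $x | .[] + $x'
13
23
33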

I quote the third jaq goal, namely simplicity: "jaq aims to have a simple and small implementation, in order to reduce the potential for bugs and to facilitate contributions." I could add that another motivation for simplicity is to avoid surprises. Something like a special case for .[] would definitely surprise me if I did not know how it was implemented.

I hope that you understand. :)

01mf02 commented 1 year ago

My ambition is that jaq covers only JSON data where each single value fits into memory. Not more, not less.

pkoppstein commented 1 year ago

@01mf02 - Thanks for your responses. I've been unable to reply until now, but would still like to continue the discussion.

Since you've indicated that "big data" blobs of JSON are outside the scope of jaq, I'll pursue another tack. Given that there are such blobs, it's clear that a complementary tool is needed. As I've explained, although my script jm comes close, it is unsatisfactory for two main reasons, both tied to its PHP dependence: (1) it's (unnecessarily) slow; and (2) it's not easy to install.

Since you evidently have the experience and talent to develop a fast tool in Rust (thus overcoming both of the above-mentioned limitations), I still dare to hope that you might be interested, especially since its scope is extremely limited. Specifically, it would only be expected to handle a single JSON array or object as input:

- for a JSON array as input, it would behave like jq -c '.[]'
- for a JSON object as input, it would behave like jq -c 'to_entries[] | {(.key): .value}'

The only other significant requirement is that it should preserve numerical precision, at least for integers.

wader commented 1 year ago

@pkoppstein Interesting problem. I was playing around a bit with a hybrid of --stream and fromstream. Was "jq --stream takes over 2.5 hours to run the equivalent of .[]" something like the below?

# same as fromstream but with a prefix level, outputs [$prefix_path, $json_value], ...
# Usage:
# Test file https://github.com/json-iterator/test-data/blob/master/large-file.json
# cat big | jq -L . --stream -n 'include "stream"; reduce fromstream(inputs; 1) as $pv ({}; $pv as [$p, $v] | .[$v.type] += 1)'

def streaks_by(f; g):
  foreach ((g | ["v", .]), ["end"]) as $o (
    { acc: null
    , fv: null
    , emit: null
    };
    ( .emit = null
    | $o as [$t, $v]
    | ($v | f) as $fv
    | if $t == "end" then .emit = .acc
      elif .acc == null then .acc = [$v]
      else
        if $fv == .fv then .acc += [$v]
        else
          ( .emit = .acc
          | .acc = [$v]
          )
        end
      end
    | .fv = $fv
    );
    .emit // empty
  );

def fromstream(g; $level):
  ( streaks_by(
      .[0][0:$level];
      g
    )
  | [ .[0][0][0:$level]
    , fromstream(.[] | .[0] |= .[$level:])
    ]
  );

It's not very fast, but I wonder how hard it would be to build something like that "natively". Another, uglier but probably faster, approach would be a native version that reads directly (from stdin etc.) rather than via a JSON event stream.

pkoppstein commented 1 year ago

@wader - The jq --stream program I used is the one mentioned in the jq FAQ:

jq -cn --stream 'fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) )'
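For reference, here is what it produces on a small array (the same stream as .[]; a quick illustration, not from the FAQ):

echo '[{"a":1},[2,3],4]' | jq -cn --stream 'fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) )'
{"a":1}
[2,3]
4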

Anyway, it's well-known that jq's streaming parser skews the space-time tradeoff towards economizing space at the cost of speed.

Since jq is written in C, it might be worth mentioning the jsmn library: https://github.com/zserge/jsmn

01mf02 commented 1 year ago

@pkoppstein, I'm happy to tell you that I just created (a very rough first draft of) a utility that should directly address your use case! It is (for now) an example program contained in my new JSON parsing library hifijson. You can try it by cloning the hifijson repository and running

cargo run --release --example cat -- --path '[]' <FILE(S)>

The path can be something like a jq path; for example, you can use [1]["a", "b"][]. The nice thing is that this program runs with (nearly) constant memory usage and minimal latency. (To be precise, its maximal memory usage is the size of the largest string / number in the input.)

I tried it by generating a large file (85MB) via

jq -nr '"[", (range(0;10000000) | tostring + ","), "0]"' > bla.json

and ran my program via

cargo run --release --example cat -- --path '[]' bla.json

Unlike jq '.[]' bla.json, which took about 5 seconds before giving the first output, my new program gives you back the first value immediately.

I hope it is useful to you. Sorry for the rough edges, just finished it. It took me only one hour to write! ;)

pkoppstein commented 1 year ago


@pkoppstein, I'm happy to tell you that I just created (a very rough first draft of) a utility that should directly address your use case!

Thank you very much for letting me know.

One of the tests I ran against a large file (> 400MB) containing a single very long array (10^6 items) printed the first item snappily, but took a while to complete:

$ time cargo run --release --example cat -- --path '[0]' /Volumes/SSD1TB/tmp/a26properties.json
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/examples/cat --path '[0]' /Volumes/SSD1TB/tmp/a26properties.json`
{"a":0,"b":1,"c":2,"d":3,"e":4,"f":5,"g":6,"h":7,"i":8,"j":9,"k":10,"l":11,"m":12,"n":13,"o":14,"p":15,"q":16,"r":17,"s":18,"t":19,"u":20,"v":21,"w":22,"x":23,"y":24,"z":25}

real    0m2.971s
user    0m2.134s
sys     0m0.224s

I'm also puzzled by the following:

$ cargo run --release --example cat -- --path '[100]' <<< '{"a":1,"b":2}'
    Finished release [optimized] target(s) in 0.06s
     Running `target/release/examples/cat --path '[100]'`
1
2

Thank you again!


01mf02 commented 1 year ago

One of the tests I ran against a large file (> 400MB) with a very long (10^6) array printed the first item snappily, but took a while to complete: [...]

Do you know how long, for example, jq takes for the same task?

I'm also puzzled by the following: [...]

This is one of the "rough edges" I was talking about. In particular, you can mix numbers and strings in paths, for example [1, 2, "a", "b"]. In that case, given the input [1, 2, 3, 4], this would output 2 and 3 (because the path contains 1 and 2), whereas given {"a": 1, "b": 2, "c": 3}, it would output 1 and 2 (because the path contains "a" and "b"). In a nutshell, for arrays, only the numbers in the path are considered, whereas for objects, only the strings are considered. So when you have an array, say [1, 2, 3], but your path is ["a"], the path is interpreted as the empty path, [], because it contains no numbers and the input is an array.

How would you go about this? Should these mixed number-string paths be allowed?

pkoppstein commented 1 year ago

Do you know how long, for example, jq takes for the same task?

Yes. In this case (finding .[0]), jq with the --stream option can be made very fast:

time jq -cn --stream 'first(fromstream(1|truncate_stream(inputs)))' /Volumes/SSD1TB/tmp/a26properties.json
{"a":0,"b":1,"c":2,"d":3,"e":4,"f":5,"g":6,"h":7,"i":8,"j":9,"k":10,"l":11,"m":12,"n":13,"o":14,"p":15,"q":16,"r":17,"s":18,"t":19,"u":20,"v":21,"w":22,"x":23,"y":24,"z":25}

user    0m0.036s
sys 0m0.005s

Similarly for nth($n; _), provided $n is small.
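For example, to get .[2] of a huge top-level array (huge.json is a placeholder here), one can write the following; since nth is defined via limit, it can stop reading early, just as first does above:

jq -cn --stream 'nth(2; fromstream(1|truncate_stream(inputs)))' huge.json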

The problem with jq's streaming parser is that it insists on fully parsing the items it comes across. For example, suppose we only want .[100000]: jq will laboriously parse all of the prior items. Similarly, if we want to splat the entire array, a "smart splatter" would avoid actually parsing the items, and would instead use counters to find the relevant beginnings and endings.