adamritter / fastgron

High-performance JSON to GRON (greppable, flattened JSON) converter
MIT License
592 stars 10 forks source link

fastgron -u doesn't always output correct JSON #19

Closed adamritter closed 9 months ago

adamritter commented 9 months ago

fastgron: 8.5s 2.2G residentedit: Interestingly whilst doing this test, I piped the output into fastgron -u (39.5G resident) and jq rejected that. Will have to investigate further but it's a bit of a flaw if it can't rehydrate its own output into valid JSON.

https://news.ycombinator.com/reply?id=38534721&goto=threads%3Fid%3Dxiphias2%2338534721

adamritter commented 9 months ago

One thing that I know is that path parsing doesn't handle quoting well (if a map contains a key with " or ), but there might be other errors.

It would probably good to test on a diverse set of JSONs, compare with gron output and also compare with fastgron | fastrgon -u | fastgron output

rjp commented 9 months ago

Just came here to post this issue but I see my comment has beat me to it. If it helps, I can upload my source file somewhere (a Minecraft region file converted to JSON with fastnbt-tools, 102M when zstd'd).

Seems to be doubling up quotes in the output for some reason. fastgron creates the output correctly:

json[1024].block_entities[14].RecipesUsed["minecraft:cooked_cod"] = 142

But then fastgron -u outputs this:

        "RecipesUsed": {
          "minecraft:cooked_cod"": 142,

I did think that it might be the colon because that's the first occurrence of a key with the colon in the output - some minimal tests suggest it might be.

This one gets double-quoted through fastgron | fastgron -u.

{
  "RecipesUsed": {
    "minecraft:ghost": 1
  }
}

This one does not.

{
  "RecipesUsed": {
    "minecraft_ghost": 1
  }
}

This one only gets double-quoted on the key with the colon.

{
  "RecipesUsed": {
    "minecraft_ghost": 1,
    "minecraft:test": 2,
    "minecraft_after": 3
  }
}
adamritter commented 9 months ago

Thanks for helping!

I introduced some bugs that should be now fixed in v0.7.5 . Could you please try that version?

rjp commented 9 months ago

The double-quoting seems to be fixed looking at my minimal test cases.

Giving it a run through the same Minecraft JSON, though, I'm now getting: jq: parse error: Unfinished JSON term at EOF at line 174299223, column 0

Output actually has 174299222 lines and finishes with unclosed items.

            -1,
            -1,
            -1,
            -1,
            -1,
            -1,
            -1,
            -1,
            -1,

Uploaded it here - might just be because it's huge that it's causing problems. Certainly some of the smaller files I've tested have been ok.

https://rjp-hosted-files.s3.amazonaws.com/pp.json.zst

adamritter commented 9 months ago

It works for me using v0.7.5 :

./build/fastgron -V fastgron version 0.7.5

ls -l pp* -rw-r--r-- 1 adamritter staff 8997290281 Dec 5 23:40 pp.gron -rw-r--r-- 1 adamritter staff 821280774 Dec 5 23:38 pp.json -rw-r--r-- 1 adamritter staff 3207448348 Dec 5 23:42 pp2.json (base) ➜ ~/Documents/GitHub/fastgron/fastgron git:(main) ✗ less pp2.json (base) ➜ ~/Documents/GitHub/fastgron/fastgron git:(main) ✗ less pp. (base) ➜ ~/Documents/GitHub/fastgron/fastgron git:(main) ✗ cat pp2.json| fastgron > /dev/null (base) ➜ ~/Documents/GitHub/fastgron/fastgron git:(main) ✗ cat pp2.json| jq > /dev/null cat pp2.json 0.05s user 0.62s system 1% cpu 57.335 total jq > /dev/null 127.55s user 1.65s system 99% cpu 2:09.75 total

adamritter commented 9 months ago

It's very strange, it looks like ungron doesn't keep Heightmaps.MOTION_BLOCKING_NO_LEAVES, only Heightmaps.MOTION_BLOCKING sometimes.

A minimal example is here (m has to be a substring of m2 for this bug to appear):

json[0] = {} json[0].m = null json[0].m2 = null

fastgron --ungron pp01.gron [ { "m": null } ]

adamritter commented 9 months ago

I released v0.7.6 which both improves on --ungron time and fixes the bug which is a path optimization (m2 is not a subpath of m).

I did an md5sum to make sure that now --ungron and then fastgron on a sorted output gives the same result:

md5sum pp.gron cf8c0520c75d909596be7b87ecb4bd62 pp.gron (base) ➜ ~/Documents/GitHub/fastgron/fastgron git:(main) md5sum pp5.gron cf8c0520c75d909596be7b87ecb4bd62 pp5.gron

Could you please benchmark fastgron --ungron again? I hope it uses much less memory as well now with the fix.

rjp commented 9 months ago

Damn, beaten to the punch again, was just coming to post about missing keys although mine is LootTableSeed - there's 370 in the source JSON and only 339 after it's been through fastgron -u. BUT 0.7.6 reports 370.

If I pipe the source through jq -S | xxh64sum, I get the same as piping it through fastgron | fastgron -u | jq -S | xxh64sum which is perfect reconstruction.

rjp commented 9 months ago

On this M1 Pro, using pp.json.zst, /usr/bin/time -l reports 2.4G peak for fastgron and 19G peak for the fastgron -u. On my small linux server, it's still around 39G peak for fastgron -u (where it was for 0.7.4). May just be a difference in how macOS and Linux measure usage?

adamritter commented 9 months ago

Sounds great! I close this issue as fixed, but feel free to open new ones, or reply to my email if you have new ideas.

It's the memory fix in fastgron -u in the last version. 39G was suspicious, so I looked through the code, and I was passing an object by value instead of by reference, so it was copied, that took so much RAM.

The fix had a nice effect on speed as well, although I believe ungron could be still made significantly faster.