ICRAR / ijson

Iterative JSON parser with Pythonic interfaces

http://pypi.python.org/pypi/ijson/

Include array index #107

Open Minitour opened 11 months ago

Minitour commented 11 months ago

Is your feature request related to a problem? Please describe.

I am streaming values from a large JSON file into a dataframe, but I cannot group related items together because the event prefix does not say which array element a value belongs to.

Describe the solution you'd like

Currently the prefix has the form A.item.B.item.C, which is identical for every element of the arrays and can therefore be repeated many times. It would be great to have something like A.[0].B.[0].C instead.

For example, for the following object:

{
   "A": [
         {
              "B": [
                     { "C": "Test-1" },
                     { "C": "Test-2" },
                     { "C": "Test-3" }
               ]
         },
         {
              "B": [
                     { "C": "Test-4" },
                     { "C": "Test-5" },
                     { "C": "Test-6" }
               ]
         }
   ]
}

I would expect to see the following events:

| Prefix | Event | Value |
|---|---|---|
| `A.[0].B.[0].C` | `string` | `Test-1` |
| `A.[0].B.[1].C` | `string` | `Test-2` |
| `A.[0].B.[2].C` | `string` | `Test-3` |
| `A.[1].B.[0].C` | `string` | `Test-4` |
| `A.[1].B.[1].C` | `string` | `Test-5` |
| `A.[1].B.[2].C` | `string` | `Test-6` |

Describe alternatives you've considered

N/A

Minitour commented 11 months ago

Update:

I quickly hacked something together by modifying common.py:

from ijson import utils  # common.py already imports this


class Index:
    """Mutable counter holding the current position inside an array."""

    def __init__(self, initial_value=-1):
        # Start at -1 so that the first element is reported as index 0
        self._value = initial_value

    def increment(self):
        self._value += 1

    def __str__(self):
        return f'{self._value}'

@utils.coroutine
def parse_basecoro(target):
    path = []
    while True:
        event, value = yield
        # Every value that starts directly inside an array (a map, a nested
        # array or a scalar) advances that array's index
        if (event not in ('map_key', 'end_map', 'end_array')
                and path and isinstance(path[-1], Index)):
            path[-1].increment()
        if event == 'map_key':
            prefix = '.'.join(map(str, path[:-1]))
            path[-1] = value
        elif event == 'start_map':
            prefix = '.'.join(map(str, path))
            path.append(None)  # placeholder until the first map_key arrives
        elif event == 'end_map':
            path.pop()
            prefix = '.'.join(map(str, path))
        elif event == 'start_array':
            prefix = '.'.join(map(str, path))
            path.append(Index())  # replaces the fixed 'item' component
        elif event == 'end_array':
            path.pop()
            prefix = '.'.join(map(str, path))
        else:  # any scalar value
            prefix = '.'.join(map(str, path))
        target.send((prefix, event, value))

Although it is not the best solution, it certainly achieves what I am looking for. Please consider adding something similar, but in the meantime, I will be using patch to monkey-patch the library.

rtobar commented 11 months ago

Hi @Minitour, thanks for taking an interest in improving ijson!

I think the idea is good in principle, but the suggested implementation is not going to fly. In particular:

- It breaks code for users of the items and kvitems calls, and that's an absolute no.
- It also breaks code for users of the parser calls, and that's also an absolute no.
- I'm not sure if you're aware, but modifying common.py applies the changes to all backends except yajl2_c, which is the default one (because it's 10x faster than the next one in the list).

If I implemented this, I'd do it at the items/kvitems level, where you could interpret the [n]s in the given prefix and match them to the nth appearance of item in the underlying path. Also, maybe instead of a.b.[0].c one could simply have a.b.0.c? The brackets seem unnecessary.

In any case, I'm in no hurry to implement this. Maybe if more people somehow upvote this I could give it some attention. It would also be an incentive if someone (you?) presented a modified version of items_basecoro that understood these numeric indices as indicated above, hopefully with tests -- then we could iterate into a final solution that covered all backends.
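To make the first half of that idea concrete, here is a rough sketch of how the indices in a user-supplied prefix could be interpreted. `split_indexed_prefix` is a hypothetical helper, not part of ijson. It accepts both the bracketed (`A.[0].B`) and the bare (`a.b.0.c`) spelling, and it assumes that a purely numeric path part always means an array position, so a map key that is itself a number (e.g. "0") would be misread.

```python
import re

# A path part that is an array position, written either as [3] or as 3.
_INDEX_PART = re.compile(r"\[(\d+)\]|(\d+)")


def split_indexed_prefix(prefix):
    """Split an indexed prefix into an ijson-style prefix and its indices.

    Returns a tuple (plain_prefix, indices): plain_prefix has every index
    replaced by 'item', and indices lists the requested array positions in
    the order in which they appear.
    """
    plain_parts = []
    indices = []
    for part in prefix.split("."):
        match = _INDEX_PART.fullmatch(part)
        if match:
            # Exactly one of the two groups is set, depending on the spelling.
            indices.append(int(match.group(1) or match.group(2)))
            plain_parts.append("item")
        else:
            plain_parts.append(part)
    return ".".join(plain_parts), indices
```

The plain prefix is what items and kvitems already understand today. The second half, comparing the list of indices against per-array counters while the events stream through items_basecoro, is not shown here.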

doerwalter commented 1 month ago

There's an RFC for specifying a path through a JSON object: RFC 6901, "JSON Pointer" (https://www.rfc-editor.org/rfc/rfc6901).

The JSON Pointer syntax for the example A.[0].B.[0].C above is /A/0/B/0/C.

It would be great to have support for that, but that would probably have to be via new functions.
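For illustration, rendering a path as a JSON Pointer takes only a few lines. `to_json_pointer` is a hypothetical name, not an ijson function. The escaping follows RFC 6901: "~" becomes "~0" and "/" becomes "~1".

```python
def to_json_pointer(path):
    """Render a sequence of map keys and array indices as an RFC 6901 pointer."""
    tokens = []
    for step in path:
        token = str(step)
        # Escape "~" first, then "/", so the "~" in "~1" is not escaped again.
        token = token.replace("~", "~0").replace("/", "~1")
        tokens.append("/" + token)
    # The empty path yields the empty string, which refers to the whole document.
    return "".join(tokens)
```

Called with the keys and indices of the example above, `to_json_pointer(["A", 0, "B", 0, "C"])` gives `/A/0/B/0/C`.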

I myself would prefer a different form of path info: simply a list of the keys needed to get from the root to the node in question, i.e.

["A", 0, "B", 0, "C"]
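Such paths can be produced outside the library, without touching common.py or any backend. The sketch below assumes its input is an iterable of (event, value) pairs using ijson's basic event names (start_map, map_key, end_map, start_array, end_array, plus scalar events such as string or number). `events_with_path` is a hypothetical name, and reporting the enclosing map's path for map_key events is a choice made here to mirror how ijson prefixes behave.

```python
def events_with_path(basic_events):
    """Yield (path, event, value) for each basic (event, value) pair.

    path is a tuple of map keys (str) and array positions (int) leading
    from the root to the node the event belongs to.
    """
    path = []  # str = map key, int = array position, None = map without a key yet
    for event, value in basic_events:
        if event == "map_key":
            node = tuple(path[:-1])  # path of the enclosing map
            path[-1] = value
        else:
            if event in ("end_map", "end_array"):
                path.pop()  # drop the container's key/index slot
            elif path and isinstance(path[-1], int):
                path[-1] += 1  # a new element starts inside an array
            node = tuple(path)
        yield node, event, value
        if event == "start_map":
            path.append(None)
        elif event == "start_array":
            path.append(-1)  # becomes 0 when the first element starts
```

With ijson installed, `events_with_path(ijson.basic_parse(f))` should work with any backend, since basic_parse yields pairs of this shape. Keeping indices as integers also keeps the index 0 distinct from a map key "0", which a dotted prefix cannot express. A dotted form is still one join away: `".".join(map(str, path))`.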