ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

Include array index #107

Open Minitour opened 10 months ago

Minitour commented 10 months ago

Is your feature request related to a problem? Please describe.

I am streaming values from a large JSON file into a dataframe, but I cannot group related items together because the prefix does not say which array element a value belongs to.

Describe the solution you'd like

Currently every nested value gets the prefix A.item.B.item.C, which is repeated many times and does not identify the array element it came from. It would be great to have something like A.[0].B.[0].C instead.

For example for the following object:

{
   "A": [
         {
              "B": [
                     { "C": "Test-1" },
                     { "C": "Test-2" },
                     { "C": "Test-3" }
               ]
         },
         {
              "B": [
                     { "C": "Test-4" },
                     { "C": "Test-5" },
                     { "C": "Test-6" }
               ]
         }
   ]
}

I would expect to see the following events:

Prefix          Event    Value
A.[0].B.[0].C   string   Test-1
A.[0].B.[1].C   string   Test-2
A.[0].B.[2].C   string   Test-3
A.[1].B.[0].C   string   Test-4
A.[1].B.[1].C   string   Test-5
A.[1].B.[2].C   string   Test-6

Describe alternatives you've considered

N/A

Minitour commented 10 months ago

Update:

I hacked something together real quick by modifying common.py:

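from ijson import utils  # needed only if this snippet lives outside common.py


class Index:
    """Mutable position counter for the array currently being parsed."""

    def __init__(self, initial_value=-1):
        # Start below zero: the counter is incremented before each element
        # is reported, so the first element gets index 0.
        self._value = initial_value

    def increment(self):
        self._value += 1

    def decrement(self):
        self._value -= 1

    def __str__(self):
        return f'{self._value}'


@utils.coroutine
def parse_basecoro(target):
    path = []
    while True:
        event, value = yield
        # Every event that starts a new value (map, array or scalar) moves
        # the enclosing array, if there is one, on to its next element.
        if event not in ('map_key', 'end_map', 'end_array'):
            if path and isinstance(path[-1], Index):
                path[-1].increment()
        if event == 'map_key':
            prefix = '.'.join(map(str, path[:-1]))
            path[-1] = value
        elif event == 'start_map':
            prefix = '.'.join(map(str, path))
            path.append(None)  # placeholder until the first map_key arrives
        elif event == 'end_map':
            path.pop()
            prefix = '.'.join(map(str, path))
        elif event == 'start_array':
            prefix = '.'.join(map(str, path))
            path.append(Index())
        elif event == 'end_array':
            path.pop()
            prefix = '.'.join(map(str, path))
        else:  # any scalar value
            prefix = '.'.join(map(str, path))
        target.send((prefix, event, value))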

Although it is not the best solution, it certainly achieves what I am looking for. Please consider adding something similar, but in the meantime, I will be using patch to monkey-patch the library.
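For anyone who would rather not patch the library, the same idea can probably be sketched as a plain generator wrapped around the low-level event stream. This is only an illustration: indexed_parse is a hypothetical helper, not part of ijson, and it assumes its input is an iterable of (event, value) pairs in the shape that ijson.basic_parse produces. The sample events are written by hand so the snippet runs without ijson installed.

```python
def indexed_parse(basic_events):
    """Yield (prefix, event, value) with array positions instead of 'item'.

    Hypothetical helper, not ijson API. `basic_events` is any iterable of
    (event, value) pairs shaped like the output of ijson.basic_parse.
    """
    # path holds str for map keys, int for array positions, and None for
    # a map whose first key has not been seen yet.
    path = []
    for event, value in basic_events:
        if event == 'map_key':
            prefix = '.'.join(map(str, path[:-1]))
            path[-1] = value
        elif event in ('end_map', 'end_array'):
            path.pop()
            prefix = '.'.join(map(str, path))
        else:
            # A new value starts here; if it is an array element,
            # advance the enclosing array's position first.
            if path and isinstance(path[-1], int):
                path[-1] += 1
            prefix = '.'.join(map(str, path))
            if event == 'start_map':
                path.append(None)
            elif event == 'start_array':
                path.append(-1)  # becomes 0 when the first element starts
        yield prefix, event, value


# Hand-written events for {"A": [{"C": "x"}, {"C": "y"}]}.
events = [
    ('start_map', None),
    ('map_key', 'A'),
    ('start_array', None),
    ('start_map', None),
    ('map_key', 'C'),
    ('string', 'x'),
    ('end_map', None),
    ('start_map', None),
    ('map_key', 'C'),
    ('string', 'y'),
    ('end_map', None),
    ('end_array', None),
    ('end_map', None),
]

for prefix, event, value in indexed_parse(events):
    if event == 'string':
        print(prefix, event, value)
# A.0.C string x
# A.1.C string y
```

With a real file, the input would be something like indexed_parse(ijson.basic_parse(f)), which keeps the library itself untouched.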

rtobar commented 9 months ago

Hi @Minitour, thanks for taking an interest in improving ijson!

I think the idea is good in principle, but the suggested implementation is not going to fly. In particular:

If I implemented this, I'd do it at the items/kvitems level, where you could interpret the [n]s in the given prefix and match them to the nth appearance of item in the underlying path. Also, maybe instead of a.b.[0].c one could simply have a.b.0.c? The brackets seem unnecessary.
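To make the matching idea a bit more concrete, here is a rough sketch of how a prefix with numeric components could be compared against the current position in the document. It is not ijson code: prefix_matches is a hypothetical name, and it assumes the path is tracked as a list of map keys (str) and array positions (int). The existing item component keeps its meaning of "any element", while a number selects only that element.

```python
def prefix_matches(pattern, path):
    """Return True if `pattern` selects `path`.

    Hypothetical helper, not ijson API. `path` is a list of map keys (str)
    and array positions (int). In `pattern`, 'item' matches any array
    position and a run of digits matches only that position; every other
    component must equal the map key at that depth.
    """
    parts = pattern.split('.') if pattern else []
    if len(parts) != len(path):
        return False
    for part, step in zip(parts, path):
        if isinstance(step, int):
            # Array position: accept the 'item' wildcard or the exact index.
            if part != 'item' and not (part.isdecimal() and int(part) == step):
                return False
        elif part != step:
            # Map key: must match literally.
            return False
    return True


print(prefix_matches('A.item.B.item.C', ['A', 0, 'B', 2, 'C']))  # True
print(prefix_matches('A.1.B.item.C', ['A', 1, 'B', 0, 'C']))     # True
print(prefix_matches('A.1.B.item.C', ['A', 0, 'B', 0, 'C']))     # False
```

One caveat with dropping the brackets: a pattern component such as 0 then matches both array position 0 and a map key that happens to be named "0", so the two cases cannot be told apart from the prefix alone.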

In any case, I'm in no hurry to implement this. Maybe if more people somehow upvote this I could give it some attention. It would also be an incentive if someone (you?) presented a modified version of items_basecoro that understood these numeric indices as indicated above, hopefully with tests -- then we could iterate into a final solution that covered all backends.