ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

Nested structure reading #78

Closed adam-mrozik closed 2 years ago

adam-mrozik commented 2 years ago

Hey,

I have a problem which I am trying to solve with ijson. Let's say I have this kind of file:

{
"planet": "earth",
"countries": [
{ "country": "usa", "continent": "NA", "cities": [ {"city": "New York", ... some other city_properties}, ...other_cities]}, ...other_countries
]
}

So, multiple countries and each has multiple cities.

And I want to submit them to a database as flat rows, e.g.:

{"country": "usa", "city": "New York", "planet": "earth", ...}

I see a few approaches, but each has its issues (mostly due to my unconventional data structure):

Solution 1:

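import ijson
from urllib.request import urlopen

def rows():
    f = urlopen('http://file.json')
    # 'countries.item' yields one country object at a time
    countries = ijson.items(f, 'countries.item')
    # second ijson.items call on the same stream f (the line in question)
    planet = ijson.items(f, 'planet')
    data = {}
    data['planet'] = planet
    for country in countries:
        # append country-specific data like continent
        for city in country['cities']:
            # append city-specific data

            # yield a single "row"
            yield data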

This solution only helps partially: although each country is parsed separately, a single country can still have a lot of cities, which makes streaming less useful. Also, I am not sure what the performance of the planet = ijson.items(f, 'planet') line would be if the planet key were at the end of the file. Would it traverse the file one more time?

Solution 2: Parsing

import ijson
from urllib.request import urlopen

def rows(url):
    f = urlopen(url)
    # ijson.parse takes the file object; it yields (prefix, event, value) tuples
    country_data = ijson.parse(f)
    data = {}
    for prefix, event, value in country_data:
        if prefix == "planet":
            data['planet'] = value
        if prefix == "countries.item.continent":
            data['continent'] = value

        ... # add some other flat data like above

        # When one city ends, another begins. This is where the row should be yielded
        if prefix == "countries.item.cities.item" and event == "end_map":
            # yield a copy so later updates do not alter rows already emitted
            yield dict(data)

In this solution, ijson traverses the file and I can easily pinpoint the place at which a new city row should be emitted. However, what if the top-level planet field is not at the beginning, but at the end? In that case this solution does not work, because the parser will not see it until the end of the file. Ideally in this example, keys would be ordered so that both countries and cities are at the bottom of their respective levels, but I do not see a way to ensure that without knowing the key order in advance.

rtobar commented 2 years ago

I am not sure what the performance of the planet = ijson.items(f, 'planet') line would be if the planet key were at the end of the file. Would it traverse the file one more time?

Indeed. You'd need to fetch the content twice, or at least multiplex it into the two different ijson.items calls, because each call to ijson.items fully consumes the stream it is given. For that reason you'll probably want to go with Solution 2.

[Solution 2] However, what if the top-level planet field is not at the beginning, but at the end?

Indeed, that would be the worst-case scenario. If you want this information to appear on each of the flat records you store in your database, then you'd have to accumulate all the records in memory and wait for planet to appear, so that you can update the records and then put them in the database. An alternative approach, if your database schema allows it, would be to write the city records without a planet, and issue an UPDATE to set their planet when you finally find it.
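For illustration, the accumulate-and-wait option could look roughly like the sketch below. It is only a sketch under some assumptions: the function name flat_rows and the constants are made up for this example, it works on the (prefix, event, value) tuples that ijson.parse produces, and it assumes that a country's scalar fields (like continent) appear before its cities array. If planet shows up early nothing is buffered; if it shows up last, every row waits in memory until then.

```python
# Scalar event names that ijson.parse may emit for leaf values.
SCALARS = {"string", "number", "integer", "double", "boolean", "null"}
COUNTRY = "countries.item"
CITY = "countries.item.cities.item"


def flat_rows(events):
    """Yield one flat dict per city from (prefix, event, value) tuples.

    Rows completed before "planet" is seen are held in memory and
    released as soon as its value arrives.
    """
    planet_seen = False
    planet = None
    pending = []   # rows still waiting for the planet value
    country = {}   # scalar fields of the current country
    city = {}      # scalar fields of the current city

    for prefix, event, value in events:
        if prefix == "planet" and event in SCALARS:
            planet, planet_seen = value, True
            for row in pending:
                row["planet"] = planet
                yield row
            pending = []
        elif prefix == COUNTRY and event == "start_map":
            country = {}
        elif prefix == CITY and event == "start_map":
            city = {}
        elif prefix == CITY and event == "end_map":
            row = {**country, **city}
            if planet_seen:
                row["planet"] = planet
                yield row
            else:
                pending.append(row)
        elif event in SCALARS:
            # Split "a.b.key" into its parent prefix and the key name.
            parent, _, key = prefix.rpartition(".")
            if parent == CITY:
                city[key] = value
            elif parent == COUNTRY:
                country[key] = value

    # "planet" never appeared: emit the buffered rows without it.
    yield from pending
```

With a real file this would be driven as flat_rows(ijson.parse(f)), so the stream is still read only once.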

As you noticed, this is not a problem with ijson itself, but with the fact that your document isn't well suited for streaming. Still, if you can write planet-less records and then issue an update, you should be good to go.
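A rough sketch of the write-then-UPDATE idea, using sqlite3 from the standard library purely as a stand-in for whatever database you actually use. The table name cities, its columns, and the function name load_rows are assumptions made for this example; the events argument is the (prefix, event, value) stream that ijson.parse produces.

```python
import sqlite3


def load_rows(conn, events):
    """Insert one row per city; fill in planet later if it arrives late."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities (country TEXT, city TEXT, planet TEXT)"
    )
    planet = None   # stays None (NULL in the table) until "planet" is seen
    country = None
    city = None
    for prefix, event, value in events:
        if prefix == "planet" and event == "string":
            planet = value
            # Patch every row that was written before planet was known.
            conn.execute(
                "UPDATE cities SET planet = ? WHERE planet IS NULL", (planet,)
            )
        elif prefix == "countries.item.country":
            country = value
        elif prefix == "countries.item.cities.item.city":
            city = value
        elif prefix == "countries.item.cities.item" and event == "end_map":
            # One city object just ended: write its row right away.
            conn.execute(
                "INSERT INTO cities (country, city, planet) VALUES (?, ?, ?)",
                (country, city, planet),
            )
    conn.commit()
```

This keeps memory use flat regardless of where planet sits in the file, at the cost of one extra UPDATE statement. It would be called as load_rows(conn, ijson.parse(f)).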

rtobar commented 2 years ago

Closing since this was answered a long time ago.