ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

How to read json records in chunks using ijson? #89

Closed Rstar1998 closed 1 year ago

Rstar1998 commented 1 year ago

I need to read a huge JSON file and insert it into MongoDB. I want to read the JSON records in chunks of 1 million (or any other number). How do I achieve this using ijson?

So I have a 2 GB JSON file which I need to load into a MongoDB database using Python. I used the following piece of code:

    import time

    import ijson
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    database = client['dfg']
    collection = database['xcv']

    start = time.time()

    # The list comprehension materialises every record in memory
    # before a single insert_many call
    with open("huge_json_data.json", "rb") as f:
        collection.insert_many([record for record in ijson.items(f, "item")])

    end = time.time()
    print(end - start)

    client.close()

The problem is that this process takes a huge amount of time and memory, since the whole 2 GB file is read into a list and handed to insert_many to load into MongoDB. Is it possible to read the file in chunks of 10000 records and insert them batch by batch? Like:

    with open("huge_json_data.json", "rb") as f:
        for chunk in ijson.items(f, "item", chunk_size=10000):
            collection.insert_many(chunk)

Feel free to correct me if I am following the wrong approach, or if there is any other solution by which I can solve my issue.

Data sample:

[
    {
        "item_id": 1,
        "temp": 2,
        "time": "2023-03-14 00:00:00",
        "item_list": [
            {
                "i_id": 0,
                "i_name": "",
                "i_brand": "",
                "i_l_category": "",
                "i_stock_qnty": 10,
                "s": 0.90
            },
            {
                "i_id": 1,
                "i_name": "",
                "i_brand": "",
                "i_l_category": "",
                "i_stock_qnty": 10,
                "score": 0.90
            }
        ]
    },
    ......... 100000 such records
]
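
For reference, ijson's "item" prefix matches each element of a top-level JSON array, so ijson.items(f, "item") already yields one record dict at a time rather than the whole list. A minimal self-contained illustration, using a made-up two-record sample instead of the real file:

    import io

    import ijson

    # Made-up sample mirroring the structure above
    sample = io.BytesIO(b'[{"item_id": 1, "temp": 2}, {"item_id": 2, "temp": 3}]')

    # "item" matches each element of the top-level array
    for record in ijson.items(sample, "item"):
        print(record["item_id"])  # prints 1, then 2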
rtobar commented 1 year ago

@Rstar1998 please follow the advice given in the template: share what you've tried, ask more precise questions, and hopefully also provide some example data. With such a broad description there's little help you can get.

Rstar1998 commented 1 year ago

@rtobar I have updated my description. Let me know if any more info is needed.

rtobar commented 1 year ago

Thanks @Rstar1998, that's much clearer now :-)

The problem is that you are creating a single list with all the results and then feeding it to MongoDB; that is what is causing the memory blow-up, not the ijson iteration itself. What you need is indeed to chunk the results from the ijson iteration and feed those chunks to MongoDB.

To answer your direct question: no, ijson doesn't offer chunking itself. The good news is that it doesn't really need to, as this is a simple and common task. You could for example use itertools.islice for that, which doesn't require much work. Something like this (adapted from the "batched" recipe at https://docs.python.org/3/library/itertools.html#itertools-recipes):

    from itertools import islice

    items = ijson.items(f, "item")
    while (batch := tuple(islice(items, n))):  # n = batch size, e.g. 10000
        # insert batch into MongoDB
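
Putting that together with the script from the question, a minimal untested sketch (it reuses the connection details and file name from above; use_float=True is passed because ijson yields decimal.Decimal for JSON floats by default, which PyMongo cannot encode to BSON):

    from itertools import islice

    import ijson
    from pymongo import MongoClient

    BATCH_SIZE = 10000  # records per insert_many call

    client = MongoClient("mongodb://localhost:27017/")
    collection = client["dfg"]["xcv"]

    with open("huge_json_data.json", "rb") as f:
        # use_float=True yields floats instead of decimal.Decimal
        items = ijson.items(f, "item", use_float=True)
        # Only one batch is held in memory at a time
        while batch := list(islice(items, BATCH_SIZE)):
            collection.insert_many(batch)

    client.close()

On Python 3.12+, itertools.batched(items, BATCH_SIZE) gives the same batching out of the box.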
Rstar1998 commented 1 year ago

@rtobar Thank you very much.