cloudant / python-cloudant

A Python library for Cloudant and CouchDB
Apache License 2.0

MemoryError #295

Closed farhankhwaja closed 7 years ago

farhankhwaja commented 7 years ago


I was trying to download all the documents from my DB, which has 1.3 million documents. I was able to download 610,000 documents before hitting the error below.

import csv

from cloudant.client import Cloudant

if __name__ == "__main__":
    client = Cloudant(USERNAME, PASSWORD, account=USERNAME)
    client.connect()
    myDB = client[DB]

    # Binary mode so the csv module doesn't emit blank rows on Windows (Python 2).
    csvFile = csv.writer(open("myDBData.csv", "wb+"))

    for i, document in enumerate(myDB):
        try:
            if document["X"] is not None:
                csvFile.writerow([document["X"]])
            else:
                csvFile.writerow([""])

            if (i + 1) % 10000 == 0:
                print i + 1
        except Exception:
            print document
            break

ERROR

Traceback (most recent call last):
  File "myDBDataFetch.py", line 15, in <module>
    for i, document in enumerate(myDB):
  File "D:\Python2\lib\site-packages\cloudant\database.py", line 631, in __iter__
    startkey=next_startkey
  File "D:\Python2\lib\site-packages\cloudant\database.py", line 389, in all_docs
    return resp.json()
  File "D:\Python2\lib\site-packages\requests\models.py", line 826, in json
    return complexjson.loads(self.text, **kwargs)
  File "D:\Python2\lib\site-packages\requests\models.py", line 791, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

Can anyone help me with this?

alfinkel commented 7 years ago

Iterating over the database object myDB accomplishes two things:

  1. It retrieves all of the documents from the remote database. (desired)
  2. It also caches each retrieved document in the local myDB object. Remember that myDB is, at its core, a dict with some extra functionality added in. (probably not desired in your case)

While retrieving your documents is obviously the desired behavior here, I think that caching them locally is what causes your eventual MemoryError: every document fetched stays in the myDB dict for the life of the object, as the sketch below illustrates.
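A minimal sketch of that side effect, assuming a connected client and a freshly obtained database object (the counts shown are hypothetical):

print len(myDB)   # 0 -- nothing cached locally yet

for document in myDB:
    pass          # each fetched document is also stored in the local dict

print len(myDB)   # now roughly the remote document count, all held in memory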

My suggestion is to iterate over a Result object instead. Doing this gives you the behavior of bullet 1 with none of the side effects of bullet 2. There are two ways you can do this:

Via the database custom_result context manager, for example:

with myDB.custom_result(include_docs=True) as results:
    for result in results:
        ...

Via a Result object directly (note the extra import), for example:

from cloudant.result import Result

results = Result(myDB.all_docs, include_docs=True)
for result in results:
    ...

These two approaches do essentially the same thing. Have a look at the custom_result and Result documentation if you are interested in the specifics.
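For reference, here is one way your original script could be restructured around a Result; a minimal sketch, assuming the same USERNAME, PASSWORD, and DB values as your snippet and that each CSV row should hold a document's X field. With include_docs=True, each result row carries the full document under its "doc" key:

import csv

from cloudant.client import Cloudant
from cloudant.result import Result

if __name__ == "__main__":
    client = Cloudant(USERNAME, PASSWORD, account=USERNAME)
    client.connect()
    myDB = client[DB]

    csvFile = csv.writer(open("myDBData.csv", "wb+"))

    # Iterating over a Result streams the documents without storing
    # them in the myDB dict.
    for i, result in enumerate(Result(myDB.all_docs, include_docs=True)):
        document = result["doc"]  # full document, since include_docs=True
        if document.get("X") is not None:
            csvFile.writerow([document["X"]])
        else:
            csvFile.writerow([""])

        if (i + 1) % 10000 == 0:
            print i + 1

    client.disconnect()

Because nothing is cached between iterations, memory use should stay roughly constant regardless of how many documents the database holds.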

I hope that resolves your memory issue.