Closed prggmr closed 12 years ago
It definitely sounds like there's a problem somewhere. I'd like to know a little bit more before I set about trying to replicate -- what database engine were you using? Did you try dropping down to the db driver and issuing raw SQL queries, and if so did this fix the problem? Were you iterating over a large number of rows from the database?
Thanks for reporting, I'm eager to look into this.
I was using MySQL (MySQLdb).
Dropping peewee and using MySQLdb itself with plain SQL queries and no ORM interaction has solved the issue.
As for the database table size: memory use grew slowly regardless of table size, but the process ultimately failed while running against a table with ~8 million records, with a total record set across the entire database of around ~12 million at the point of failure. Each record requires a minimum of 2 queries: a select, then an insert or update depending on whether the record exists.
From my previous experience the import would run for roughly 3 - 4 hours before consuming the server's entire 4GB of memory. Currently it is about to hit the 6 hour mark without using more than 0.3% of the server's memory, which is the expected behavior.
Thanks for the info. I'll see about doing some profiling.
Initial results -- I am not able to replicate your issue. Using python2.7, MySQLdb, peewee 0.9.4. Wrote a little script using guppy to do memory profiling. It reports memory usage before querying begins (after the initial objects have been loaded up), then again after every 100K iterations (it does not change). I was surprised by my results, so I modified the script to add items to a dictionary in the loop and observed memory growing, so I know that it's working right.
Here's my script:
```python
import random

from guppy import hpy

from peewee import *
from peewee import InsertQuery, UpdateQuery

db = MySQLDatabase('prof', user='root')
db.connect()

class TestModel(Model):
    data = CharField()

    class Meta:
        database = db

def main():
    TestModel.drop_table(True)
    TestModel.create_table()

    h = hpy()
    orig = num_records = 1000000
    every = 100000

    print 'Initial\n========'
    print h.heap()
    print '\n'

    while num_records:
        num_records -= 1
        data = str(random.randint(1, 100000))
        if TestModel.select().where(data=data).exists():
            InsertQuery(TestModel, data=data).execute()
        else:
            # this essentially no-ops but we'll call update anyways
            UpdateQuery(TestModel, data=data).where(data=data).execute()
        if num_records % every == 0:
            print 'After %d records:\n=================' % (orig - num_records)
            print h.heap()
            print '\n\n'
            #if raw_input('Continue? Yn ') == 'n':
            #    break

if __name__ == '__main__':
    main()
```
And here is a link to the output: https://gist.github.com/efadb58837aff2b64224
Can you think of anything I'm missing here?
Oh shit, realized I've got a bug... that first if block needs to be "if not" rather than "if". Will update after doing some more testing.
After changing it, and about 100K records in, still not seeing memory usage increase.
This may have already been addressed in a recent version, since it could be something as simple as a missed circular reference somewhere. Once I get some time I will do more analysis of everything and provide some more concrete data on the usage, and a way to easily replicate it if still possible.
I'm using 0.7 or 0.8 I believe ... downloaded back at the end of January.
Forgot to close this ... again, thanks for looking into it, and since it seems not to exist in the current version I may just need to update ... if time allows ...
@nwhitingx yeah, I am guessing there was a circular reference somewhere. I never got to the number of records you were experiencing failure at, but I also didn't see a steady increase in RAM usage. Please update the ticket or open a new one if updating your checkout doesn't fix it.
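For anyone curious how a missed circular reference turns into memory growth: CPython's reference counting alone cannot reclaim objects that refer to each other, so they accumulate until the cyclic garbage collector runs. A minimal stdlib illustration (a toy sketch, not peewee code):

```python
import gc

class Node(object):
    """Toy object used only to build reference cycles."""
    def __init__(self):
        self.ref = None

gc.disable()  # rely on reference counting alone, as a worst case

for _ in range(100):
    a, b = Node(), Node()
    a.ref, b.ref = b, a  # a -> b -> a: refcounts never reach zero
    del a, b             # the cycle is now unreachable but not freed

# The cyclic collector finds the stranded objects; collect()
# returns how many unreachable objects it encountered.
leaked = gc.collect()
gc.enable()
print(leaked)
```

In a long-running import loop like the one in this issue, cycles created per query would pile up between collector runs and show up as steadily climbing RSS.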
Did you have any luck with a newer version, or did you still run into memory issues?
I have not had the time to upgrade and test the newer version as I'm completely swamped with work right now. Once I get some time this is on my TODO list, as I'm very interested in finding where this leak was occurring.
I have this problem too. I am using peewee 3.11.2, PyMySQL 0.9.3 and Python 3.6. I have 20M records in my database, so to reduce RAM usage I query for 100K records at a time. The problem is that the records are all alike and should take the same amount of memory, but each query causes more RAM usage than the previous one (e.g. the first query uses 100MB of RAM but the 30th one uses 5GB).
Without code it's impossible to help. There are many ways to profile your memory usage.
Best recommendation is to use the `.iterator()` method of the query to get a "one-shot" iterator that doesn't cache rows in memory.
Hello,
I believe I have found a problem with the library causing excessive memory usage in long-running processes. While importing a very large XML file (80+GB), memory usage would climb at a very steady rate; after exhausting all other options for what was consuming the memory, I removed the peewee library and the memory consumption stopped.
I cannot say exactly where the consumption is taking place, but I was only using the `select`, `update` and `insert` functions in the process. Unfortunately I didn't have the time available to fully investigate what the problem is, but it does exist somewhere. During the import I was using the following snippet for extracting the data.
https://gist.github.com/2161849
The code for running the queries was identical to the following