Closed prggmr closed 12 years ago
It definitely sounds like there's a problem somewhere. I'd like to know a little bit more before I set about trying to replicate -- what database engine were you using? Did you try dropping down to the db driver and issuing raw SQL queries, and if so did this fix the problem? Were you iterating over a large number of rows from the database?
Thanks for reporting, I'm eager to look into this.
I was using MySQL (MySQLdb).
Dropping peewee and using MySQLdb itself with plain SQL queries and no ORM interaction has solved the issue.
As for the database table size: memory use grew slowly regardless of table size, but the process ultimately failed while running against a table with ~8 million records, with a total record set across the entire database of around ~12 million at the point of failure. Each record requires a minimum of 2 queries: a select, then an insert or update depending on whether the record exists.
From my previous experience the import would run for roughly 3 - 4 hours before consuming the server's entire 4GB of memory. Currently it is about to hit the 6 hour mark without using more than 0.3% of the server's memory, which is the expected behavior.
Thanks for the info. I'll see about doing some profiling.
Initial results -- I am not able to replicate your issue. Using python2.7, MySQLdb, peewee 0.9.4. Wrote a little script using guppy to do memory profiling. It reports memory usage before querying begins (after the initial objects have been loaded up), then again after every 100K iterations (it does not change). I was surprised by my results, so I modified the script to add items to a dictionary in the loop and observed memory growing, so I know that it's working right.
Here's my script:
```python
import random

from guppy import hpy

from peewee import *
from peewee import InsertQuery, UpdateQuery

db = MySQLDatabase('prof', user='root')
db.connect()

class TestModel(Model):
    data = CharField()

    class Meta:
        database = db

def main():
    TestModel.drop_table(True)
    TestModel.create_table()

    h = hpy()
    orig = num_records = 1000000
    every = 100000

    print 'Initial\n========'
    print h.heap()
    print '\n'

    while num_records:
        num_records -= 1
        data = str(random.randint(1, 100000))
        if TestModel.select().where(data=data).exists():
            InsertQuery(TestModel, data=data).execute()
        else:
            # this essentially no-ops but we'll call update anyways
            UpdateQuery(TestModel, data=data).where(data=data).execute()
        if num_records % every == 0:
            print 'After %d records:\n=================' % (orig - num_records)
            print h.heap()
            print '\n\n'
            #if raw_input('Continue? Yn ') == 'n':
            #    break

if __name__ == '__main__':
    main()
```
And here is a link to the output: https://gist.github.com/efadb58837aff2b64224
Can you think of anything I'm missing here?
Oh shit, realized I've got a bug... that first if block needs to be "if not" rather than "if". Will update after doing some more testing.
After changing it, and about 100K records in, still not seeing memory usage increase.
This may have already been addressed in a recent version, since it could be something as simple as a missed circular reference somewhere. Once I get some time I will do more analysis of everything and provide some more concrete data on the usage, and a way to easily replicate it if still possible.
I'm using 0.7 or 0.8 I believe ... downloaded back at the end of January.
Forgot to close this ... again, thanks for looking into it, and since it seems not to exist in the current version I may just need to update ... if time allows ...
@nwhitingx yeah, I am guessing there was a circular reference somewhere. I never got to the number of records you were experiencing failure at, but I also didn't see a steady increase in RAM usage. Please update the ticket or open a new one if updating your checkout doesn't fix it.
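For anyone curious how a missed circular reference turns into memory growth: CPython's reference counting alone cannot reclaim objects that refer to each other, so they accumulate until the cyclic garbage collector runs. A minimal stdlib illustration (a toy sketch, not peewee code):

```python
import gc

class Node(object):
    """Toy object used only to build reference cycles."""
    def __init__(self):
        self.ref = None

gc.disable()  # rely on reference counting alone, as a worst case

for _ in range(100):
    a, b = Node(), Node()
    a.ref, b.ref = b, a  # a -> b -> a: refcounts never reach zero
    del a, b             # the cycle is now unreachable but not freed

# The cyclic collector finds the stranded objects; collect()
# returns how many unreachable objects it encountered.
leaked = gc.collect()
gc.enable()
print(leaked)
```

In a long-running import loop like the one in this issue, cycles created per query would pile up between collector runs and show up as steadily climbing RSS.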
Did you have any luck with a newer version, or did you still run into memory issues?
I have not had the time to upgrade and test the newer version as I'm completely swamped with work right now. Once I get some time this is on my TODO list, as I'm very interested in finding where this leak was occurring.
I have this problem too. I am using peewee 3.11.2, PyMySQL 0.9.3 and Python 3.6. I have 20M records in my database, so to reduce RAM usage I query for 100K records at a time. The problem is that the records are all alike and should take the same amount of memory, but each query causes more RAM usage than the previous one (e.g. the first query uses 100MB of RAM but the 30th one uses 5GB).
Without code it's impossible to help. There are many ways to profile your memory usage.
Best recommendation is to use the `.iterator()` method of the query to get a "one-shot" iterator that doesn't cache rows in memory.
Hello,
I believe I have found a problem with the library causing excessive memory usage in long-running processes. While importing a very large XML file (80+GB), memory usage would climb at a very steady rate; after exhausting all other options for what was consuming the memory, I removed the peewee library and the memory consumption stopped.
I cannot say exactly where the consumption is taking place, but I was only using the `select`, `update` and `insert` functions in the process. Unfortunately I didn't have the time available to fully investigate what the problem is, but it does exist somewhere. During the import I was using the following snippet for extracting the data.
https://gist.github.com/2161849
The code for running the queries was identical to the following