aerospike / aerospike-client-python

Aerospike Python Client
Apache License 2.0
134 stars 111 forks source link

Possible memory leak in Python driver ordered map insert #142

Closed vegabook closed 7 years ago

vegabook commented 7 years ago

Refers to this Community board entry: https://discuss.aerospike.com/t/large-memory-usage-by-python-driver-on-big-batch-insert/3819

There appears to be growing memory usage, an order of magnitude greater than the space occupied by the data itself, by the Python driver when inserting into an ordered map. Here is a simplified version of the code from the Community forum link above. This will insert 100 million floats, batched by a ticker code and an integer, into ordered maps. The memory used by this piece of code grows enormously as the inserts progress. After 10 million inserts, AMC shows Aerospike test db using only 208MB, whereas this piece of code's memory footprint has grown to 1.5GB. And it continues upwards from there.

UPDATE: This appears only to affect the python 3 driver. Tested with version 2.0.6.

This code will work with both python 2.7 and 3.x. Just need to install numpy and psutil.



from __future__ import print_function
import aerospike
import numpy as np
from datetime import datetime
import os
import psutil   

if __name__ == "__main__":
    thisprocess = psutil.Process(os.getpid())
    config = {
        'hosts': [ ('localhost', 3000) ]
    }
    client = aerospike.client(config).connect()
    map_policy={'map_order':aerospike.MAP_KEY_VALUE_ORDERED}
    num_tickers = 1000 # number of individual securities
    num_periods = 24*30 # 
    total_entries = int(100e6)
    entries_per_period = int(total_entries / num_periods)
    this_period = 0
    timer = datetime.utcnow()
    counter = 0

    # set the map policies
    for period in range(num_periods):
        for ticker in range(num_tickers):
            key = ("test", "tickermap", str(ticker) + ":" + str(period))
            client.map_set_policy(key, "ticks", map_policy)

    # insert
    for period in range(num_periods):
        for entry in range(entries_per_period):
            ticker = str(np.random.choice(num_tickers)) + ":" + str(period)
            key = ("test", "tickermap", ticker)
            client.map_put(key, "ticks", this_period, np.random.uniform())
            this_period += 1
            counter += 1
            if(datetime.utcnow() - timer).seconds > 0:
                print(counter, "per second, total insertions:", this_period, " GB used:",\
                        thisprocess.memory_info().rss / 1e9)
                counter = 0
                timer = datetime.utcnow()
pygupta commented 7 years ago

I tested the above code up to 25 million inserts. I did not see the 10x increase in total memory. The memory consumed by the application code stays stable, while memory consumed by the server grows with the data, almost one to one.

Summary of test on Ubuntu 16.04 VM, running CE 3.11.0.1 1 - Boot up VM from shutdown, open terminal, start htop: Tot mem: 663M 2 - Start second terminal, launch aerospike single node server: 811M 3 - Start AMC: 838M 4 - Start firefox, launch AMC dashboard: 1.08G total, 0 Mb server 5 - Start above code (memleaktest.py) : 1.09G / 0Mb 6 - All 720K keys inserted - first code loop: 1.16G/62.9M 7 - 5 mill insert updates to maps: 1.36G/142M 8 - 10 mill insert updates to maps: 1.45G/224M (Steps 1 -8 were repeated twice - shutdown, reboot VM, similar data) 9 - 20 mill insert updates to maps: 1.63G/385M memleaktest.py mem usage remains at 1.2% to total memory throughout.

10 - 25 million insert updates to maps: 1.71G/466M

I have screen shots for all steps 1-9.

Step#5: screen shot 2017-02-02 at 8 39 43 pm

Step#6: screen shot 2017-02-02 at 8 40 45 pm

Step#7: (5 mill)

screen shot 2017-02-02 at 8 46 32 pm

Step#8: (10 mill) screen shot 2017-02-02 at 8 52 09 pm

Step #9: (20 mill) screen shot 2017-02-02 at 9 03 36 pm

Step#10 (25 million) screen shot 2017-02-02 at 9 08 48 pm

jboone100 commented 7 years ago

@pygupta : Were you running with python 2.x? This looks to be a problem with python 3.x due to the unicode conversion of the string in the key. I have located the leak and will have a patch to the client tomorrow.

pygupta commented 7 years ago

Duh... I missed the note in bold.. only affects Python 3. I ran with 2.7.1.2 $ python Python 2.7.12 (default, Nov 19 2016, 06:48:10) [GCC 5.4.0 20160609] on linux2

Yup, with python3, it just takes off!

screen shot 2017-02-02 at 9 46 57 pm

jboone100 commented 7 years ago

@vegabook: Can you try your program with the 2.0.7 client release?

pygupta commented 7 years ago

@jboone100 - does not look like it fixed. 1 - updated to 2.0.7

$ sudo python3 -m pip install aerospike -U
...
Collecting aerospike
  Downloading aerospike-2.0.7.tar.gz (102kB)
    100% |████████████████████████████████| 102kB 347kB/s 
Installing collected packages: aerospike
  Found existing installation: aerospike 2.0.6
    Uninstalling aerospike-2.0.6:
      Successfully uninstalled aerospike-2.0.6
  Running setup.py install for aerospike ... |
done
Successfully installed aerospike-2.0.7

(No AMC, only terminals, fresh reboot.) Check:

piyush@ubuntu:~/discussRepo/pyTestMemLeak$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import aerospike
>>> print(aerospike.__version__)
2.0.7
>>> 

Then, started aerospike server and ran python3 memleaktest.py. 1 - After initiating all keys, total mem = 0.864 Gb, server = 62 Mb 2 - After 5 million update inserts: total mem = 1.14Gb, server = 143 Mb

% mem used by python3 memleaktest.py increases steadily, upto 12% by 5 million inserts. Step 1 - screenshot: screen shot 2017-02-03 at 1 39 56 pm Step 2 - screenshot: screen shot 2017-02-03 at 1 46 13 pm

jboone100 commented 7 years ago

Can you try running leaks <pid> on your process while it is running. It was pretty clear from the output of that tool that the 2.0.6 client was leaking and 2.0.7 isn't.

pygupta commented 7 years ago

@jboone100 I don't know "leaks" - on Ubuntu?.. but here is RSS growing:

piyush@ubuntu:~$ leaks 5161
leaks: command not found
piyush@ubuntu:~$ ps -p 5161 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5161 233564 11.5 python3 memleaktest.py
piyush@ubuntu:~$ ps -p 5161 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5161 237916 11.7 python3 memleaktest.py
piyush@ubuntu:~$ ps -p 5161 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5161 239708 11.8 python3 memleaktest.py
piyush@ubuntu:~$ ps -p 5161 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5161 242780 11.9 python3 memleaktest.py
piyush@ubuntu:~$ ps -p 5161 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5161 245852 12.1 python3 memleaktest.py

RSS tracking is what has been also coded in the application using psutil - and that keeps increasing. With python 2.7 it was stable, same value. See earlier screeshots from my first post. (The 0.023... number.) RSS : (real memory size or resident set size in 1024 byte units)

Double check with python 2.7, pretty rock solid.

piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
piyush@ubuntu:~$ ps -p 5599 -o pid -o rss -o %mem -o cmd
   PID   RSS %MEM CMD
  5599 28204  1.3 python memleaktest.py
vegabook commented 7 years ago

I can confirm that my 2.0.7 driver is also still leaking memory on Python 3. screenshot from 2017-02-04 13-23-58

After 10 million inserts: screenshot from 2017-02-04 13-29-21

Using 2.0.7: screenshot from 2017-02-04 13-30-39

jboone100 commented 7 years ago

Fixed in the 2.0.8 release.

pygupta commented 7 years ago

Fix validated. No more memory leaks with python3.

13602 per second, total insertions: 3515973  GB used: 0.028585984
13558 per second, total insertions: 3529531  GB used: 0.028585984
13899 per second, total insertions: 3543430  GB used: 0.028585984
14112 per second, total insertions: 3557542  GB used: 0.028585984
12829 per second, total insertions: 3570371  GB used: 0.028585984
13347 per second, total insertions: 3583718  GB used: 0.028585984
13139 per second, total insertions: 3596857  GB used: 0.028585984

screen shot 2017-02-10 at 7 37 35 pm

vegabook commented 7 years ago

yep. Thanks for your efforts on this.

image