appeler / ethnicolr

Predict Race and Ethnicity Based on the Sequence of Characters in a Name
http://ethnicolr.readthedocs.io
MIT License

Memory leak #90

Closed spurrious closed 1 year ago

spurrious commented 1 year ago

Hello, my colleagues and I are using ethnicolr and its functions on a large dataset of names (several million). We are seeing a memory leak when running pred_fl_reg_name in a loop: memory usage quickly balloons to over 256 GB within a dozen iterations.

This simple loop is enough to recreate the issue. Any advice is appreciated. Thank you.

```
import pandas as pd
from ethnicolr import pred_fl_reg_name

# namefile and savefile are the input and output CSV paths.
data = pd.read_csv(namefile, iterator=True, chunksize=10000)
for chunk in data:
    odf = pred_fl_reg_name(chunk, 'lastname', 'firstname', conf_int=0.9)
    odf.to_csv(savefile, mode='a', index=False, header=False)
```
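One way to contain this kind of growth regardless of its cause is to run each chunk in a short-lived worker process, so whatever memory the prediction allocates is returned to the OS when the worker exits. This is only a sketch, assuming pred_fl_reg_name behaves the same when called from a child process; namefile and savefile are the same placeholder paths as above.

```
import multiprocessing as mp

import pandas as pd
from ethnicolr import pred_fl_reg_name

def predict_chunk(chunk, savefile):
    # Runs in a short-lived worker process; memory allocated for the
    # prediction is returned to the OS when the process exits.
    odf = pred_fl_reg_name(chunk, 'lastname', 'firstname', conf_int=0.9)
    odf.to_csv(savefile, mode='a', index=False, header=False)

if __name__ == '__main__':
    # namefile and savefile are the same placeholder paths as in the loop above.
    data = pd.read_csv(namefile, iterator=True, chunksize=10000)
    for chunk in data:
        p = mp.Process(target=predict_chunk, args=(chunk, savefile))
        p.start()
        p.join()
```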

soodoku commented 1 year ago

thanks for flagging this. is this the latest version? can you put in some fake data and send over an example? makes it slightly easier to reproduce.

spurrious commented 1 year ago

Thank you for your prompt reply. The code below, which generates some fake name data, is enough to replicate our issue. After 10 loops, the Python process is using over 100 GB.

I am using version 0.9.6 on Python 3.11.3, on Windows 10.


```
import pandas as pd
from ethnicolr import pred_fl_reg_name

firstnames = ['Mary','Patricia','Jennifer','Linda','Elizabeth','Barbara','Susan','Jessica','Sarah','Karen','Lisa','Nancy','Betty','Margaret','Sandra','Ashley','Kimberly','Emily','Donna','Michelle','Carol','Amanda','Dorothy','Melissa','Deborah','Stephanie','Rebecca','Sharon','Laura','Cynthia','Kathleen','Amy','Angela','Shirley','Anna','Brenda','Pamela','Emma','Nicole','Helen','Samantha','Katherine','Christine','Debra','Rachel','Carolyn','Janet','Catherine','Maria','Heather','Diane','Ruth','Julie','Olivia','Joyce','Virginia','Victoria','Kelly','Lauren','Christina','Joan','Evelyn','Judith','Megan','Andrea','Cheryl','Hannah','Jacqueline','Martha','Gloria','Teresa','Ann','Sara','Madison','Frances','Kathryn','Janice','Jean','Abigail','Alice','Julia','Judy','Sophia','Grace','Denise','Amber','Doris','Marilyn','Danielle','Beverly','Isabella','Theresa','Diana','Natalie','Brittany','Charlotte','Marie','Kayla','Alexis','Lori']
lastnames = ['Smith','Johnson','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts']

data = []
for fname in firstnames:
    for lname in lastnames:
        data.append([fname, lname])

df = pd.DataFrame(data, columns=['firstname','lastname'])

for i in range(0, 1000):
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)

```    
suriyan commented 1 year ago

I can confirm the memory leak when running with Python 3.11.x; it appears to be caused by TensorFlow (https://github.com/tensorflow/tensorflow/issues/60131).

There is no memory leak with ethnicolr 0.9.6 + Python 3.10.2 + tensorflow 2.12.0.

```
import os
import psutil
import pandas as pd
from ethnicolr import pred_fl_reg_name

def get_process_memory():
    process = psutil.Process(os.getpid())
    return int(process.memory_info().rss / (1024*1024))

firstnames = ['Mary','Patricia','Jennifer','Linda','Elizabeth','Barbara','Susan','Jessica','Sarah','Karen','Lisa','Nancy','Betty','Margaret','Sandra','Ashley','Kimberly','Emily','Donna','Michelle','Carol','Amanda','Dorothy','Melissa','Deborah','Stephanie','Rebecca','Sharon','Laura','Cynthia','Kathleen','Amy','Angela','Shirley','Anna','Brenda','Pamela','Emma','Nicole','Helen','Samantha','Katherine','Christine','Debra','Rachel','Carolyn','Janet','Catherine','Maria','Heather','Diane','Ruth','Julie','Olivia','Joyce','Virginia','Victoria','Kelly','Lauren','Christina','Joan','Evelyn','Judith','Megan','Andrea','Cheryl','Hannah','Jacqueline','Martha','Gloria','Teresa','Ann','Sara','Madison','Frances','Kathryn','Janice','Jean','Abigail','Alice','Julia','Judy','Sophia','Grace','Denise','Amber','Doris','Marilyn','Danielle','Beverly','Isabella','Theresa','Diana','Natalie','Brittany','Charlotte','Marie','Kayla','Alexis','Lori']
lastnames = ['Smith','Johnson','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts']

data = []
for fname in firstnames:
    for lname in lastnames:
        data.append([fname, lname])

df = pd.DataFrame(data, columns=['firstname','lastname'])

for i in range(0, 5):
    print('Loop: %d Memory: %dMB' % (i, get_process_memory()))
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)
```

Output:

```
Loop: 0 Memory: 237MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 1 Memory: 284MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 2 Memory: 286MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 3 Memory: 286MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 4 Memory: 285MB
['asian', 'hispanic', 'nh_black', 'nh_white']
```
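For anyone who needs to stay on Python 3.11 until the upstream TensorFlow issue is fixed, a commonly suggested mitigation (not verified in this thread) is to clear the Keras backend session and force a garbage-collection pass between predictions. A minimal sketch against the loop above, assuming ethnicolr reloads its model on each call so clearing the session between calls is safe:

```
import gc

import tensorflow as tf
from ethnicolr import pred_fl_reg_name

# df is the fake-name DataFrame built in the scripts above.
for i in range(0, 1000):
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)
    # ... write odf out or accumulate results here ...
    del odf
    # Drop Keras' global graph/session state and force a GC pass so objects
    # left over from the previous prediction can actually be freed.
    tf.keras.backend.clear_session()
    gc.collect()
```

If that is not enough, the process-per-chunk approach sketched earlier in the thread sidesteps the leak entirely by letting the OS reclaim the worker's memory after every chunk.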