spurrious closed this issue 1 year ago
Thanks for flagging this. Is this the latest version? Can you put in some fake data and send over an example? That makes it slightly easier to reproduce.
Thank you for your prompt reply. The code below, which generates some fake name data, is enough to replicate our issue. After 10 loops, Python is demanding 100+ GB of memory.
I am using ethnicolr 0.9.6 on Python 3.11.3, on Windows 10.
```
import pandas as pd
from ethnicolr import pred_fl_reg_name
firstnames = ['Mary','Patricia','Jennifer','Linda','Elizabeth','Barbara','Susan','Jessica','Sarah','Karen','Lisa','Nancy','Betty','Margaret','Sandra','Ashley','Kimberly','Emily','Donna','Michelle','Carol','Amanda','Dorothy','Melissa','Deborah','Stephanie','Rebecca','Sharon','Laura','Cynthia','Kathleen','Amy','Angela','Shirley','Anna','Brenda','Pamela','Emma','Nicole','Helen','Samantha','Katherine','Christine','Debra','Rachel','Carolyn','Janet','Catherine','Maria','Heather','Diane','Ruth','Julie','Olivia','Joyce','Virginia','Victoria','Kelly','Lauren','Christina','Joan','Evelyn','Judith','Megan','Andrea','Cheryl','Hannah','Jacqueline','Martha','Gloria','Teresa','Ann','Sara','Madison','Frances','Kathryn','Janice','Jean','Abigail','Alice','Julia','Judy','Sophia','Grace','Denise','Amber','Doris','Marilyn','Danielle','Beverly','Isabella','Theresa','Diana','Natalie','Brittany','Charlotte','Marie','Kayla','Alexis','Lori']
lastnames = ['Smith','Johnson','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts']
# Build every first/last name combination as one row per pair
data = []
for fname in firstnames:
    for lname in lastnames:
        data.append([fname, lname])

df = pd.DataFrame(data, columns=['firstname', 'lastname'])

# Memory grows on every call; after ~10 iterations the process demands 100+ GB
for i in range(0, 1000):
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)
```
Confirmed the memory leak when running with Python 3.11.x; it appears to be caused by TensorFlow (https://github.com/tensorflow/tensorflow/issues/60131). There is no memory leak with ethnicolr 0.9.6 + Python 3.10.2 + tensorflow 2.12.0. A possible mitigation is sketched after the output below.
```
import os
import psutil
import pandas as pd
from ethnicolr import pred_fl_reg_name
def get_process_memory():
    # Resident set size (RSS) of the current process, in MB
    process = psutil.Process(os.getpid())
    return int(process.memory_info().rss / (1024 * 1024))
firstnames = ['Mary','Patricia','Jennifer','Linda','Elizabeth','Barbara','Susan','Jessica','Sarah','Karen','Lisa','Nancy','Betty','Margaret','Sandra','Ashley','Kimberly','Emily','Donna','Michelle','Carol','Amanda','Dorothy','Melissa','Deborah','Stephanie','Rebecca','Sharon','Laura','Cynthia','Kathleen','Amy','Angela','Shirley','Anna','Brenda','Pamela','Emma','Nicole','Helen','Samantha','Katherine','Christine','Debra','Rachel','Carolyn','Janet','Catherine','Maria','Heather','Diane','Ruth','Julie','Olivia','Joyce','Virginia','Victoria','Kelly','Lauren','Christina','Joan','Evelyn','Judith','Megan','Andrea','Cheryl','Hannah','Jacqueline','Martha','Gloria','Teresa','Ann','Sara','Madison','Frances','Kathryn','Janice','Jean','Abigail','Alice','Julia','Judy','Sophia','Grace','Denise','Amber','Doris','Marilyn','Danielle','Beverly','Isabella','Theresa','Diana','Natalie','Brittany','Charlotte','Marie','Kayla','Alexis','Lori']
lastnames = ['Smith','Johnson','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts','Williams','Brown','Jones','Garcia','Miller','Davis','Rodriguez','Martinez','Hernandez','Lopez','Gonzalez','Wilson','Anderson','Thomas','Taylor','Moore','Jackson','Martin','Lee','Perez','Thompson','White','Harris','Sanchez','Clark','Ramirez','Lewis','Robinson','Walker','Young','Allen','King','Wright','Scott','Torres','Nguyen','Hill','Flores','Green','Adams','Nelson','Baker','Hall','Rivera','Campbell','Mitchell','Carter','Roberts']
# Build every first/last name combination as one row per pair
data = []
for fname in firstnames:
    for lname in lastnames:
        data.append([fname, lname])

df = pd.DataFrame(data, columns=['firstname', 'lastname'])

# Print process memory before each prediction call
for i in range(0, 5):
    print('Loop: %d Memory: %dMB' % (i, get_process_memory()))
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)
```
Output:
```
Loop: 0 Memory: 237MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 1 Memory: 284MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 2 Memory: 286MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 3 Memory: 286MB
['asian', 'hispanic', 'nh_black', 'nh_white']
Loop: 4 Memory: 285MB
['asian', 'hispanic', 'nh_black', 'nh_white']
```
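If the leak really is in TensorFlow's graph state, as the linked issue suggests, one mitigation worth trying on Python 3.11 is to reset Keras' global state and force a garbage-collection pass between iterations. This is only a sketch, not a confirmed fix, and I have not tested how it interacts with the model that ethnicolr caches internally:
```
import gc
import tensorflow as tf

for i in range(0, 1000):
    odf = pred_fl_reg_name(df, 'lastname', 'firstname', conf_int=0.9)
    # Drop Keras' global graph state and force a GC pass so leaked graph
    # objects have a chance to be reclaimed between iterations
    tf.keras.backend.clear_session()
    gc.collect()
```
Failing that, pinning Python 3.10.x with tensorflow 2.12.0 avoids the leak entirely, as the measurements above show.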
Hello, my colleagues and I are using ethnicolr and its functions on a large dataset of names (several million). We are experiencing a memory leak when running pred_fl_reg_name in a loop: the memory demand quickly balloons to over 256 GB within a dozen iterations.
The simple loop below is enough to recreate the issue. Any advice is appreciated. Thank you.
```
import math
import pandas as pd
from ethnicolr import pred_fl_reg_name

# namefile and savefile are paths defined elsewhere in our pipeline
data = pd.read_csv(namefile, iterator=True, chunksize=10000)
for chunk in data:
    odf = pred_fl_reg_name(chunk, 'lastname', 'firstname', conf_int=0.9)
    odf.to_csv(savefile, mode='a', index=False, header=False)
```
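Edit: one workaround that should bound memory regardless of where the leak lives is to score each chunk in a short-lived worker process, so whatever TensorFlow leaks is handed back to the OS when the worker exits. A minimal sketch along those lines, reusing the namefile/savefile paths and chunk size from the snippet above; score_chunk is a hypothetical helper, not part of ethnicolr:
```
import pandas as pd
from multiprocessing import get_context

def score_chunk(chunk):
    # Import inside the worker so each fresh process loads its own copy
    # of TensorFlow and the ethnicolr model
    from ethnicolr import pred_fl_reg_name
    return pred_fl_reg_name(chunk, 'lastname', 'firstname', conf_int=0.9)

if __name__ == '__main__':
    data = pd.read_csv(namefile, iterator=True, chunksize=10000)
    # maxtasksperchild=1 recycles the worker after every chunk, so memory
    # leaked during prediction is reclaimed when the process exits; 'spawn'
    # is the default start method on Windows anyway
    with get_context('spawn').Pool(processes=1, maxtasksperchild=1) as pool:
        for odf in pool.imap(score_chunk, data):
            odf.to_csv(savefile, mode='a', index=False, header=False)
```
Reloading the model in every worker adds startup cost, so a larger chunksize amortizes it better; the trade is throughput for a hard cap on memory.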