Open istvan-fodor opened 5 years ago
I observed that it's the same issue if I run it through the BulkLoad API with put().
My team ran into the same issue. I believe the leak happens when iterating over the elements of a tuple to convert them: PySequence_GetItem is called, which returns a new reference, but only the last element in the list ever gets deallocated. I ran a few tests using 1 million records with 2 integers per record; memory_profiler snippets are below. First run:
```
Line #    Mem usage    Increment   Line Contents
================================================
    10     22.5 MiB     22.5 MiB   @profile
    11                             def run():
    12     27.4 MiB      4.9 MiB       with giraffez.Cmd() as cmd:
    13     27.4 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.4 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.4 MiB      0.0 MiB           with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.4 MiB      0.0 MiB               reader = csv.reader(f)
    17     93.6 MiB      0.3 MiB               for i, record in enumerate(reader):
    18     93.6 MiB      2.7 MiB                   ld.put(record)
    19     93.6 MiB      0.0 MiB                   print(f"\rRows Loaded: {i}", end='', flush=True)
```
And the second run, with the change:
```
Line #    Mem usage    Increment   Line Contents
================================================
    10     22.7 MiB     22.7 MiB   @profile
    11                             def run():
    12     27.3 MiB      4.6 MiB       with giraffez.Cmd() as cmd:
    13     27.3 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.3 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.3 MiB      0.0 MiB           with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.3 MiB      0.0 MiB               reader = csv.reader(f)
    17     31.7 MiB      0.0 MiB               for i, record in enumerate(reader):
    18     31.7 MiB      2.9 MiB                   ld.put(record)
    19     31.7 MiB      0.0 MiB                   print(f"\rRows Loaded: {i}", end='', flush=True)
```
The change can be seen in the commit below. Trying to figure this out was my first dive into C/C++ since school, so any feedback or alternative/better solutions would be appreciated.

b1b9a50c9e8d89e4cedfdd0cbbdbd11977d66597
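In case it helps anyone reading along, the pattern looks roughly like the sketch below. This is a simplified, hypothetical example (the function name pack_row and the conversion step are made up, not the actual giraffez encoder code), but it shows the rule involved: PySequence_GetItem returns a new reference, so each element has to be released with Py_DECREF inside the loop rather than only once after it.

```c
#include <Python.h>

/* Simplified sketch only -- not the actual giraffez encoder code.
 * PySequence_GetItem() returns a new reference for every element,
 * so each one must be released inside the loop. Decrementing only
 * the last element after the loop leaks one reference per remaining
 * element of every row, which adds up quickly over millions of rows. */
static int
pack_row(PyObject *row)
{
    Py_ssize_t i, n;
    PyObject *item;

    n = PySequence_Size(row);
    if (n < 0) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        item = PySequence_GetItem(row, i);  /* new reference */
        if (item == NULL) {
            return -1;
        }
        /* ... convert `item` and append it to the output buffer ... */
        Py_DECREF(item);  /* release here, once per element */
    }
    return 0;
}
```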
Thanks, @hiker77!
giraffez version 2.0.24.2
Teradata Load Utility Version 16.20.00.09 64-Bit
Ubuntu 16.04
4 cores, 16 GB RAM
If I run the giraffez load operation on large CSVs, the process runs out of memory. I see the usual message on the command line (Processed X Rows), and in top I can see memory usage slowly creeping up. After a while the process maxes out at around 16 GB of memory and the giraffez process is killed. Is this expected, or is there a configuration I am missing?