capitalone / giraffez

User-friendly Teradata client for Python
https://capitalone.github.io/giraffez
Apache License 2.0

Load runs out of memory #73

Open istvan-fodor opened 5 years ago

istvan-fodor commented 5 years ago

giraffez version 2.0.24.2
Teradata Load Utility Version 16.20.00.09 (64-bit)
Ubuntu 16.04, 4 cores, 16 GB RAM

If I run the giraffez load operation on large CSVs, the process runs out of memory. I see the usual message on the command line (Processed X Rows), and top shows memory usage slowly creeping up. After a while the process maxes out at around 16 GB and the giraffez process is killed. Is this expected, or is there a configuration I am missing?

istvan-fodor commented 5 years ago

I observed the same issue if I run it through the BulkLoad API with put().

hiker77 commented 5 years ago

My team ran into the same issue. I believe the leak happens when iterating over the elements of a tuple to convert them: PySequence_GetItem is called, which yields a new reference, but only the last element in the list gets deallocated. I ran a few tests using 1 million records, 2 integers per record; memory profiler snippets are below.

First run:

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.5 MiB     22.5 MiB   @profile
    11                             def run():
    12     27.4 MiB      4.9 MiB       with giraffez.Cmd() as cmd:
    13     27.4 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.4 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.4 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.4 MiB      0.0 MiB           reader = csv.reader(f)
    17     93.6 MiB      0.3 MiB           for i, record in enumerate(reader):
    18     93.6 MiB      2.7 MiB               ld.put(record)
    19     93.6 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

And the second run, with the change:

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.7 MiB     22.7 MiB   @profile
    11                             def run():
    12     27.3 MiB      4.6 MiB       with giraffez.Cmd() as cmd:
    13     27.3 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.3 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.3 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.3 MiB      0.0 MiB           reader = csv.reader(f)
    17     31.7 MiB      0.0 MiB           for i, record in enumerate(reader):
    18     31.7 MiB      2.9 MiB               ld.put(record)
    19     31.7 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

The change can be seen in the commit below. Figuring this one out was my first dive into C/C++ since school, so any feedback or alternative/better solutions would be appreciated.

b1b9a50c9e8d89e4cedfdd0cbbdbd11977d66597
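For anyone following along, here is a minimal sketch of the pattern described above, not the giraffez encoder itself: PySequence_GetItem returns a new reference, so a conversion loop that never calls Py_DECREF keeps every fetched item alive, while adding the Py_DECREF releases each item as soon as it has been used. The function names and the compile line are illustrative assumptions, not code from the repo.

/* Illustration only, not giraffez code: the leaking loop vs. the fixed loop.
 * Compiles as part of a CPython extension, e.g.
 *   gcc -c -fPIC refcount_sketch.c $(python3-config --includes)
 */
#include <Python.h>

/* Leaky pattern: PySequence_GetItem returns a NEW reference. Without a
 * matching Py_DECREF, every item fetched in the loop keeps an extra
 * reference, so memory grows with every row that is converted. */
static int sum_row_leaky(PyObject *row, long *out)
{
    Py_ssize_t i, n = PySequence_Size(row);
    long total = 0;
    if (n < 0) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        PyObject *item = PySequence_GetItem(row, i);  /* new reference */
        if (item == NULL) {
            return -1;
        }
        total += PyLong_AsLong(item);
        /* missing Py_DECREF(item): the item is never released */
    }
    *out = total;
    return 0;
}

/* Fixed pattern: drop our reference as soon as the item has been used,
 * so each element can be freed immediately instead of piling up. */
static int sum_row_fixed(PyObject *row, long *out)
{
    Py_ssize_t i, n = PySequence_Size(row);
    long total = 0;
    if (n < 0) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        PyObject *item = PySequence_GetItem(row, i);  /* new reference */
        if (item == NULL) {
            return -1;
        }
        total += PyLong_AsLong(item);
        Py_DECREF(item);  /* release the reference we own */
    }
    *out = total;
    return 0;
}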

istvan-fodor commented 5 years ago

Thanks @hiker77!