exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon

Unsustainable memory consumption #472

Open uditsharma29 opened 9 months ago

uditsharma29 commented 9 months ago

Hello,

While working with real-world data containing a couple hundred million rows, plain Python runs at about 50 GB of memory according to macOS Activity Monitor. Codon, on the other hand, uses more than 80 GB, slows down dramatically, and breaks at about the 25% mark. I observed similar memory consumption even after implementing further batching, which suggests that memory consumed in earlier batches is not released once a batch finishes processing. I also tried the 'del' keyword and calling gc.collect() as early as possible, but there was no improvement.

How does Codon manage memory? What should I be doing to explicitly release the memory held by variables that are no longer needed?

Please let me know if you need anything from my end to replicate the issue.

arshajii commented 9 months ago

Hi @uditsharma29, thanks for the report. Can you please provide a bit more info about what you're doing? Are you by any chance creating CPython objects in the main loop? That's the source of another issue, so I'm wondering if it's the same here.

uditsharma29 commented 9 months ago

Hi @arshajii, thanks for the response. Here is what my code does (pseudocode):

1) Read a parquet file containing serialized tree objects (2-3 million individual graphs).
2) Deserialize the contents to recreate the graphs.
3) Read another file (CSV) whose contents will be used to update the graphs appropriately (about 10 million rows per day).
4) Given the contents of both files as input, update the graphs.
5) Serialize the graphs and save them to a parquet file again.

The above is supposed to happen daily, with 10 million new rows each day, for 30 days. Codon is unable to deal with the data after about 7 days, when memory consumption reaches 80+ GB. Here is the abstracted code so you can see how it works:

for date in date_range:
    ip_dict: dict[str, Record]
    input_file_names: list[str]
    output_file_name: str
    f_data: list[list[str]]
    inter_f_data: list[list[str]]
    print("Processing date: "+date)
    input_file_names = ["merged_graphs_hash_"+str(num_batches)+"_"+str(batch_num)+"_20230701.parquet", "source_data_batch_"+str(batch_num)+"_"+date+".csv"]
    output_file_name = "merged_graphs_hash_"+str(num_batches)+"_"+str(batch_num)+"_20230701.parquet"

    data: list[list[str]]
    if len(input_file_names) > 1:
        temp_f_data = read_serialized_file(input_file_names[0])
        data = read_csv_from_gcs(input_file_names[1])
    else:
        data = read_csv_from_gcs(input_file_names[0])
    if len(input_file_names) > 1:
        des_f_data = deserialize_file_contents(temp_f_data)
        ip_dict = create_tree_from_deserialized_data(des_f_data)
        del des_f_data
        gc.collect()
        ip_dict = create_map(data, ip_dict)
    else:
        ip_dict = create_map(data, {})  
    del data
    gc.collect()
    f_data = create_final_data(ip_dict)
    del ip_dict
    gc.collect()
    #print("Writing to parquet file")
    write_parquet_file(output_file_name, convert_to_dataframe(f_data))
    del f_data
    gc.collect()

I am not sure what you mean by creating CPython objects in the main loop. Even if the processing happens inside the functions, the entire dataset does pass through the main loop at some point, and I don't see a way to avoid that.

Please let me know if you need more information to understand the issue further.
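
One mitigation worth trying (a sketch only, reusing the helper functions and variables from the snippet above such as read_serialized_file, create_map, date_range, num_batches, and batch_num) is to move each iteration's work into its own function, so that every large intermediate goes out of scope as soon as the call returns instead of lingering in the main loop's frame:

import gc

def process_date(date: str, num_batches: int, batch_num: int):
    # Every large intermediate is a local of this function, so it becomes
    # unreachable as soon as the call returns (helpers are the ones above).
    input_file_names = ["merged_graphs_hash_" + str(num_batches) + "_" + str(batch_num) + "_20230701.parquet",
                        "source_data_batch_" + str(batch_num) + "_" + date + ".csv"]
    output_file_name = "merged_graphs_hash_" + str(num_batches) + "_" + str(batch_num) + "_20230701.parquet"
    temp_f_data = read_serialized_file(input_file_names[0])
    data = read_csv_from_gcs(input_file_names[1])
    ip_dict = create_tree_from_deserialized_data(deserialize_file_contents(temp_f_data))
    ip_dict = create_map(data, ip_dict)
    f_data = create_final_data(ip_dict)
    write_parquet_file(output_file_name, convert_to_dataframe(f_data))

for date in date_range:
    print("Processing date: " + date)
    process_date(date, num_batches, batch_num)
    gc.collect()  # intermediates from process_date are no longer referenced here

This does not change the algorithm; it only ensures that no reference to a previous iteration's data survives into the next one, which can make it easier for a conservative collector to reclaim that memory.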

arshajii commented 9 months ago

Thanks for the follow-up -- out of curiosity can you try running with the environment variable GC_MAXIMUM_HEAP_SIZE set to 50G (or whatever would be a reasonable heap size)?

By CPython object, I meant if you had any from python imports in the code that you were using in the main loop.

uditsharma29 commented 9 months ago

There is little to no improvement when using the GC_MAXIMUM_HEAP_SIZE environment variable. In case it matters, I specified GC_MAXIMUM_HEAP_SIZE inside an @python block as os.environ['GC_MAXIMUM_HEAP_SIZE'] = str(heap_size_bytes), because setting it directly in Codon resulted in an error. I hope it still works as expected.

I am not using any from python imports in my code.
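
One detail that may matter here (stated as an assumption, not a confirmed diagnosis): Codon's garbage collector is the Boehm GC, which reads GC_MAXIMUM_HEAP_SIZE from the environment when the program starts, so setting os.environ inside an @python block at runtime is likely too late to take effect. Exporting the variable in the shell before launching the compiled binary, e.g. GC_MAXIMUM_HEAP_SIZE=50G ./my_program (my_program is just a placeholder name), and then checking it at startup would confirm it is visible:

# Minimal startup check (a sketch; assumes Codon's os module provides
# getenv(name, default) like CPython's).
import os

limit = os.getenv("GC_MAXIMUM_HEAP_SIZE", "")
if limit == "":
    print("GC_MAXIMUM_HEAP_SIZE is not set in this process's environment")
else:
    print("GC heap limit requested: " + limit)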

uditsharma29 commented 9 months ago

To clarify the CPython objects question: we are using functions with the @python decorator for I/O operations, i.e. reading the files initially and writing (updating) the files at the end of processing. There are no CPython objects during processing. Below is pseudocode for the workflow:

@python
def read():
    # read file data
    with open("file.csv") as f:
        data = f.read()
    return data

def process(data):
    # process the data
    return processed_data

@python
def write(processed_data):
    # write the data
    with open("output.csv", "w") as f:
        f.write(processed_data)

def main():
    data = read()
    processed_data = process(data)
    write(processed_data)

In this case, since read() has the @python decorator, is data considered a CPython object?
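
For context, and only as a sketch: as I understand the Codon docs, @python functions execute under CPython, and values are converted between CPython and Codon at the call boundary according to the declared parameter and return types. Annotating that boundary explicitly (the path parameter and the list[str] return type below are illustrative, not from the original code) keeps CPython objects confined to the I/O functions:

# Sketch only: an annotated @python boundary so that what comes back is a
# native Codon list[str] rather than a CPython object. The path argument
# and splitlines-based reading are illustrative assumptions.
@python
def read(path: str) -> list[str]:
    with open(path) as f:
        return f.read().splitlines()

def process(lines: list[str]) -> list[str]:
    # runs as native Codon code; no CPython objects involved here
    return [line.strip() for line in lines]

With the boundary annotated like this, the value bound to data in main() should be a native Codon object rather than a CPython one.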