uditsharma29 opened this issue 1 year ago
Hi @uditsharma29, thanks for the report. Can you please provide a bit more info as to what you're doing? Are you by any chance creating CPython objects in the main loop? That's the source of another issue so I'm wondering if it's the same here.
Hi @arshajii, thanks for the response. Here is what my code does (pseudocode):
1. Read a file (parquet) consisting of serialized tree objects (2-3 million individual graphs).
2. Deserialize to create the graphs.
3. Read another file (csv); its contents will be used to update the graphs appropriately (10 million rows per day).
4. Given the contents of both files as input, update the graphs.
5. Serialize and save to a file (parquet) again.
The above is supposed to happen daily with 10 million new records, for 30 days. Codon is unable to deal with the data after about 7 days, once memory consumption reaches 80+ GB. Here is the abstracted code so you can see how it works:
```python
for date in date_range:
    ip_dict: dict[str, Record]
    input_file_names: list[str]
    output_file_name: str
    f_data: list[list[str]]
    inter_f_data: list[list[str]]

    print("Processing date: " + date)
    input_file_names = ["merged_graphs_hash_" + str(num_batches) + "_" + str(batch_num) + "_20230701.parquet",
                        "source_data_batch_" + str(batch_num) + "_" + date + ".csv"]
    output_file_name = "merged_graphs_hash_" + str(num_batches) + "_" + str(batch_num) + "_20230701.parquet"

    data: list[list[str]]
    if len(input_file_names) > 1:
        temp_f_data = read_serialized_file(input_file_names[0])
        data = read_csv_from_gcs(input_file_names[1])
    else:
        data = read_csv_from_gcs(input_file_names[0])

    if len(input_file_names) > 1:
        des_f_data = deserialize_file_contents(temp_f_data)
        ip_dict = create_tree_from_deserialized_data(des_f_data)
        del des_f_data
        gc.collect()
        ip_dict = create_map(data, ip_dict)
    else:
        ip_dict = create_map(data, {})

    del data
    gc.collect()

    f_data = create_final_data(ip_dict)
    del ip_dict
    gc.collect()

    # print("Writing to parquet file")
    write_parquet_file(output_file_name, convert_to_dataframe(f_data))
    del f_data
    gc.collect()
```
I am not sure what you mean by creating CPython objects in the main loop. In this case, even if the processing happens in the functions, the entire dataset does go through the main loop at some point, and I don't see a way to avoid that.
Please let me know if you need more information to understand the issue further.
Thanks for the follow-up -- out of curiosity, can you try running with the environment variable `GC_MAXIMUM_HEAP_SIZE` set to `50G` (or whatever would be a reasonable heap size)?
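
For example, something along these lines (a rough sketch: `pipeline.codon` is a placeholder script name, and this assumes `os.getenv` is available as in CPython):

```python
import os

# The collector reads GC_MAXIMUM_HEAP_SIZE from the environment, so export it
# before launching the program, e.g.:
#
#     GC_MAXIMUM_HEAP_SIZE=50G codon run -release pipeline.codon
#
# Inside the program you can then confirm that the process actually sees it:
print("GC_MAXIMUM_HEAP_SIZE =", os.getenv("GC_MAXIMUM_HEAP_SIZE"))
```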
By CPython object, I meant if you had any `from python import`s in the code that you were using in the main loop.
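
To be concrete, the pattern I have in mind looks something like this (the `numpy` import is purely illustrative; any Python module counts):

```python
from python import numpy as np   # anything imported this way is a CPython module

for _ in range(1000000):
    arr = np.zeros(10)   # each `arr` is a CPython object created inside the loop,
                         # managed by Python's runtime rather than Codon's own GC
```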
There is little to no improvement when using the `GC_MAXIMUM_HEAP_SIZE` env variable. Just in case it matters, I specified `GC_MAXIMUM_HEAP_SIZE` inside an `@python` block as `os.environ['GC_MAXIMUM_HEAP_SIZE'] = str(heap_size_bytes)`, since setting it directly in Codon resulted in an error. I hope it still works as expected.
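
Roughly, that block looks like this (a simplified sketch; the actual function name and surrounding code differ):

```python
@python
def set_gc_heap_limit(heap_size_bytes: int) -> None:
    # this body runs under CPython, where assigning to os.environ is supported
    import os
    os.environ['GC_MAXIMUM_HEAP_SIZE'] = str(heap_size_bytes)
```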
I am not using any `from python import`s in my code.
To clarify on the CPython objects: we are using functions with `@python` decorators for I/O operations, i.e. reading the files initially and writing (updating) the files at the end of processing. There are no CPython objects during processing. Below is pseudocode for the workflow:
```python
@python
def read():
    # read file data
    with open("file.csv") as f:
        data = f.read()
    return data

def process(data):
    # process the data (placeholder for the real processing)
    processed_data = data
    return processed_data

@python
def write(processed_data):
    # write the data
    with open("output.csv", "w") as f:
        f.write(processed_data)

def main():
    data = read()
    processed_data = process(data)
    write(processed_data)
```
In this case, since `read()` has the `@python` decorator, is `data` considered a CPython object?
Hello,
While working with real-world data of a couple hundred million rows, Python runs at about 50 GB of memory consumption according to Mac's Activity Monitor. Codon, on the other hand, uses more than 80 GB, slows down dramatically, and breaks at about the 25% mark. Similar memory consumption was observed even when I implemented further batching, which indicates that the memory consumed in prior batches is not being released even after a batch is done processing. I also tried using the `del` keyword and calling `gc.collect()` as early as possible, but there is no improvement.
How does Codon manage memory? What should I be doing to explicitly release memory from variables which are no longer needed?
Please let me know if you need anything from my end to replicate the issue.