
OOM on Write_Partitioned for S3 #6380

Open stanbrub opened 1 week ago

stanbrub commented 1 week ago

Using a very small amount of data, write_partitioned to S3 crashes DHC with an OOM. The script below should run comfortably in a 24G heap; instead, hovering over the heap status in the DHC Code Studio shows memory usage climbing rapidly. Running with a row_count of 2000 crashes DHC with an OOM.

Decreasing the number of unique values for the partition key mitigates the problem, so the issue appears to be driven by the number of unique partition key values (or combinations of multiple partition keys) rather than the number of rows. A sketch of that mitigation follows the script below.

import jpy
from deephaven import empty_table, garbage_collect
from deephaven.parquet import write_partitioned
from deephaven.experimental import s3

def print_heap():
    # Optionally force a GC first so the reading reflects live objects only
    # garbage_collect()
    runtime = jpy.get_type('java.lang.Runtime').getRuntime()
    print('Heap Used MB:', (runtime.totalMemory() - runtime.freeMemory()) / 1024 / 1024)

row_count = 1_000

print_heap()
print('Generate Table')
# Column values cycle through up to 10,000 distinct ints/shorts; every 10th row is null
source = empty_table(row_count).update([
    'int10K=(ii % 10 == 0) ? null : ((int)(ii % 10000))',
    'short10K=(ii % 10 == 0) ? null : ((short)(ii % 10000))'
])
print_heap()
print('Partition By 1 Int Column')
# One constituent table is produced per unique int10K value
source = source.partition_by(['int10K'])
print_heap()
print('S3 Write Partitioned')
write_partitioned(
    source, 's3://data/source.ptr.parquet', special_instructions=s3.S3Instructions(
        region_name='aws-global', endpoint_override='http://minio:9000',
        credentials=s3.Credentials.basic('minioadmin', 'minioadmin'),
        connection_timeout='PT20S'
    )
)
print_heap()
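
To illustrate the mitigation described above, here is a minimal sketch (untested; the int10 column and the source_few and partitioned_few names are illustrative) that keeps the same row count but caps the partition key at roughly 10 unique values. It reuses the same placeholder MinIO endpoint and credentials as the reproducer.

# Hypothetical variant of the reproducer: same row count, but the partition
# key cycles through ~10 unique values instead of up to 10,000.
source_few = empty_table(row_count).update([
    'int10=(ii % 10 == 0) ? null : ((int)(ii % 10))'
])
partitioned_few = source_few.partition_by(['int10'])
print_heap()
write_partitioned(
    partitioned_few, 's3://data/source_few.ptr.parquet',
    special_instructions=s3.S3Instructions(
        region_name='aws-global', endpoint_override='http://minio:9000',
        credentials=s3.Credentials.basic('minioadmin', 'minioadmin'),
        connection_timeout='PT20S'
    )
)
print_heap()

If the heap stays flat here but climbs with 10,000 unique keys, that would be consistent with per-constituent write buffers, rather than total data volume, driving the OOM. If your release of s3.S3Instructions exposes write-buffer tuning (such as part size or the number of concurrent upload parts), lowering those values may also help; parameter names vary by version, so check the docs for your release.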