ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Memory usage slowly climbs up (suspected memory leak) #691

Closed: hazelnut-99 closed this issue 2 weeks ago

hazelnut-99 commented 4 months ago

After launching a pipeline, the worker pod's memory usage slowly increases:

[screenshot: worker pod memory usage climbing over time]

Below is the corresponding CPU usage:

[screenshot: corresponding CPU usage]

This pipeline uses a Kafka source and a Kafka sink. During some intervals the source Kafka topic has no traffic, which lines up with the low CPU usage periods. However, memory usage keeps increasing monotonically throughout.

Could this be a memory leak?

Version: ghcr.io/arroyosystems/arroyo:tip
Configuration & environment: Kubernetes scheduler, a distributed Kubernetes cluster
Query:

CREATE TABLE my_kafka_sink (
    field_1 TEXT,
    field_2 TEXT,
    field_3 DOUBLE,
    ts_1 BIGINT,
    ts_2 BIGINT
) WITH (
    'connector' = 'kafka',
    'avro.confluent_schema_registry' = 'true',
    'bootstrap_servers' = 'my_servers',
    'schema_registry.endpoint' = 'my_endpoint',
    'type' = 'sink',
    'topic' = 'my_topic',
    'format' = 'avro'
);

INSERT INTO my_kafka_sink
SELECT table_1.field_1, table_1.field_2,
    (table_2.field_y - table_1.field_z) AS field_3,
    table_1.ts AS ts_1, table_2.ts AS ts_2
FROM
(
    SELECT field_1, field_2, hop(INTERVAL '5' second, INTERVAL '3' minute) AS window,
        last_value(field_z) AS field_z, last_value("timestampMillis") AS ts
    FROM connection_1
    GROUP BY 1, 2, 3
) AS table_1 JOIN
(
    SELECT field_1, field_2, hop(INTERVAL '5' second, INTERVAL '3' minute) AS window,
        last_value(value_latest) AS field_y, last_value("timestampMillis") AS ts
    FROM connection_2
    GROUP BY 1, 2, 3
) AS table_2
ON table_1.field_1 = table_2.field_1 AND
    table_1.field_2 = table_2.field_2 AND
    table_1.window = table_2.window
hazelnut-99 commented 4 months ago

Below is a snapshot of memory.stat from the worker pod's cgroup (a small sketch for sampling these counters over time follows the two dumps):

cache 1368399872
rss 421425152
rss_huge 392167424
shmem 1367359488
mapped_file 458752
dirty 0
writeback 0
swap 0
pgpgin 488883
pgpgout 149882
pgfault 105704
pgmajfault 8
inactive_anon 1793511424
active_anon 20480
inactive_file 954368
active_file 86016
unevictable 0
hierarchical_memory_limit 8589934592
hierarchical_memsw_limit 8589934592
total_cache 1368399872
total_rss 421425152
total_rss_huge 392167424
total_shmem 1367359488
total_mapped_file 458752
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 488883
total_pgpgout 149882
total_pgfault 105704
total_pgmajfault 8
total_inactive_anon 1793511424
total_active_anon 20480
total_inactive_file 954368
total_active_file 86016
total_unevictable 0

/proc/meminfo

MemTotal:       527163008 kB
MemFree:        105654428 kB
MemAvailable:   322815688 kB
Buffers:            5260 kB
Cached:         210483576 kB
SwapCached:            0 kB
Active:         16759100 kB
Inactive:       376572732 kB
Active(anon):      83636 kB
Inactive(anon): 184074852 kB
Active(file):   16675464 kB
Inactive(file): 192497880 kB
Unevictable:       22404 kB
Mlocked:           22404 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              4988 kB
Writeback:             0 kB
AnonPages:      181540868 kB
Mapped:          2901992 kB
Shmem:           1418116 kB
KReclaimable:   11483240 kB
Slab:           18266620 kB
SReclaimable:   11483240 kB
SUnreclaim:      6783380 kB
KernelStack:      397488 kB
PageTables:       741140 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    263581504 kB
Committed_AS:   612102016 kB
VmallocTotal:   13743895347199 kB
VmallocUsed:      538176 kB
VmallocChunk:          0 kB
Percpu:          4529952 kB
HardwareCorrupted:     0 kB
AnonHugePages:  118398976 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     2514484 kB
DirectMap2M:    395501568 kB
DirectMap1G:    138412032 kB
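For tracking whether these counters keep growing, rather than capturing a single point in time, here is a minimal, hypothetical sketch that samples a few of the same cgroup counters on an interval. It assumes cgroup v1 is mounted at the usual path inside the container; nothing in it is Arroyo-specific.

// Hypothetical sketch: periodically sample the pod's cgroup v1 memory accounting so
// the trend of rss / cache / shmem can be compared against the dashboard graphs.
use std::{fs, thread, time::Duration};

fn main() {
    loop {
        match fs::read_to_string("/sys/fs/cgroup/memory/memory.stat") {
            Ok(stat) => {
                // Keep only the counters of interest for a compact trend line.
                for line in stat.lines() {
                    if line.starts_with("rss ")
                        || line.starts_with("cache ")
                        || line.starts_with("shmem ")
                    {
                        println!("{line}");
                    }
                }
            }
            Err(e) => eprintln!("failed to read memory.stat: {e}"),
        }
        thread::sleep(Duration::from_secs(300));
    }
}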
hazelnut-99 commented 4 months ago

The controller's memory usage slowly climbs up as well:

[screenshot: controller memory usage climbing over time]
mrchypark commented 3 months ago

Any update here? I have the same issue.

mwylde commented 3 months ago

Thanks for the reports; we're investigating.

mwylde commented 3 months ago

We've merged a fix in #717 for a memory leak in the worker that occurred with certain queries. We're still looking into the controller issues, but nothing definitive there yet.

mwylde commented 2 weeks ago

This appears to be fixed, but please re-open if you continue to see memory growth. We've also added some memory profiling tools that can help with debugging.
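For anyone debugging similar growth generically, one common approach for Rust services is to watch allocator-level statistics over time, which helps separate growth inside the allocator from page cache or shared memory. Below is a minimal, hypothetical sketch assuming the tikv-jemallocator and tikv-jemalloc-ctl crates as dependencies; it is not Arroyo's built-in profiling tooling.

// Hypothetical, generic sketch (not Arroyo's profiling tooling): log jemalloc's own
// view of memory on an interval so allocator growth can be compared with cgroup rss.
use std::{thread, time::Duration};

// Use jemalloc as the global allocator so the stats below reflect the whole process.
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    // In a real service this would run as a background task; as a standalone demo
    // we just loop here.
    loop {
        // jemalloc caches its statistics; advancing the epoch refreshes them.
        if tikv_jemalloc_ctl::epoch::advance().is_ok() {
            let allocated = tikv_jemalloc_ctl::stats::allocated::read().unwrap_or(0);
            let resident = tikv_jemalloc_ctl::stats::resident::read().unwrap_or(0);
            println!("jemalloc: allocated={allocated} resident={resident}");
        }
        thread::sleep(Duration::from_secs(60));
    }
}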