Attempts to configure VM for small memory footprint don't yield expected results

aprospero commented 6 months ago

Is your question request related to a specific component?

VictoriaMetrics

Describe the question in detail

Abstract

I am evaluating VM on an embedded device with limited resources. My ultimate goal is to allocate only 20-30 MiB of RAM to VM.

Setup

Platform:

armv7 single core
512 MB RAM
1GB Flash

OS:

yocto dunfell
kernel: 5.4.219
go runtime 1.4

VM Version:

victoria-metrics-20240301-013527-tags-v1.99.0-0-g9cd4b0537

Test data:

1.5 month
160 series
5min sample rate
retention period 5 month

Test setup:

no ingestion
single queries for varying series over varying timespans
OpenTSDB over HTTP via query_range API and raw data as JSON line protocol via export API (both yield similar results)
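
For anyone who wants to reproduce the two read paths, here is a minimal sketch of how such request URLs could be built. It assumes the Prometheus-compatible `/api/v1/query_range` endpoint and the `/api/v1/export` endpoint on the default port 8428 that appears in the startup log. The series name `sensor_temperature` and the timestamps in the usage note are hypothetical placeholders, not values from the benchmark.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8428"  # HTTP listen address shown in the startup log


def query_range_url(metric: str, start: int, end: int, step: str = "5m") -> str:
    """Build a /api/v1/query_range URL for one series selector."""
    params = {"query": metric, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?{urlencode(params)}"


def export_url(metric: str, start: int, end: int) -> str:
    """Build a /api/v1/export URL, which returns raw samples as JSON lines."""
    params = {"match[]": metric, "start": start, "end": end}
    return f"{BASE_URL}/api/v1/export?{urlencode(params)}"
```

For example, `query_range_url("sensor_temperature", 1715680000, 1715766400)` covers a 1-day window at a 5 min step.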

Command line flags:

varying, always in conjunction with -opentsdbHTTPListenAddr=:4242 -retentionPeriod=5
-memory.allowedPercent=5
-memory.allowedBytes=30MiB
-search.maxMemoryPerQuery=4MiB -search.maxConcurrentRequests=1

Also tried, but only sporadically:

-http.disableResponseCompression
-internStringDisableCache
-loggerLevel="PANIC"
-prevCacheRemovalPercent=0.8
-search.maxConcurrentRequests=1
-search.maxExportSeries=1000
-search.queryStats.lastQueriesCount=0
-search.maxWorkersPerQuery=1
-search.maxUniqueTimeseries=200
-search.maxTSDBStatusSeries=1

Regardless of the combination of listed command line flags the result is always pretty much the same (see below).

Observed behavior

The RSS allocated by VM is around 60 MB right after startup (already far more than expected) and rises continually as the benchmark proceeds. This continues until the system runs out of free memory pages and the kernel kills the VM process.
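
How the RSS was sampled is not stated here. One common way on Linux is to read the `VmRSS` field from `/proc/<pid>/status`. The sketch below assumes that file format, and the helper names are illustrative.

```python
import re
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS field not found (kernel thread or non-Linux system?)")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current resident set size of a process in KiB (Linux only)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_rss_kib(pid)` after each benchmark phase would give one RSS value per phase.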

A typical VM startup log looks like this:

 /usr/bin/vm -opentsdbHTTPListenAddr=:4242 -retentionPeriod=5 -memory.allowedBytes=30MiB -search.maxConcurrentRequests=1 -search.maxMemoryPerQuery=4MiB
2024-05-15T10:13:54.082Z    info    VictoriaMetrics/lib/logger/flag.go:12   build version: victoria-metrics-20240301-013527-tags-v1.99.0-0-g9cd4b0537
2024-05-15T10:13:54.085Z    info    VictoriaMetrics/lib/logger/flag.go:13   command-line flags
2024-05-15T10:13:54.089Z    info    VictoriaMetrics/lib/logger/flag.go:20     -memory.allowedBytes="30MiB"
2024-05-15T10:13:54.091Z    info    VictoriaMetrics/lib/logger/flag.go:20     -opentsdbHTTPListenAddr=":4242"
2024-05-15T10:13:54.095Z    info    VictoriaMetrics/lib/logger/flag.go:20     -retentionPeriod="5"
2024-05-15T10:13:54.097Z    info    VictoriaMetrics/lib/logger/flag.go:20     -search.maxConcurrentRequests="1"
2024-05-15T10:13:54.098Z    info    VictoriaMetrics/lib/logger/flag.go:20     -search.maxMemoryPerQuery="4MiB"
2024-05-15T10:13:54.099Z    info    VictoriaMetrics/app/victoria-metrics/main.go:73 starting VictoriaMetrics at "[:8428]"...
2024-05-15T10:13:54.101Z    info    VictoriaMetrics/app/vmstorage/main.go:106   opening storage at "victoria-metrics-data" with -retentionPeriod=5
2024-05-15T10:13:54.130Z    info    VictoriaMetrics/lib/memory/memory.go:46 limiting caches to 31457280 bytes, leaving 492752896 bytes to the OS according to -memory.allowedBytes=30MiB
2024-05-15T10:13:55.855Z    info    VictoriaMetrics/lib/storage/storage.go:958  discarding /mnt/data/fld-prototype/victoria-metrics-data/cache/curr_hour_metric_ids, since it contains outdated hour; got 476583; want 476602
2024-05-15T10:13:55.859Z    info    VictoriaMetrics/lib/storage/storage.go:958  discarding /mnt/data/fld-prototype/victoria-metrics-data/cache/prev_hour_metric_ids, since it contains outdated hour; got 476582; want 476601
2024-05-15T10:13:56.138Z    info    VictoriaMetrics/lib/storage/storage.go:919  discarding /mnt/data/fld-prototype/victoria-metrics-data/cache/next_day_metric_ids_v2, since it contains data for stale date; got 19857; want 19858
2024-05-15T10:13:56.834Z    info    VictoriaMetrics/app/vmstorage/main.go:120   successfully opened storage "victoria-metrics-data" in 2.731 seconds; partsCount: 34; blocksCount: 4998; rowsCount: 1817056; sizeBytes: 1351094
2024-05-15T10:13:56.852Z    info    VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:126  loading rollupResult cache from "victoria-metrics-data/cache/rollupResult"...
2024-05-15T10:13:58.504Z    info    VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:155  loaded rollupResult cache from "victoria-metrics-data/cache/rollupResult" in 1.644 seconds; entriesCount: 459, sizeBytes: 20119552
2024-05-15T10:13:58.508Z    info    VictoriaMetrics/lib/ingestserver/opentsdbhttp/server.go:35  starting HTTP OpenTSDB server at ":4242"
2024-05-15T10:13:58.516Z    info    VictoriaMetrics/app/victoria-metrics/main.go:84 started VictoriaMetrics in 4.415 seconds
2024-05-15T10:13:58.526Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:118    starting server at http://127.0.0.1:8428/
2024-05-15T10:13:58.528Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:119    pprof handlers are exposed at http://127.0.0.1:8428/debug/pprof/
2024/05/15 10:30:33 ERROR: metrics: cannot read process_io_* metrics from "/proc/self/io", so these metrics won't be updated until the error is fixed; see https://github.com/VictoriaMetrics/metrics/issues/42 ; The error: open /proc/self/io: no such file or directory
Killed

The typical output of our Benchmark looks like this:

VictoriaMetrics Benchmark
The system clock ticks at 0.001 µs, steadiness false. The steady clock ticks at 0.001 µs.

10000 Queries for   1 measurements over     5 minutes took (ms) min/avg/max:     2.63/    7.87/  496.86, median:     6.80, standard deviation:     6.73. Resultcount was (pts) min/avg/max:        0/      0.96/       1, median:        1, standard deviation:     0.20.
10000 Queries for   1 measurements over    60 minutes took (ms) min/avg/max:     2.89/    8.19/   66.47, median:     7.15, standard deviation:     4.24. Resultcount was (pts) min/avg/max:        0/     11.54/      17, median:       12, standard deviation:     2.41.
10000 Queries for   1 measurements over  1440 minutes took (ms) min/avg/max:     3.32/   15.31/  153.49, median:    13.12, standard deviation:     8.42. Resultcount was (pts) min/avg/max:        0/    275.96/     333, median:      288, standard deviation:    58.26.
  100 Queries for   1 measurements over 86400 minutes took (ms) min/avg/max:   527.22/ 1109.86/ 1741.76, median:  1107.97, standard deviation:   220.34. Resultcount was (pts) min/avg/max:        1/  11438.42/   12778, median:    11771, standard deviation:  2015.65.
10000 Queries for   4 measurements over     5 minutes took (ms) min/avg/max:     8.77/   23.99/  147.99, median:    19.02, standard deviation:    14.52. Resultcount was (pts) min/avg/max:        1/      3.83/       4, median:        4, standard deviation:     0.40.
10000 Queries for   4 measurements over    60 minutes took (ms) min/avg/max:     9.22/   27.19/  233.38, median:    21.39, standard deviation:    16.00. Resultcount was (pts) min/avg/max:       12/     45.90/      53, median:       48, standard deviation:     5.08.
 5871 - 1152Request error: (1) Failed to connect to localhost port 8428: Connection refused

The Benchmark starts with tiny queries for only one series over the minimum timespan of 5min. It then raises the timespan and series count step by step.

Phase	RSS
startup	59 MByte
after 10.000 Queries 1 series over last 5 min	64 MByte
after 10.000 Queries 1 series over last 1 hour	67 MByte
after 10.000 Queries 1 series over last 1 day	78 MByte
after 100 Queries 1 series over last 1.5 month¹	87 MByte
after 10.000 Queries 4 series over last 5 min	128 MByte
after 10.000 Queries 4 series over last 1 hour	168 MByte
after 10.000 Queries 4 series over last 1 day	178 MByte

¹ divided in 45 smaller consecutive 1 day queries
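
To make the growth per phase easier to see, the following sketch computes the RSS increase between consecutive rows of the table above. The values are copied from the table; nothing else is assumed.

```python
# RSS after each benchmark phase in MByte, copied from the table above.
rss_mbyte = [
    ("startup", 59),
    ("1 series, 5 min", 64),
    ("1 series, 1 hour", 67),
    ("1 series, 1 day", 78),
    ("1 series, 1.5 month", 87),
    ("4 series, 5 min", 128),
    ("4 series, 1 hour", 168),
    ("4 series, 1 day", 178),
]


def rss_deltas(rows):
    """Return (phase, increase over the previous row) for every row after the first."""
    return [(name, cur - prev) for (_, prev), (name, cur) in zip(rows, rows[1:])]


deltas = rss_deltas(rss_mbyte)
for name, delta in deltas:
    print(f"{name:>20}: +{delta} MByte")

total_growth = rss_mbyte[-1][1] - rss_mbyte[0][1]
print(f"total growth since startup: {total_growth} MByte")
```

The two largest steps (+41 and +40 MByte) fall in the first two 4-series phases, although those queries return at most 53 points each according to the benchmark output.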

Observed VM Metrics

I was asked to add the following metrics to the issue description. I'm happy to provide more if necessary.

Metric	Value
vm_allowed_memory_bytes	31457280
vm_available_memory_bytes	524210176
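
As a cross-check, these two values are consistent with the startup log: 30 MiB is exactly 31457280 bytes, and adding the 492752896 bytes that the log reports as left to the OS gives the vm_available_memory_bytes value. A short sketch of that arithmetic, using only numbers from this report:

```python
MIB = 1024 * 1024

allowed_bytes = 30 * MIB          # -memory.allowedBytes=30MiB
left_to_os_bytes = 492_752_896    # "leaving ... bytes to the OS" in the startup log
available_bytes = 524_210_176     # vm_available_memory_bytes

assert allowed_bytes == 31_457_280  # matches vm_allowed_memory_bytes
assert allowed_bytes + left_to_os_bytes == available_bytes

share = allowed_bytes / available_bytes
print(f"allowed share of available memory: {share:.1%}")
```

The allowed amount is about 6% of the memory VM sees as available. The log line words this as a limit on caches, not on total process memory.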

Expected Behaviour

I understand that the actual memory consumption does not depend on the provided command line flags alone, but what was definitely unexpected was the RSS rising indefinitely until no free memory pages were left.

I expected that VM would either

reject queries that can't be handled with the provided memory, or
garbage collect / clear caches when reaching a certain memory allocation.

Further comments

I even tried disabling the cache with -search.disableCache, but that didn't change VM's memory behaviour either, although the average query duration went up.
I'm not familiar with Go programming or Go runtime behaviour, although I have read about its greedy allocation scheme. I tried a run with the environment variables GOMEMLIMIT=60MiB and GOGC=100, but again to no avail: VM's behaviour was the same.
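
For reproducibility, a run with these Go runtime settings could be launched as sketched below. The binary path and flags are the ones from the startup log; the helper names are illustrative. Two caveats, as far as the Go documentation describes them: `GOMEMLIMIT` is a soft limit that the runtime (Go 1.19 or newer) tries to respect by collecting more aggressively, not a hard cap, and `GOGC=100` is the runtime default, so only values below 100 would make collection more frequent.

```python
import os
import subprocess

VM_BINARY = "/usr/bin/vm"  # path taken from the startup log
VM_FLAGS = [
    "-opentsdbHTTPListenAddr=:4242",
    "-retentionPeriod=5",
    "-memory.allowedBytes=30MiB",
    "-search.maxConcurrentRequests=1",
    "-search.maxMemoryPerQuery=4MiB",
]


def build_env(gomemlimit: str = "60MiB", gogc: str = "100") -> dict:
    """Return a copy of the current environment with the Go runtime knobs set."""
    env = dict(os.environ)
    env["GOMEMLIMIT"] = gomemlimit
    env["GOGC"] = gogc
    return env


def start_vm(env: dict) -> subprocess.Popen:
    """Start VictoriaMetrics with the given environment; the caller owns the process."""
    return subprocess.Popen([VM_BINARY, *VM_FLAGS], env=env)
```

A comparison run with a lower value, for example `start_vm(build_env(gogc="50"))`, would show whether more frequent collection changes the RSS curve at all.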

Epilog

I'm out of ideas on how to tame VM's memory consumption. I don't think I'm asking too much of it, since all queries are limited to 1-day timespans and only a handful of series. Even the biggest queries return around 300 data points.

If anyone has an idea, or even just a comment that I may be trying in vain because VM won't run with such a small amount of RAM, it would be much appreciated!
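
The figure of around 300 points follows from the 5 min sample rate: one day holds 1440 / 5 = 288 samples per series. A quick sketch of the expected result count per query window, using only the sample rate and the windows from this report:

```python
SAMPLE_INTERVAL_MIN = 5  # sample rate from the test data description


def expected_points(series: int, window_min: int) -> int:
    """Expected number of samples for `series` series over `window_min` minutes."""
    return series * (window_min // SAMPLE_INTERVAL_MIN)


for series, window in [(1, 5), (1, 60), (1, 1440), (4, 5), (4, 60), (4, 1440)]:
    print(f"{series} series over {window:>4} min: {expected_points(series, window)} points")
```

The first five values (1, 12, 288, 4, 48) match the median result counts in the benchmark output.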

Troubleshooting docs

[X] General - https://docs.victoriametrics.com/troubleshooting/
[ ] vmagent - https://docs.victoriametrics.com/vmagent/#troubleshooting
[ ] vmalert - https://docs.victoriametrics.com/vmalert/#troubleshooting

AndrewChubatiuk commented 6 months ago

Hey @aprospero, thanks for the question. Could you please share a memory profile?

aprospero commented 6 months ago

Hey AndrewChubatiuk, thanks for having a look into it!

Could you specify in more detail what you need? Do you mean specific profiler output from VM? I'm not familiar with Golang, so how can I extract that info?
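
For context, the startup log above shows that pprof handlers are exposed at http://127.0.0.1:8428/debug/pprof/. Assuming the standard Go `net/http/pprof` layout, where the heap profile is served at `/debug/pprof/heap`, a memory profile could be saved with a sketch like the following. The output file name `mem.pprof` and the helper names are placeholders.

```python
import urllib.request

PPROF_BASE = "http://127.0.0.1:8428/debug/pprof/"  # from the startup log


def heap_profile_url(base: str = PPROF_BASE) -> str:
    """Return the URL of the heap profile under a pprof base URL."""
    return base.rstrip("/") + "/heap"


def save_heap_profile(path: str = "mem.pprof", base: str = PPROF_BASE) -> int:
    """Download the heap profile to `path` and return the number of bytes written."""
    with urllib.request.urlopen(heap_profile_url(base), timeout=30) as response:
        data = response.read()
    with open(path, "wb") as out:
        return out.write(data)
```

Calling `save_heap_profile()` on the device while RSS is high should produce a file that can be attached to the issue.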

aprospero commented 5 months ago

@AndrewChubatiuk Bump
