VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0
12.2k stars 1.21k forks source link

vmagent: drop corrupted data blocks on start up #1030

Closed jelmd closed 3 years ago

jelmd commented 3 years ago

There seems to be a problem wrt. handling cached data by vmagent v1.52.0:

/usr/sbin/vmagent \
--remoteWrite.flushInterval=5s --httpListenAddr=127.0.0.1:8429 \
--promscrape.config=/data/vmagent/promscrape.json --promscrape.disableKeepAlive=false \
--promscrape.disableCompression --remoteWrite.url=http://vmdbA:8428/api/v1/write \
--remoteWrite.tmpDataPath=/data/vmagent
...
2021-01-22T15:06:38.344Z    info    VictoriaMetrics/app/vmagent/main.go:88  starting vmagent at "127.0.0.1:8429"...
2021-01-22T15:06:38.350Z    info    VictoriaMetrics/lib/memory/memory.go:43 limiting caches to 161940359577 bytes, leaving 107960239719 bytes to the OS according to -memory.allowedPercent=60
2021-01-22T15:06:38.350Z    error   VictoriaMetrics/lib/persistentqueue/persistentqueue.go:214  cannot read metainfo for persistent queue from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/metainfo.json": invalid data read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/metainfo.json": readerOffset=536871040 cannot exceed writerOffset=16701900; re-creating "/data/vmagent/persistent-queue/1_28103F80D27BCD31"
2021-01-22T15:06:38.364Z    info    VictoriaMetrics/lib/persistentqueue/fastqueue.go:51 opened fast persistent queue at "/data/vmagent/persistent-queue/1_28103F80D27BCD31" with maxInmemoryBlocks=200, it contains 0 pending bytes
2021-01-22T15:06:38.364Z    info    VictoriaMetrics/app/vmagent/remotewrite/client.go:130   initialized client for -remoteWrite.url="1:secret-url"
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/app/vmagent/main.go:111 started vmagent in 0.021 seconds
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/promscrape/scraper.go:91    reading Prometheus configs from "/data/vmagent/promscrape.json"
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:82 starting http server at http://127.0.0.1:8429/
2021-01-22T15:06:38.366Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:83 pprof handlers are exposed at http://127.0.0.1:8429/debug/pprof/
2021-01-22T15:06:38.371Z    info    VictoriaMetrics/lib/promscrape/scraper.go:344   static_configs: added targets: 3, removed targets: 0; total targets: 3
queue/persistentqueue.go:512    FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 8255741331277644361 bytes; cannot exceed 33554432 bytes
panic: FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 8255741331277644361 bytes; cannot exceed 33554432 bytes
...

The vmagent scrapes:

so 63720000 data pairs a day (~ 26 MB?).

hagen1778 commented 3 years ago

Hi @jelmd! The number of bytes 8255741331277644361 comes from a data block header, so it is likely a bug in vmagent (on flushing such blocks on disk) or data corruption. I'd encourage you to check if there was a probability of data corruption. But so far I anticipate it is a bug in vmagent. Thank you for report!

jelmd commented 3 years ago

Hi @hagen1778,

Well, the users on the related machines run experimental software which sometimes consumes all available memory (exactly to monitor this, I installed vmagent ;-)) and the machine needs to be rebooted, because no process can make any progress anymore. Not sure, what vmagent does in such situations, but at least I can say, that data series have gaps.

The fs used is ZFS, thus should be always consistent because of x-action based functionality.

hagen1778 commented 3 years ago

Are these "normal" kind of reboots (with SIGTERM to all processes) or "power-off" kind of reboots?

jelmd commented 3 years ago

ipmitool -I lanplus -H $host -U $admin -o supermicro chassis power reset

So more like hitting the reset button of the machine.

hagen1778 commented 3 years ago

Hm, while VM itself is resistant for power reset cases, the vmagent is not, unfortunately. It needs a proper shutdown procedure to finish on-disk writes. I guess, it may be marked as enhancement. Thanks for report!

jelmd commented 3 years ago

Ah, ok. Would be nice to have as much data as possible to be able to analyze "frozen" boxes like this. But getting it to busy, is bad as well ;-).

What is much more annoying is, that vmagent seems to never re-try unavailable prom scrape targets. E.g. if the machine gets rebooted and poor systemd starts vmagent before e.g. node-exporter, it seems to do nothing forever. So I need to restart vmagent manually to get any data again. But I guess, it is better to open another issue for it?

hagen1778 commented 3 years ago

Yes, please open another issue. Thanks!

jelmd commented 3 years ago

Hmm, it really, really, really sucks, that vmagent is not able to recover properly. It is of no use, if we have it to restart manually each time, after a reboot. Especially data collectors should be very robust to avoid loosing data. vmagent seems to be very fragile =:-((((

2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:12   build version: vmagent-v1.52.0 (v1.52.0)
2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:13   command line flags
2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "csvTrimTimestamp" = "1ms"
2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "dryRun" = "false"
2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "enableTCP6" = "false"
2021-02-22T17:09:32.928Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "envflag.enable" = "false"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "envflag.prefix" = ""
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "fs.disableMmap" = "false"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "graphiteListenAddr" = ""
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "graphiteTrimTimestamp" = "1s"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.connTimeout" = "2m0s"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.disableResponseCompression" = "false"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.idleConnTimeout" = "1m0s"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.maxGracefulShutdownDuration" = "7s"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.pathPrefix" = ""
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "http.shutdownDelay" = "0s"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "httpAuth.password" = "secret"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "httpAuth.username" = ""
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "httpListenAddr" = "127.0.0.1:8429"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "import.maxLineLen" = "104857600"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influx.maxLineSize" = "262144"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influxListenAddr" = ""
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influxMeasurementFieldSeparator" = "_"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influxSkipMeasurement" = "false"
2021-02-22T17:09:32.929Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influxSkipSingleField" = "false"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "influxTrimTimestamp" = "1ms"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "insert.maxQueueDuration" = "1m0s"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerDisableTimestamps" = "false"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerErrorsPerSecondLimit" = "0"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerFormat" = "default"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerLevel" = "INFO"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerOutput" = "stderr"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "loggerWarnsPerSecondLimit" = "0"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "maxConcurrentInserts" = "160"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "maxInsertRequestSize" = "33554432"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "memory.allowedBytes" = "0"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "memory.allowedPercent" = "60"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "metricsAuthKey" = "secret"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "opentsdbHTTPListenAddr" = ""
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "opentsdbListenAddr" = ""
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "opentsdbTrimTimestamp" = "1s"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "opentsdbhttp.maxInsertRequestSize" = "33554432"
2021-02-22T17:09:32.930Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "opentsdbhttpTrimTimestamp" = "1ms"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "pprofAuthKey" = "secret"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.config" = "/data/vmagent/promscrape.json"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.config.dryRun" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.config.strictParse" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.configCheckInterval" = "0s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.consulSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.disableCompression" = "true"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.disableKeepAlive" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.discovery.concurrency" = "100"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.discovery.concurrentWaitTime" = "1m0s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.dnsSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.dockerswarmSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.dropOriginalLabels" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.ec2SDCheckInterval" = "1m0s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.eurekaSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.fileSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.gceSDCheckInterval" = "1m0s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.kubernetesSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.maxDroppedTargets" = "1000"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.maxScrapeSize" = "16777216"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.openstackSDCheckInterval" = "30s"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.streamParse" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.suppressDuplicateScrapeTargetErrors" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "promscrape.suppressScrapeErrors" = "false"
2021-02-22T17:09:32.931Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.basicAuth.password" = "secret"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.basicAuth.username" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.bearerToken" = "secret"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.flushInterval" = "5s"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.label" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.maxBlockSize" = "8388608"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.maxDiskUsagePerURL" = "0"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.proxyURL" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.queues" = "4"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.relabelConfig" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.sendTimeout" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.showURL" = "false"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.significantFigures" = "0"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tlsCAFile" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tlsCertFile" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tlsInsecureSkipVerify" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tlsKeyFile" = "secret"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tlsServerName" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.tmpDataPath" = "/data/vmagent"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.url" = "secret"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "remoteWrite.urlRelabelConfig" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "tls" = "false"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "tlsCertFile" = ""
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "tlsKeyFile" = "secret"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/lib/logger/flag.go:20   flag "version" = "false"
2021-02-22T17:09:32.932Z    info    VictoriaMetrics/app/vmagent/main.go:88  starting vmagent at "127.0.0.1:8429"...
2021-02-22T17:09:32.935Z    info    VictoriaMetrics/lib/memory/memory.go:43 limiting caches to 161940359577 bytes, leaving 107960239719 bytes to the OS according to -memory.allowedPercent=60
2021-02-22T17:09:32.936Z    warn    VictoriaMetrics/lib/persistentqueue/persistentqueue.go:318                                                                                                                    "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000" size (6231028 bytes) is bigger than writer offset (6218989 bytes); this may be the case on unclean shutdown (OOM, `kill -9`, hardware reset); trying to fix it by adjusting fileSize to 6218989
2021-02-22T17:09:32.936Z    info    VictoriaMetrics/lib/persistentqueue/fastqueue.go:51                                                                                                                           opened fast persistent queue at "/data/vmagent/persistent-queue/1_28103F80D27BCD31" with maxInmemoryBlocks=200, it contains 12447 pending bytes
2021-02-22T17:09:32.937Z    info    VictoriaMetrics/app/vmagent/remotewrite/client.go:130                                                                                                                         initialized client for -remoteWrite.url="1:secret-url"
2021-02-22T17:09:32.938Z    info    VictoriaMetrics/app/vmagent/main.go:111 started vmagent in 0.006 seconds
2021-02-22T17:09:32.939Z    info    VictoriaMetrics/lib/promscrape/scraper.go:91                                                                                                                                  reading Prometheus configs from "/data/vmagent/promscrape.json"
2021-02-22T17:09:32.939Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:82                                                                                                                               starting http server at http://127.0.0.1:8429/
2021-02-22T17:09:32.939Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:83                                                                                                                               pprof handlers are exposed at http://127.0.0.1:8429/debug/pprof/
2021-02-22T17:09:32.944Z    info    VictoriaMetrics/lib/promscrape/scraper.go:344                                                                                                                                 static_configs: added targets: 3, removed targets: 0; total targets: 3
2021-02-22T17:09:32.966Z    error   VictoriaMetrics/app/vmagent/remotewrite/client.go:220                                                                                                                         couldn't send a block with size 12439 bytes to "1:secret-url": Post "http://mdbt:8428/api/v1/write": dial tcp4: lookup mdbt on 141.44.21.1:53: write udp 141.44.34.10:59438->141.44.21.1:53: write: operation not permitted; re-sending the block in 2.000 seconds
2021-02-22T17:09:33.225Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:235                                                                                                                              error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:35.225Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:235                                                                                                                              error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:37.225Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:235                                                                                                                              error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:39.226Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:235                                                                                                                              error when scraping "http://localhost:9400/metrics" from job "nvidia" with labels {gz="kino2",instance="localhost:9400",job="nvidia"}: error when scraping "http://localhost:9400/metrics": dial tcp4 127.0.0.1:9400: connect: connection refused; try -enableTCP6 command-line flag if you scrape ipv6 addresses
2021-02-22T17:09:42.943Z    panic   VictoriaMetrics/lib/persistentqueue/persistentqueue.go:512                                                                                                                    FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 1198702969903609204 bytes; cannot exceed 33554432 bytes
panic: FATAL: too big block size read from "/data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000": 1198702969903609204 bytes; cannot exceed 33554432 bytes

goroutine 59 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage(0xb11a68, 0x5, 0xc000cac000, 0xa1, 0x4)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:258 +0xc8c
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevelSkipframes(0x1, 0xb11a68, 0x5, 0xb15252, 0x9, 0xc0000a7d68, 0x1, 0x1)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:126 +0xd0
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(...)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:118
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:114
github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue.(*Queue).MustReadBlock(0xc0000f0360, 0xc000292360, 0x0, 0x120, 0x0, 0x0, 0x0, 0xc0000c4000)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue/persistentqueue.go:512 +0x219
github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue.(*FastQueue).MustReadBlock(0xc0001a0000, 0xc000292360, 0x0, 0x120, 0x0, 0x0, 0x0, 0x0)
    github.com/VictoriaMetrics/VictoriaMetrics/lib/persistentqueue/fastqueue.go:169 +0x146
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*client).runWorker(0xc000022360)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:164 +0xa3
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.newClient.func1(0xc000022360)
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:127 +0x54
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.newClient
    github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/client.go:125 +0x9da
jelmd commented 3 years ago
admin.box2 ~ > ll /data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000
-rw-r--r-- 1 daemon daemon 6234577 Feb 22 18:09 /data/vmagent/persistent-queue/1_28103F80D27BCD31/0000000000000000
hagen1778 commented 3 years ago

As a workaround, have you tried to add a startup script which will drop vmagent queue after reboot?

jelmd commented 3 years ago

Hmm, not sure, what you mean. IMHO good apps handle errors and do not just panic and put all the burden on the user, which actually (at least in my case) has no clue, whether it is really time run away and panic as well ... ;-)

valyala commented 3 years ago

good apps handle errors and do not just panic and put all the burden on the user

Agreed. We need to figure out how to handle the error properly.

A temporary solution is to run rm -rf /data/vmagent/persistent-queue/* before starting vmagent after unclean shutdown as @hagen1778 pointed out.

valyala commented 3 years ago

The related issue - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/687

valyala commented 3 years ago

@jelmd , vmagent should recover when reading from corrupted persistent queue at -remoteWrite.tmpDataPath starting from the commit 95dbebf51214ae459537e8b11e14720a3a587784 . This commit will be included in the next release.

valyala commented 3 years ago

vmagent should automatically recover when reading from corrupted persistent queue starting from v1.58.0. Closing the issue as fixed.