VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0

Problem with null values #6198

Open perederyaev opened 3 weeks ago

perederyaev commented 3 weeks ago

Describe the bug

Hi, VictoriaMetrics v1.93.14 processes samples with null values in an unexpected way. When VM receives metrics like

{"metric":{"__name__":"namedprocess_namegroup_thread_context_switches_total","job":"process_exporter","instance":"srv82","ctxswitchtype":"nonvoluntary","groupname":"mtpro - php  -dmemory_limit=640M run.php --controller=TimeSpent\\RawLowConsumer --action=default --useSwoole=1 --pid=1858 --maxWorkTime=3600 --timeout=5000 --partition=278","threadname":"grpc_global_tim"},"values":[null,null,null,null,null,null,null,null],"timestamps":[1714159866695,1714159926695,1714159986694,1714160046694,1714160106695,1714160166694,1714160226695,1714160286695]}
{"metric":{"__name__":"namedprocess_namegroup_thread_context_switches_total","job":"process_exporter","instance":"srv82","ctxswitchtype":"voluntary","groupname":"mtpro - php  -dmemory_limit=640M run.php --controller=TimeSpent\\RawLowConsumer --action=default --useSwoole=1 --pid=1858 --maxWorkTime=3600 --timeout=5000 --partition=335","threadname":"rdk:broker11111"},"values":[null,null,null,null,null,null,null,null],"timestamps":[1714159866695,1714159926695,1714159986694,1714160046694,1714160106695,1714160166694,1714160226695,1714160286695]}

it doesn't insert them (they can't be seen in export or in the cardinality explorer in VMUI) and doesn't count them in active series, but it does count them as 'slow inserts' and 'cache misses' (storage/tsid). This is a critical issue because it affects the performance of the whole ingestion path.

The same metrics are processed correctly in v1.87.14: they are counted in active series, we can access them, and there are no strange 'slow inserts', 'cache misses' and so on.

To Reproduce

Try to insert null values into a clean database (so no metrics exist beforehand) on VM v1.93.14.

Version

victoria-metrics-20240419-095826-tags-v1.93.14-0-g345a53d8b0

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

Haleygo commented 3 weeks ago

Hello, yes, VictoriaMetrics drops "null" values received via /api/v1/import at storage time: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5e8c087d4244a4d82e11c1428e9699d2a00b6cb7/lib/storage/storage.go#L1811-L1816

But as far as I can see, this behavior hasn't changed since /api/v1/import added support for ingesting values like null in v1.82.0. https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5e8c087d4244a4d82e11c1428e9699d2a00b6cb7/lib/protoparser/vmimport/parser.go#L142-L143

So if you import a time series with all "null" values like your example above, the whole time series will be dropped, and it won't be counted in vm_slow_row_inserts_total since it's never inserted. But if you import a time series where only some of the values are "null", like "values":[3,null,13], the time series will be registered and won't be marked as a slow insert next time.
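For anyone who wants to check this themselves, here is a minimal sketch (not from the issue) that pushes one all-null series and one mixed series to /api/v1/import; the address, metric names, and timestamps are made up for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Series with only null values: per the comment above, it should be
	// dropped entirely and never registered.
	allNull := `{"metric":{"__name__":"demo_all_null"},"values":[null,null,null],"timestamps":[1714159866695,1714159926695,1714159986694]}`

	// Series with a mix of null and real values: it should be registered,
	// with the null samples simply skipped at storage time.
	mixed := `{"metric":{"__name__":"demo_mixed"},"values":[3,null,13],"timestamps":[1714159866695,1714159926695,1714159986694]}`

	body := allNull + "\n" + mixed + "\n"
	resp, err := http.Post("http://localhost:8428/api/v1/import", "application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("import status:", resp.Status)
}
```

If the behavior described above holds, demo_mixed should then show up in VMUI while demo_all_null should not.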

The same metrics are processed correctly in v1.87.14: they are counted in active series, we can access them, and there are no strange 'slow inserts', 'cache misses' and so on.

Did you ingest the same values into v1.87.14, with all the values set to "null"?

perederyaev commented 3 weeks ago

Hi Haleygo, we are using vmagent to send metrics to VM. In this case /api/v1/write is used, not /api/v1/import.

Did you ingest the same values into v1.87.14, with all the values set to "null"?

We have vmagent sending the same metrics to VM v1.87.14 and v1.93.14 with exactly the same settings. Only v1.93.14 has the issue with registering new time series if they have null values and hadn't existed before.

Haleygo commented 3 weeks ago

OK, that's not expected. What is the version of vmagent here? Did you test with a target that was down for a while and find that the NaN values are missing only in v1.93.14? I did a quick test with vmsingle v1.93.14 (vmsingle shares the same code with vmagent and the vmcluster storage) and the NaN handling works. My test steps were:

  1. set up vmsingle v1.93.14 to scrape a target for a few minutes;
  2. stop the target for a while;
  3. check the results (see the attached screenshot).

Only v1.93.14 has the issue with registering new time series if they have null values and hadn't existed before.

You mean the new time series start with null values, i.e. the target exposes metrics with null values? In my test, the NaN value is attached automatically by vmagent as a stale marker.

perederyaev commented 3 weeks ago

What is the version of vmagent here?

1.93.14

You mean the new time series start with null values, i.e. the target exposes metrics with null values?

We have process_exporter, which is scraped by vmagent, and the metrics are then sent via one more vmagent to two VMs, v1.93.14 and v1.87.14. In vmagent's log we see: 2024-04-29T17:56:44.033Z warn VictoriaMetrics/lib/promscrape/scrapework.go:387 cannot scrape target "http://127.0.0.1:9256/metrics" 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: the response from "http://127.0.0.1:9256/metrics" exceeds -promscrape.maxScrapeSize=16777216 (the actual response size is 359579335 bytes); either reduce the response size for the target or increase -promscrape.maxScrapeSize

In v1.87.14 I see the metric as NaN in VMUI and as null in export (see the attached screenshot). In v1.93.14 I see no inserted metrics, but a slow insert and a cache miss every minute with each new scrape cycle.

Looks like it's somehow related to staleness markers, but I'm not sure how to reproduce the issue from scratch. Please check this tcpdump: vm_bug.pcap.zip. The first TCP stream is to v1.93.14 (127.0.0.1); it's not inserted, with a 'slow insert' and 'cache miss' every minute. The second stream is to v1.87.14 (10.111.150.2); it's inserted and has no 'slow insert' or 'cache miss' every minute.

perederyaev commented 3 weeks ago

Managed to reproduce it with promremotecli: I just modified it to send "staleNaNBits uint64 = 0x7ff0000000000002". So when I send "0x7ff0000000000002" as the value for a new metric to VM v1.93.14, it doesn't register it and instead increases slow inserts and cache misses.
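For context, a rough standalone equivalent of that modified promremotecli run could look like the sketch below (assumptions: the standard prompb and snappy libraries, a local vmsingle on :8428, and a made-up metric name). It remote-writes a single sample whose value carries the stale-marker bit pattern for a series that has never existed before, which is the scenario that triggers the slow inserts and cache misses described here:

```go
package main

import (
	"bytes"
	"fmt"
	"math"
	"net/http"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// Prometheus staleness marker: a NaN with a reserved bit pattern.
	staleNaN := math.Float64frombits(0x7ff0000000000002)

	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "demo_new_series"}, // hypothetical metric name
				{Name: "job", Value: "stale_nan_repro"},
			},
			Samples: []prompb.Sample{{
				Value:     staleNaN,
				Timestamp: time.Now().UnixMilli(),
			}},
		}},
	}

	raw, err := req.Marshal()
	if err != nil {
		panic(err)
	}
	compressed := snappy.Encode(nil, raw)

	httpReq, err := http.NewRequest(http.MethodPost, "http://localhost:8428/api/v1/write", bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("remote write status:", resp.Status)
}
```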

Haleygo commented 2 weeks ago

VictoriaMetrics should stop creating new time series when it receives a staleness marker for a new time series.

Related to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5069.

In VictoriaMetrics, there are two different NaN values. One is called staleNaN (uint64 bits = 0x7ff0000000000002): https://github.com/VictoriaMetrics/VictoriaMetrics/blob/d386a68b59ec669ef42cddc0b8fab8145f14ebdd/lib/decimal/decimal.go#L407-L409 The other one is NormalNaN, i.e. math.NaN() (uint64 bits = 0x7ff8000000000001).

vmagent or vmsingle only generates staleNaN values when metrics go missing, e.g. when a target goes down; see this doc for details. When scraping a target which exposes a metric like metric1_0{bar="baz"} NaN, or importing data which contains metric1_0{bar="baz"} null via the /import APIs, VictoriaMetrics recognizes values like "null", "NaN", "nan" and sets them to NormalNaN instead of staleNaN.

Then when it comes to storage, VictoriaMetrics can tell the difference between staleNaN and NormalNaN, and only stores staleNaN values (NormalNaN values are dropped). https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5e8c087d4244a4d82e11c1428e9699d2a00b6cb7/lib/storage/storage.go#L1811-L1816
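To illustrate the distinction (this is a standalone snippet, not VictoriaMetrics code), the two NaN flavours differ only in their bit patterns, so a bit-level comparison is needed to tell them apart:

```go
package main

import (
	"fmt"
	"math"
)

// Prometheus staleness-marker bit pattern, as referenced in this thread.
const staleNaNBits uint64 = 0x7ff0000000000002

var staleNaN = math.Float64frombits(staleNaNBits)

// isStaleNaN reports whether f carries the staleness-marker bit pattern.
// A plain math.IsNaN check cannot tell the two apart, since both are NaNs.
func isStaleNaN(f float64) bool {
	return math.Float64bits(f) == staleNaNBits
}

func main() {
	normalNaN := math.NaN() // bits 0x7ff8000000000001 in Go

	fmt.Printf("normalNaN: IsNaN=%v isStaleNaN=%v\n", math.IsNaN(normalNaN), isStaleNaN(normalNaN))
	fmt.Printf("staleNaN:  IsNaN=%v isStaleNaN=%v\n", math.IsNaN(staleNaN), isStaleNaN(staleNaN))
}
```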


Judging from the raw samples in v1.87.14, there are consistent NaN values stored in VictoriaMetrics. This could happen when your target exposes metrics with "NaN" values and the target is flapping up and down (generating staleNaN).

But in v1.93.14, "NormalNaN" is dropped as always, and "staleNaN" isn't considered a valid value for a new series and is dropped as well, so the series won't be registered. This won't happen if the time series has "real" values, at least from time to time. Could you please elaborate on your use case here: why store time series with only NaN values?

perederyaev commented 2 weeks ago

Could you please elaborate on your use case here: why store time series with only NaN values?

We don't need to store time series with only NaN values. We want VM to be fast and stable when it (for whatever reason) gets a lot of NaNs. In our case we just switched traffic from one VM cluster to another and hit this ingestion-performance issue. We gathered metrics from process_exporter and sent them via vmagent to the "old" cluster. When we added the remote-write URL of the "new" cluster with v1.93.14 to vmagent, it started to have performance issues because every minute it received millions of NaNs. One more thing: it was difficult to identify the cause of the issue. We saw only strange slow inserts and cache misses, and at the same time a small number of active series and a low churn rate.

hagen1778 commented 1 week ago

We don't need to store time series with only NaN values. We want VM to be fast and stable when it (for whatever reason) gets a lot of NaNs.

This looks like a very narrow case. In your example, you're trying to ingest StaleNaNs, a reserved type of NaN used for staleness detection. VM accepts StaleNaNs only if the series for this sample was registered before with a value different from StaleNaN. There is no sense in recording/registering a series which contains only StaleNaNs. To verify whether a series contains only StaleNaNs, VM does a cache and index lookup, which is counted as a cache miss and a slow insert.

vmagent will create a stale marker in two cases:

  1. a target that was previously successfully scraped was removed from the scrape targets;
  2. a target returned a list of metrics different from the previous scrape. In this case, vmagent will find the metric names from the previous scrape that are missing in the current scrape and will send stale markers only for those series.

either reduce the response size for the target or increase -promscrape.maxScrapeSize

I wasn't able to get vmagent to send stale markers with this error. It is likely something weird is happening to the vmagents in your setup. Could you try setting -promscrape.noStaleMarkers on the vmagent side and see if the issue can still be reproduced?
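For reference, a hypothetical vmagent invocation with that flag might look like the following (the config path and remote-write URL are placeholders for your own setup):

```
./vmagent -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=http://<victoria-metrics-host>:8428/api/v1/write \
  -promscrape.noStaleMarkers
```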