Closed — hansedong closed this issue 4 months ago
Hello @hansedong
The VMInsert component is responsible for sharding data between VMStorage nodes. It uses a consistent hash function, so samples with the same set of metric name + labels are always routed to the same VMStorage nodes. If there are 2 VMAgent instances scraping the same metrics, VMInsert will route those metrics to the same storage nodes, and the storage nodes will deduplicate them later during background merges. This way there will be exactly 2 copies of the data after deduplication is applied.
@zekker6
Thank you for your reply. I would like to understand this further.
> In case there will be 2 VMAgent instances scraping the same metrics VMInsert will route those metrics to the same storage nodes. Storage nodes will deduplicate the metrics later on during background merges.
My understanding is that a metric sample consists of a type, name, labels, value, and timestamp. If VMStorage merges data in the background, it needs a criterion for deciding whether two samples are complete duplicates. What is that criterion?
As I mentioned earlier:
> However, for multiple VMAgents, for example, instances agent-1-0 and agent-1-1, their instance startup cycle and data collection cycle may be different. For example, it is possible that agent-1-0 collects data in a cycle of 0-10-20, while agent-1-1 collects data in a time cycle of 5-15-25.
In a situation like this, the collection periods of the 2 vmagents are the same but the absolute timestamps are not (for example, if the vmagent instances start in a different order, or one of them restarts, the scrape interval stays the same but the absolute timestamps differ, producing different timestamp sequences). Will this cause VMStorage to incorrectly decide that duplicate data exists?
@hansedong
> My understanding is that Metrics data itself includes type, name, labels, value, and time series. If VMStorage merges data in the background, it needs a criterion to determine whether the data is completely duplicated. What is the standard for this judgment?
VMStorage fetches the stored values and timestamps for each unique time series. It then uses `-dedup.minScrapeInterval` to determine whether several values are stored within the same dedup interval. If so, it picks the sample with the highest timestamp (and the highest value when timestamps are equal) and leaves only that sample as the result.
> However, for multiple VMAgents, for example, instances agent-1-0 and agent-1-1, their instance startup cycle and data collection cycle may be different. For example, it is possible that agent-1-0 collects data in a cycle of 0-10-20, while agent-1-1 collects data in a time cycle of 5-15-25.
> In a situation like this, where the collection periods of 2 vmagents are consistent but the absolute timestamps are not (for example: if the instances of vmagent start in different orders or if there have been restarts of vmagent, even though the collection periods are the same, the absolute timestamps differ, leading to different time sequences). Will this cause VMStorage to incorrectly judge that duplicate data exists?
In this case `-dedup.minScrapeInterval` must be set to 5s, so that the following time ranges will be deduplicated: 0-5, 5-10, 10-15, 15-20, 20-25. With the example above, duplicated data points will be removed for the 0-5, 10-15 and 20-25 periods.
@zekker6 I now basically understand the VMStorage deduplication logic. Thanks a lot for your patient responses.
I have been trying to use VictoriaMetrics in our infrastructure to replace our previous Prometheus architecture for time-series storage. In particular, I hope VictoriaMetrics can give us both automated data collection and highly available data storage.
Based on my current understanding of VictoriaMetrics, in VMCluster data replication can be achieved by setting `replicationFactor: 2`. This means that when data is collected from a target, 2 replicas of it are stored across the VMStorage instances in the cluster. In addition, to solve the HA problem for VMAgent, the CRD also provides a `replicaCount: 2` field for running multiple VMAgent replicas. My doubt is whether this combination could lead to the data being stored as 4 copies.
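As a concrete sketch of the setup in question, using the VictoriaMetrics operator CRDs (field placement per my understanding of the operator; the exact spec layout may differ between operator versions):

```yaml
# VMCluster: each ingested sample is written to 2 vmstorage nodes.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: example
spec:
  replicationFactor: 2
  # vmstorage / vmselect / vminsert sections omitted
---
# VMAgent: 2 scraping replicas, each scraping every target,
# so every sample reaches vminsert twice before replication.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: example
spec:
  replicaCount: 2
```

My reading of the thread: 2 vmagent replicas × `replicationFactor: 2` means up to 4 physical copies exist on disk before deduplication; background merges then collapse the agent-level duplicates, leaving the 2 replication copies.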
Currently, I understand that in the VMAgent -> VMInsert -> VMStorage pipeline, VMStorage has a mechanism, `dedup.minScrapeInterval: 3s`, to avoid storing duplicate time-series data for the same target. However, for multiple VMAgents, for example instances agent-1-0 and agent-1-1, the instance startup times and collection schedules may differ. For example, agent-1-0 may collect data at times 0-10-20 while agent-1-1 collects at 5-15-25. In this case, does it mean that `dedup.minScrapeInterval: 3s` has no effect, ultimately leaving the data as 4 replicas? Hope to get clarification, thanks a lot.