zm1990s commented 1 year ago

Description

Antrea only monitors Controller/Agent status at the moment, and a healthy Controller/Agent status doesn't mean East-West connectivity is good. The metrics provided by Antrea also do not reflect Pod-to-Pod connectivity. From an application perspective, we need a tool that can detect and report Pod-to-Pod connectivity issues.

Core feature required

A tool (maybe a DaemonSet) that can generate East/West traffic periodically and check whether E/W connectivity is good. If any detection fails, alerts or logs should be sent to external monitoring tools.

The detection interval should be adjustable, as with traditional load balancers: for example, send a probe every 1 second and, when 3 consecutive probes fail, send out an alert.
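To make the "3 consecutive failures" idea concrete, here is a minimal Go sketch of the counting logic only. The `FailureTracker` type and its `Threshold` field are hypothetical names used for illustration, not anything that exists in Antrea. It fires once when the threshold is reached and re-arms after the next successful probe.

```go
package main

import "fmt"

// FailureTracker is a hypothetical helper: it reports an alert once a probe
// has failed Threshold times in a row, and re-arms after a success.
type FailureTracker struct {
	Threshold   int
	consecutive int
	alerting    bool
}

// Observe records one probe result and returns true only when a new alert
// should be sent (the first time the threshold is reached).
func (f *FailureTracker) Observe(ok bool) bool {
	if ok {
		f.consecutive = 0
		f.alerting = false
		return false
	}
	f.consecutive++
	if f.consecutive >= f.Threshold && !f.alerting {
		f.alerting = true
		return true
	}
	return false
}

func main() {
	tracker := &FailureTracker{Threshold: 3}
	for i, ok := range []bool{true, false, false, false, false, true} {
		fmt.Printf("probe %d ok=%v alert=%v\n", i, ok, tracker.Observe(ok))
	}
}
```

Sending the alert only on the transition avoids repeating it every second while the peer stays down.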

Other related features

Since we're building an E/W monitoring tool, other related Antrea features can be monitored too. For example:

Pod to service type ClusterIP
Pod to service type NodePort
Pod to external network via SNAT

tnqn commented 1 year ago

@zm1990s Thanks for the proposal. Monitoring E/W connectivity should be feasible. But having extra long-running Pods, especially a DaemonSet, deployed in the user's cluster for this purpose may not be wanted by most users. We can probably leverage the state of the memberlist run by antrea-agent as the source of connectivity status. Regarding alerts, K8s events associated with the unreachable Node may be one way. However, we need to ensure that duplicate events do not flood the event system. Using consistent hashing to select one "reporter" Node among the available Nodes may be feasible.
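As a rough illustration of the "single reporter" idea, the Go sketch below picks one Node deterministically from the set of reachable Nodes. Note that it uses rendezvous (highest-random-weight) hashing rather than a hash ring, simply because it is shorter to write; the function names and Node names are made up for the example and this is not how Antrea implements anything today. Every agent that sees the same set of alive Nodes computes the same reporter, so only that Node would emit the event.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes a (candidate node, key) pair. The zero byte separates the two
// strings so that ("ab", "c") and ("a", "bc") do not collide trivially.
func score(node, key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(node))
	h.Write([]byte{0})
	h.Write([]byte(key))
	return h.Sum64()
}

// selectReporter returns the alive node with the highest score for the
// unreachable node, or "" if no candidate exists. Ties are broken by name so
// the result does not depend on the order of the input slice.
func selectReporter(alive []string, unreachable string) string {
	best, bestScore := "", uint64(0)
	for _, n := range alive {
		if n == unreachable {
			continue
		}
		s := score(n, unreachable)
		if best == "" || s > bestScore || (s == bestScore && n < best) {
			best, bestScore = n, s
		}
	}
	return best
}

func main() {
	alive := []string{"node-a", "node-b", "node-c"}
	fmt.Println("reporter for node-d:", selectReporter(alive, "node-d"))
}
```

This only deduplicates events when the agents agree on membership; during a partition, two sides could still each elect a reporter.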

Regarding Pod-to-Service and Pod-to-External monitoring, I'm not sure it would be really helpful or practicable to proactively generate traffic. It's not easy to even know whether a particular access is supposed to succeed, given different policy, firewall, and network topology configurations, not to mention that generated traffic on behalf of the user's application or towards the user's application may not be wanted by many users. I think in practice most such tools are implemented as scripts/playbooks executed out-of-band using the user's own application, according to what they want to monitor.

cc @jianjuns @antoninbas @salv-orlando

tnqn commented 1 year ago

However, it seems monitoring EW connectivity via memberlist would just be a faster way to get the notification of Node unreachable event compared with the K8s's native Node status. If users just want that status reported faster, they can also just update node-monitor-grace-period, so I'm still wondering what value this can really add.

tnqn commented 1 year ago

A tool (like an antctl subcommand) for smoke testing may be the most practicable way in the end.

antoninbas commented 1 year ago

> However, it seems monitoring EW connectivity via memberlist would just be a faster way to get the notification of Node unreachable event compared with the K8s's native Node status.

I think that from an Antrea perspective, it would be good to monitor the health of the overlay network (in encap mode) by running some ping-mesh across all gateways. Being able to report latency across Nodes would also be quite nice, but I don't think we can do that with memberlist (IIRC, we discussed that in the past). With latency data available, we could even display a heat map in Antrea UI and update it in real-time.

jianjuns commented 1 year ago

Agree with what @antoninbas said.

tnqn commented 1 year ago

Without the need for latency data, I think the health of the overlay network matches the health status of memberlist in practice. Apart from a misconfiguration in which the memberlist port is whitelisted but the overlay port is not, which could only occur when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work. But if we want to add latency data, I agree memberlist may not achieve it (however, I don't quite remember discussing this; could you share a link if there is one?).

antoninbas commented 1 year ago

> Without the need for latency data, I think the health of the overlay network matches the health status of memberlist in practice. Apart from a misconfiguration in which the memberlist port is whitelisted but the overlay port is not, which could only occur when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work.

I think overlay (ping between gateway) is a bit more "end-to-end". In addition to port whitelisting, we could potentially detect issues like a missing route on the host (granted, that has not happened in a while, but we used to have such issues). I was thinking that with the right "probe" (e.g. a TCP data exchange), the health check would also fail in case of checksum issue (basically any issue with the NIC configuration that is specific to double encapsulation).
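To illustrate what a "TCP data exchange" probe could look like, here is a small Go sketch. It assumes an echo listener on the remote side (simulated here on loopback by the hypothetical `startEcho` helper); a real probe would target the peer's gateway IP, and the port and payload are placeholders. The probe fails on connect errors, timeouts, or a payload that comes back altered, and returns the round-trip time otherwise.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
	"time"
)

// probe connects to addr, sends payload, and expects the same bytes back
// within timeout. It returns the time taken for the whole exchange.
func probe(addr string, payload []byte, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	if err := conn.SetDeadline(start.Add(timeout)); err != nil {
		return 0, err
	}
	if _, err := conn.Write(payload); err != nil {
		return 0, err
	}
	reply := make([]byte, len(payload))
	if _, err := io.ReadFull(conn, reply); err != nil {
		return 0, err
	}
	if !bytes.Equal(reply, payload) {
		return 0, fmt.Errorf("payload mismatch from %s", addr)
	}
	return time.Since(start), nil
}

// startEcho runs a loopback echo server that stands in for a remote Node.
func startEcho() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				io.Copy(c, c) // echo everything back to the sender
			}()
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := startEcho()
	if err != nil {
		panic(err)
	}
	defer stop()
	rtt, err := probe(addr, []byte("antrea-probe"), time.Second)
	fmt.Println("rtt:", rtt, "err:", err)
}
```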

> But if we want to add latency data, I agree memberlist may not achieve it (however, I don't quite remember discussing this; could you share a link if there is one?).

The latency heat map is something that has been on my mind for a while. I remember someone telling me that Weave had something like this, but I can't find a reference to it. I brought this up very superficially when we added memberlist as a dependency: https://github.com/antrea-io/antrea/issues/2128#issuecomment-827257525

zm1990s commented 1 year ago

@tnqn I think this tool should be decoupled from Antrea Controller/Agent, just like nsx-interworking. Users can decide whether they need to use it or not.

antoninbas commented 1 year ago

Assigning to @tushartathgur who said he would look into this. cc @yuntanghsu as well.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

antoninbas commented 10 months ago

We have submitted this issue as a project idea for the LFX mentorship program: https://github.com/cncf/mentoring/pull/1129. Ideally, no one should work on this issue until we know whether the proposal is accepted and whether we can match a mentee to work on it. See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program.

prakrit55 commented 10 months ago

@antoninbas, I am greatly interested in the project. How do I reach you guys on Slack, or are there other options? Though I have an intermediate knowledge of k8s and golang, I am curious to know how much frontend work is involved here. Thank you.

antoninbas commented 10 months ago

@prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)

but for this specific issue, please see comment above (https://github.com/antrea-io/antrea/issues/5514#issuecomment-1906707462). If you are interested, you could consider applying for the LFX mentorship program.

prakrit55 commented 10 months ago

> @prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)
>
> but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.

Hey, thank you @antoninbas, I found your channel. I would really like to apply for the LFX mentorship for it, in the March-May term.

btwshivam commented 10 months ago

@antoninbas The prospect of working collectively on a comprehensive project like this is truly exciting, and I am keen on contributing my skills and enthusiasm to its success. The outlined sub-projects align perfectly with my interests, and they present a great opportunity for learning, growth, and industry exposure. I look forward to contributing to the project and learning from the experience.

nate-double-u commented 10 months ago

Hello, everyone. I'm pleased to see how many folks are interested in participating in the LFX Mentorship Program.

Upstream issues like this are an excellent place to discuss specific technical topics or provide ideas about how you may tackle a problem; however, please post any questions about the LFX program and how to apply on the mentorship discussion forums (and indeed, some of these questions may have already been answered there, or on the Program Guidelines page).

antoninbas commented 9 months ago

For all the folks who have applied or are considering applying to one of the Antrea projects for the LFX mentorship program, we have published instructions to complete test tasks: https://github.com/antrea-io/antrea/discussions/5976. We will review your submissions for these tasks alongside other material (resume, cover letter) when selecting mentees. The deadline for submitting is February 20th 5PM PST.

antoninbas commented 9 months ago

@IRONICBo will work on this as part of the LFX mentorship program

IRONICBo commented 8 months ago

Monitoring tool API design proposal

Monitoring tool needs a uniform config

Users and administrators need a way to measure and monitor network performance, specifically the latency between nodes, to ensure optimal cluster performance and troubleshoot potential issues.

Watch a singleton CRD

The proposed solution is to introduce a new Custom Resource Definition (CRD) called PingMonitoringToolConfig in Antrea. This CRD will allow users to enable and configure a ping monitoring tool that measures the latency between nodes. The configuration will include parameters such as the ping interval, timeout, and concurrency limit.
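One way to read the "concurrency limit" parameter is as a cap on how many pings are in flight at once. The Go sketch below shows that interpretation with a buffered channel used as a semaphore; `probeAll` and the target addresses are hypothetical, and the probe function is a stub where a real ping would go.

```go
package main

import (
	"fmt"
	"sync"
)

// probeAll calls probe once per target while keeping at most limit probes
// running at the same time, and collects the result for every target.
func probeAll(targets []string, limit int, probe func(string) error) map[string]error {
	if limit < 1 {
		limit = 1
	}
	results := make(map[string]error, len(targets))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit) // one slot per allowed in-flight probe
	for _, target := range targets {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit probes are already running
		go func(target string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := probe(target)
			mu.Lock()
			results[target] = err
			mu.Unlock()
		}(target)
	}
	wg.Wait()
	return results
}

func main() {
	targets := []string{"10.0.1.1", "10.0.2.1", "10.0.3.1"}
	results := probeAll(targets, 2, func(target string) error {
		return nil // a real probe would ping the target here
	})
	fmt.Println(len(results), "targets probed")
}
```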

The Antrea agents will listen for changes to this CRD and adjust their monitoring behavior accordingly. When the monitoring feature is enabled through the feature gate and config, each agent will watch creation, update, and deletion events for this CRD and start, stop, or reconfigure the monitoring tool in real time.
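The start/stop/update behavior could be sketched roughly as follows in Go. This leaves out the Kubernetes watch itself and only shows how a handler might react to each event; `Monitor`, `Config`, and `Apply` are hypothetical names, and passing `nil` stands in for a deletion event.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Config is a hypothetical, already-parsed view of the CRD spec.
type Config struct {
	Interval time.Duration
}

// Monitor runs tick on a timer and can be reconfigured or stopped at runtime.
type Monitor struct {
	mu   sync.Mutex
	stop chan struct{}
	tick func()
}

// Apply handles one event: a non-nil cfg (create/update) restarts the probe
// loop with the new interval; nil (delete) or a non-positive interval stops it.
func (m *Monitor) Apply(cfg *Config) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.stop != nil {
		close(m.stop) // stop the previous loop, if any
		m.stop = nil
	}
	if cfg == nil || cfg.Interval <= 0 {
		return
	}
	stop := make(chan struct{})
	interval := cfg.Interval
	m.stop = stop
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				m.tick()
			}
		}
	}()
}

// Running reports whether a probe loop is currently active.
func (m *Monitor) Running() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.stop != nil
}

func main() {
	m := &Monitor{tick: func() { fmt.Println("probing peers") }}
	m.Apply(&Config{Interval: 50 * time.Millisecond}) // resource created
	time.Sleep(120 * time.Millisecond)
	m.Apply(&Config{Interval: 20 * time.Millisecond}) // resource updated
	time.Sleep(50 * time.Millisecond)
	m.Apply(nil) // resource deleted
	fmt.Println("running:", m.Running())
}
```

Restarting the loop on every update is the simplest option; an implementation that cares about not skewing probe timing might reset the ticker instead.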

Additionally, a singleton pattern will be enforced using a validation webhook to ensure that only one instance of the CRD exists in the cluster.

Use Feature Gate & Config & CRD to start monitoring tool

The solution introduces a new user-facing feature that allows users to enable and configure the ping monitoring tool via a YAML config file. Users can apply this YAML file using kubectl to create or update the PingMonitoringToolConfig resource.

The changes will be automatically picked up by the Antrea agents, and the monitoring behavior will be updated accordingly. This feature provides users with a structured and easy-to-consume API for enabling and configuring the ping mesh feature.

Main design/architecture

The main design involves the following components:

CRD Definition: A new CRD PingMonitoringToolConfig will be defined with fields for enabling the tool, ping interval, timeout, and concurrency limit. Here is an example of the CRD definition in Go:

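```
// Package name assumed from the apiVersion used in the YAML example.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PingMonitoringToolConfigSpec holds the tunable parameters of the ping monitoring tool.
type PingMonitoringToolConfigSpec struct {
	// Durations are expressed as strings such as "10s".
	PingInterval        string `json:"pingInterval,omitempty"`
	PingTimeout         string `json:"pingTimeout,omitempty"`
	PingConcurrentLimit int    `json:"pingConcurrentLimit,omitempty"`
}

// PingMonitoringToolConfig is the cluster-wide configuration resource.
type PingMonitoringToolConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PingMonitoringToolConfigSpec `json:"spec,omitempty"`
}
```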

Singleton Pattern Enforcement: A validation webhook will be implemented to ensure that only one instance of the PingMonitoringToolConfig resource can exist in the cluster. This webhook will reject the creation of additional instances if one already exists.
Agent Behavior: Antrea agents will listen for changes to the PingMonitoringToolConfig resource and update their monitoring behavior based on the configuration. The agents will use a Kubernetes client to watch for changes to the resource and adjust their ping interval, timeout, and concurrency limit accordingly.
Monitoring Logic: The ping monitoring tool will measure the latency between nodes and provide metrics that can be used for monitoring and troubleshooting. The tool will use ICMP ping requests to measure the latency between nodes.

Here is an example of a YAML configuration file for the PingMonitoringToolConfig resource:
```
apiVersion: networking.antrea.io/v1alpha1
kind: PingMonitoringToolConfig
metadata:
  name: default
spec:
  pingInterval: "10s"
  pingTimeout: "5s"
  pingConcurrentLimit: 10
```
In this example, the ping monitoring tool is enabled with a ping interval of 10 seconds, a ping timeout of 5 seconds, and a concurrency limit of 10.
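Since the interval and timeout are strings in the spec, the agent would have to parse and sanity-check them before use. A possible Go sketch is below. The `Spec` and `Settings` types are simplified stand-ins, and the rules (positive durations, timeout not longer than the interval, limit of at least 1) are assumptions of this example rather than part of the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// Spec is a simplified stand-in for PingMonitoringToolConfigSpec.
type Spec struct {
	PingInterval        string
	PingTimeout         string
	PingConcurrentLimit int
}

// Settings is the parsed form that a probe loop could consume directly.
type Settings struct {
	Interval        time.Duration
	Timeout         time.Duration
	ConcurrentLimit int
}

// parseSpec converts the string fields to durations and applies the
// (assumed) validation rules described above.
func parseSpec(s Spec) (Settings, error) {
	interval, err := time.ParseDuration(s.PingInterval)
	if err != nil {
		return Settings{}, fmt.Errorf("invalid pingInterval %q: %w", s.PingInterval, err)
	}
	timeout, err := time.ParseDuration(s.PingTimeout)
	if err != nil {
		return Settings{}, fmt.Errorf("invalid pingTimeout %q: %w", s.PingTimeout, err)
	}
	if interval <= 0 || timeout <= 0 {
		return Settings{}, fmt.Errorf("pingInterval and pingTimeout must be positive")
	}
	if timeout > interval {
		return Settings{}, fmt.Errorf("pingTimeout %s exceeds pingInterval %s", timeout, interval)
	}
	if s.PingConcurrentLimit < 1 {
		return Settings{}, fmt.Errorf("pingConcurrentLimit must be at least 1")
	}
	return Settings{Interval: interval, Timeout: timeout, ConcurrentLimit: s.PingConcurrentLimit}, nil
}

func main() {
	settings, err := parseSpec(Spec{PingInterval: "10s", PingTimeout: "5s", PingConcurrentLimit: 10})
	fmt.Println(settings.Interval, settings.Timeout, settings.ConcurrentLimit, err)
}
```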

Alternative solutions

Using a ConfigMap: Instead of a CRD, a ConfigMap could be used to configure the ping monitoring tool. However, this approach lacks the structure and validation capabilities provided by CRDs.
Using antctl API Server: This would require registering an API server and using antctl to update the parameters of the monitoring tool. We would also need to consider the uniformity and observability of the configuration parameters across the cluster.

This proposal aims to provide a flexible and user-friendly way to monitor node-to-node latency in a Kubernetes cluster, enhancing the observability and manageability of the network performance in Antrea-managed clusters.

Dyanngg commented 8 months ago

A validation webhook won't be necessary if we simply add an OpenAPI validation rule which constrains the name of the CRD object created. See https://github.com/kubernetes-sigs/network-policy-api/blob/main/apis/v1alpha1/baselineadminnetworkpolicy_types.go#L29 as an example
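For reference, the name constraint boils down to something like the Go sketch below. The marker shown in the comment is my assumption of how such a rule would be declared with controller-gen for this CRD, and `validateName` is a hypothetical function that only mirrors what the API server would enforce; with the declarative rule, no such code would need to run in Antrea.

```go
package main

import "fmt"

// singletonName is the only object name the rule would accept.
const singletonName = "default"

// With controller-gen, the constraint could be declared on the API type with
// a marker along these lines (assumed for this CRD, not taken from Antrea):
//
//	// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'default'"
//
// validateName mirrors the effect of that rule: any other name is rejected,
// so at most one object of this cluster-scoped kind can exist.
func validateName(name string) error {
	if name != singletonName {
		return fmt.Errorf("only one PingMonitoringToolConfig named %q is allowed, got %q", singletonName, name)
	}
	return nil
}

func main() {
	fmt.Println(validateName("default"))
	fmt.Println(validateName("second-config"))
}
```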

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

antoninbas commented 3 months ago

With the addition of the NodeLatencyMonitor feature in Antrea v2.1 (thanks @IRONICBo!), I will now close this issue. Additional capabilities for this feature can be added over time. An issue has been created for the addition of a latency visualization dashboard in the Antrea UI: https://github.com/antrea-io/antrea-ui/issues/455

antrea-io / antrea

East/west connectivity monitoring tool #5514
