cilium / cilium-cli

CLI to install, manage & troubleshoot Kubernetes clusters running Cilium
https://cilium.io
Apache License 2.0
417 stars 210 forks source link

Add multicast subcommand #2574

Closed yushoyamaguchi closed 2 months ago

yushoyamaguchi commented 4 months ago

Proposal / RFE

Currently, for creating multicast groups and registering subscribers, it is necessary to enter the cilium-agent pod on each node and use the cilium-dbg command. This is fine but does not scale when we need to use multicast with many nodes in the cluster. https://github.com/cilium/cilium/pull/29469 https://github.com/cilium/cilium/pull/31355 https://github.com/cilium/cilium/pull/32612 For example, when using multicast application such as ROS 2 and DDS as distributed application, we have many nodes need to join the multicast group to establish the endpoint discovery in application layer.

Therefore, I would like to propose implementing commands in cilium-cli that creates a specified address group inside all cilium-agents and adds all other nodes to that group. Additionally, I would like to implement commands for deletion and status checking. https://github.com/cilium/cilium/pull/32612#discussion_r1605542681

Suggested command examples

These are just examples and we hope to change them for the better through the discussion.

#add all nodes as subscriber of 239.1.1.1 in all cilium-agent pods
>cilium multicast group 239.1.1.1 add all
#delete 239.1.1.1 group in all nodes
>cilium multicast group 239.1.1.1 delete all
#show list of nodes which have matched multicast group 
>cilium multicast group 239.1.1.1 list
nodeA , nodeB , nodeD
#show all lists of multicast group
cilium multicast list --all 

Furthermore

In the future, it would be beneficial to introduce commands like the following:

#add the specified node as subscriber of 239.1.1.1 in all cilium-agent pod
>cilium multicast group 239.1.1.1 add <node-name or node IP address>
#remove the specified node from 239.1.1.1 group in all cilium-agent pod
>cilium multicast group 239.1.1.1 delete <node-name or node IP address>
#add all nodes as subscriber of 239.1.1.1 by cilium-dbg
>cilium-dbg bpf multicast subscriber add 239.255.0.1 10.244.1.86
yushoyamaguchi commented 3 months ago

Currently, these operation is required for setting multicast. Obviously, this does not scale. https://github.com/fujitatomoya/ros_k8s/issues/40#issue-2209772781

Ideally, in the future, it may be a desired way to manage multicast group by manifest yaml as cilium custom resource. However, now I think we should extend the CLI to make it easier to do a uniform setup, first.

harsimran-pabla commented 3 months ago

Adding summary of what was discussed in today's community meeting and some of my thoughts :

Overall if this solves your immediate problem, it is a reasonable approach as long as we understand the limitations. With the proposed Cilium-CLI changes, I think we can go ahead with cluster-wide add/delete and list commands.

In the longer term, if there is value in more robust control over this configuration, we should make this declarative configuration per node via CRDs ( with node selector etc ), instead of managing per node configuration via cilium-cli.

cc @joestringer

fujitatomoya commented 3 months ago

@harsimran-pabla thanks for posting the note from today's conf call.

i guess my take is that this cilium-cli could be an extension of debug utility, at least listing current configuration makes sense. and it still works for certain use cases especially development with understanding the limitations as you described.

In the longer term, if there is value in more robust control over this configuration, we should make this declarative configuration per node via CRDs ( with node selector etc ), instead of managing per node configuration via cilium-cli.

totally agree, and this is required as well for production environment. probably as follow-up we can have dedicated issue for further discussion, but something i do not quite figure out via CRD (e.g node selector) is, what if we want to configure different multicast ip range with different node groups...

anyway, we can proceed to PR on this.

again, really appreciate your quick response!

Cheers, Tomoya

yushoyamaguchi commented 3 months ago

Thank you for comments. I'm going to start implementing.

harsimran-pabla commented 3 months ago

Thanks @yushoyamaguchi I will assign this issue to you.

yushoyamaguchi commented 3 months ago

Thank you.

yushoyamaguchi commented 3 months ago
# list 
## list group
cilium multicast list group
## list subscriber
cilium multicast list subscriber --all
cilium multicast list subscriber --group-ip 239.255.0.1
# add 
cilium multicast add --group-ip 239.255.0.1
# del 
cilium multicast del --group-ip 239.255.0.1

I changed CLI interfaces like that and I 've finished to implement prototype.

yushoyamaguchi commented 3 months ago

@harsimran-pabla

I'm sorry, I want to ask one question. Implementing test is required for merging? I think it is difficult to implement meaningful unit test in multicast subcommand, because the main function of this command is to run the cilium-dbg command on each cilium-agent...

harsimran-pabla commented 3 months ago

Hi @yushoyamaguchi, Yeah, I understand adding unit tests in CLI could be challenging. Depending on the complexity of the change, we can see if it makes sense to add them.

However, we need multicast e2e test cases controlled via these CLIs. We do not need to add these e2e test cases in the same PR, but they should be added subsequently.

joestringer commented 3 months ago

I would usually advocate for ensuring the test cases are available when merging the PR for a couple of reasons - (1) if there are no tests, then there's no guarantee the code works when you merge it. The code could break at any time. (2) It's easy to get excited about implementing the functionality, then lose interest in testing. The result can be that we end up in a situation with no tests, which makes it harder to detect and reproduce issues reported by users.

fujitatomoya commented 3 months ago

@yushoyamaguchi just following up with @joestringer and @harsimran-pabla , it depends on the situation and complexity... but most likely unit tests are required to add or modify the options and features with incoming PRs to make sure that does not break afterwards. i believe this is one of the most important things especially community and maintenance to keep the consistent behavior, otherwise we might not be able to realize that feature is broken somewhere sometime... this it gonna generate more complication and effort for the community.

after all, i would highly suggest we should have e2e test with this multicast command lines in the same PR, so that we can actually see if those are working in the flesh..

@joestringer @harsimran-pabla thanks for understanding and your kind comments.

harsimran-pabla commented 3 months ago

Thanks for taking up e2e test case development. One of the challenges of not being able to get e2e test and CLI change in the same PR is the way tests are set up.

I think this PR can contain the CLI and connectivity tests (validate those tests locally) and open a parallel PR in the cilium/cilium repo for running the test in CI. Do you have any suggestions, @joestringer?

fujitatomoya commented 3 months ago

@harsimran-pabla Ah, i mean,

e2e tests are written under connectivity tests in the cilium/cilium-cli repo.

this one is required as 1st step for sure as connectivity test.

and then we can work on cilium/cilium github actions.

joestringer commented 3 months ago

@harsimran-pabla that sounds sensible yeah. Part of the CLI implementation would be implementing detection of multicast support so that the multicast tests only run when multicast is enabled in Cilium, so in theory we should be able to integrate that testing code into cilium/cilium-cli at any time (the tests should be skipped on older versions of Cilium without multicast support, or run the test on newer versions with multicast support). The CLI typically performs its own CI testing against the latest stable Cilium release, currently v1.15.5. Presumably we'll bump that to v1.16.x Cilium after we publish a stable release in that series in the next couple of months.

Separately, let's say you've integrated the test into cilium/cilium-cli, then those tests will become part of the next cilium-cli release. After that CLI release, the cilium/cilium workflows get updated by renovate bot to pull in the new CLI version (+tests) into here for various CI workflows (currently CLI v0.16.10). Presumably multicast is disabled by default in cilium/cilium, so it wouldn't run the test by default. You would need to make a subsequent change to cilium/cilium workflows to either (a) enable multicast in an existing e2e testsuite (perhaps as a new matrix configuration), or (b) write a new github workflow in cilium/cilium to specifically test multicast.

I know this is a bit of a pain to line up, so I'm open to merging parts of this effort in different repositories if it expedites the continuing efforts. I just wanted to ensure we don't lose sight of the testing and maintainability aspects by focusing too heavily on the implementation. We can also make new CLI releases if it's needed in order to start consuming those tests from cilium/cilium. Feel free to reach out here or on Slack #development channel to coordinate this or any other questions.

Ultimately we probably want to end up with multicast e2e tests in both the cilium repo (to validate the underlying implementation) and CLI repo (to ensure the CLI commands properly configure the implementation). Each aspect would help detect and avoid regressions in different codepaths.

harsimran-pabla commented 3 months ago

Thanks @joestringer it makes perfect sense.

cc @ldelossa. I know e2e multicast testing was under your radar as well. (I hope you haven't started on it already.)

yushoyamaguchi commented 3 months ago

@harsimran-pabla @joestringer @fujitatomoya Thank you for your comments.

I am working on another issue on multicast for connectivity test and would like to proceed in parallel there. https://github.com/cilium/cilium-cli/issues/2615

As for the multicast subcommand, I think it is ok not to implement unit tests like the URL, so I would like to submit a PR without it. https://github.com/cilium/cilium-cli/blob/main/bgp/peers_test.go#L66

If you see the code that this kind of unit test seems necessary, I would appreciate it if you could point it out on the PR.

cc @ldelossa I'm sorry for adding cc. It is related to connectivity test.

yushoyamaguchi commented 3 months ago

I submitted the PR https://github.com/cilium/cilium-cli/pull/2620 . Thank you.