futurewei-cloud / alcor-control-agent

Cloud native SDN platform - network control agent
MIT License

ACA Segment Faulting when started after busybox container is started #229

Open kiran1048 opened 3 years ago

kiran1048 commented 3 years ago

On a compute node, start a busybox container and assign an IP/MAC to the container instance through:

docker run -itd --name --net=none busybox sh
ovs-docker add-port br-int eth1 --ipaddress= --macaddress=

This creates a bridge br-int. Thereafter, when you start ACA on the same compute node, ACA crashes with a segmentation fault, as shown below:

$ ./build/bin/AlcorControlAgent -d
ACA_OVS_L2_Programmer::setup_ovs_bridges_if_need ---> Entering
ACA_OVS_L2_Programmer::execute_ovsdb_command ---> Entering
Executing command: ovs-vsctl br-exists br-int
Trying to init a new sub to connect to the NCM
After initing a new sub to connect to the NCM
Streaming capable GRPC server listening on 0.0.0.0:50001
Command succeeded!
Elapsed time for system command took: 4480 microseconds or 4 milliseconds.
Elapsed time for ovsdb client call took: 4536 microseconds or 4 milliseconds.
rc: 0
ACA_OVS_L2_Programmer::execute_ovsdb_command <--- Exiting, rc = 0
ACA_OVS_L2_Programmer::execute_ovsdb_command ---> Entering
Executing command: ovs-vsctl br-exists br-tun
Command failed!!! rc: 512
Elapsed time for system command took: 4017 microseconds or 4 milliseconds.
Elapsed time for ovsdb client call took: 4074 microseconds or 4 milliseconds.
rc: 512
ACA_OVS_L2_Programmer::execute_ovsdb_command <--- Exiting, rc = 512
Invalid environment br-int=1 and br-tun=0, cannot proceed
ACA_OVS_L2_Programmer::setup_ovs_bridges_if_need <--- Exiting, overall_rc = 1
Segmentation fault (core dumped)
(root:FW0009098):/root/pingtest/alcor-control-agent [master]
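A note on the `rc: 512` in the log: it is consistent with a raw wait status as returned by `system()`. That ACA runs the command this way is an assumption (the "system command" wording in the log suggests it, but nothing here confirms it). On Linux, a status of 512 decodes to exit code 2, which is the code `ovs-vsctl br-exists` uses when the bridge does not exist. A minimal sketch of the decoding, where `exit_code_from_status` is a hypothetical helper and not an ACA function:

```cpp
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS (POSIX)

// Hypothetical helper, not part of ACA: convert the raw status returned by
// std::system() into the child's exit code. Returns -1 if system() itself
// failed or the child did not exit normally (e.g. it was killed by a signal).
int exit_code_from_status(int status) {
    if (status != -1 && WIFEXITED(status)) {
        return WEXITSTATUS(status);
    }
    return -1;
}
```

With this helper, the status 512 from the `br-tun` check maps to exit code 2, and the status 0 from the `br-int` check maps to exit code 0.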

er1cthe0ne commented 3 years ago

It would be a good idea to fix the segmentation fault, using gdb to pinpoint the offending code.

The actual problem is shown in the log: "Invalid environment br-int=1 and br-tun=0, cannot proceed". br-int was created by the ovs-docker command, which left the node in a state where br-int exists but br-tun doesn't. ACA doesn't know how to proceed in this unusual environment.
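The environment check described above can be sketched as a small classification over the two `br-exists` results. This is only an illustration: `BridgeEnv` and `classify_bridges` are hypothetical names, and the handling of the "both present" and "both absent" cases is an assumption inferred from the function name `setup_ovs_bridges_if_need`, not taken from the ACA source.

```cpp
// Hypothetical classification of the OVS bridge state seen at startup.
enum class BridgeEnv {
    Ready,       // both bridges exist (assumed: nothing to set up)
    NeedsSetup,  // neither exists (assumed: ACA creates both)
    Invalid      // exactly one exists: the case reported in this issue
};

// Hypothetical helper, not part of ACA.
BridgeEnv classify_bridges(bool br_int_exists, bool br_tun_exists) {
    if (br_int_exists && br_tun_exists) {
        return BridgeEnv::Ready;
    }
    if (!br_int_exists && !br_tun_exists) {
        return BridgeEnv::NeedsSetup;
    }
    return BridgeEnv::Invalid;
}
```

The state from the log, br-int=1 and br-tun=0, falls into the `Invalid` branch.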

zzxgzgz commented 3 years ago

@er1cthe0ne

The issue is caused here: https://github.com/futurewei-cloud/alcor-control-agent/blob/master/src/aca_main.cpp#L221

In some of our test scenarios, we might start some busybox containers and use the ovs-docker add-port ... command to add a port for the container, which creates br-int (br-tun remains non-existent).

When we call the aca_ovs_l2_programmer::ACA_OVS_L2_Programmer::get_instance().setup_ovs_bridges_if_need(); function, it finds that br-int exists but br-tun does not, and it does nothing except print this log line:

Invalid environment br-int=%d and br-tun=%d, cannot proceed

In the lines that follow, ACA tries to monitor the non-existent br-tun, which causes the segfault.

If the main function returns at this point, the segmentation fault should be avoided.
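A rough sketch of that early return, under stated assumptions: `setup_bridges_stub` is a hypothetical stand-in for `setup_ovs_bridges_if_need()`, `start_agent` is a hypothetical stand-in for the relevant part of `main`, and the `monitor_started` flag is only a placeholder for whatever code later monitors br-tun. None of these names come from the ACA source.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for setup_ovs_bridges_if_need(): fails when exactly
// one of the two bridges exists, mirroring the log message in this issue.
int setup_bridges_stub(bool br_int_exists, bool br_tun_exists) {
    if (br_int_exists != br_tun_exists) {
        std::fprintf(stderr,
                     "Invalid environment br-int=%d and br-tun=%d, cannot proceed\n",
                     br_int_exists ? 1 : 0, br_tun_exists ? 1 : 0);
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

// Hypothetical startup path: check the setup result and return before any
// code that depends on br-tun gets a chance to run.
int start_agent(bool br_int_exists, bool br_tun_exists, bool* monitor_started) {
    *monitor_started = false;
    int rc = setup_bridges_stub(br_int_exists, br_tun_exists);
    if (rc != EXIT_SUCCESS) {
        std::fprintf(stderr,
                     "ACA is not able to create the bridges, please check your environment\n");
        return rc;  // bail out instead of monitoring a missing bridge
    }
    *monitor_started = true;  // placeholder for starting the br-tun monitor
    return EXIT_SUCCESS;
}
```

The design choice is simply to treat a failed bridge setup as fatal at startup, so the process exits with a clear message instead of crashing later.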

kiran1048 commented 3 years ago

As @zzxgzgz suggested, adding a check at that point prevents the crash:

$ ./build/bin/AlcorControlAgent -d
ACA_OVS_L2_Programmer::setup_ovs_bridges_if_need ---> Entering
ACA_OVS_L2_Programmer::execute_ovsdb_command ---> Entering
Executing command: ovs-vsctl br-exists br-int
Trying to init a new sub to connect to the NCM
After initing a new sub to connect to the NCM
Streaming capable GRPC server listening on 0.0.0.0:50001
Command succeeded!
Elapsed time for system command took: 4449 microseconds or 4 milliseconds.
Elapsed time for ovsdb client call took: 4503 microseconds or 4 milliseconds.
rc: 0
ACA_OVS_L2_Programmer::execute_ovsdb_command <--- Exiting, rc = 0
ACA_OVS_L2_Programmer::execute_ovsdb_command ---> Entering
Executing command: ovs-vsctl br-exists br-tun
Command failed!!! rc: 512
Elapsed time for system command took: 3980 microseconds or 3 milliseconds.
Elapsed time for ovsdb client call took: 4039 microseconds or 4 milliseconds.
rc: 512
ACA_OVS_L2_Programmer::execute_ovsdb_command <--- Exiting, rc = 512
Invalid environment br-int=1 and br-tun=0, cannot proceed
ACA_OVS_L2_Programmer::setup_ovs_bridges_if_need <--- Exiting, overall_rc = 1
ACA is not able to create the bridges, please check your environment