I fixed the environment issue: sudo wasn't passing PYTHONPATH through correctly. That's now handled in the Jenkins build configuration.
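(For reference, one way to keep PYTHONPATH intact across sudo - which strips it by default via env_reset - is to re-inject it explicitly on the sudo command line. This is only a sketch of the idea, not the actual Jenkins change; it assumes the build step shells out to nosetests under sudo, with the module names used later in this issue.)

import os
import subprocess

# Sketch only: sudo normally resets the environment, so pass PYTHONPATH back
# in explicitly via `env`. Assumes the Jenkins build step runs nosetests
# under sudo, which may not match the real configuration.
pythonpath = os.environ.get("PYTHONPATH", "")
subprocess.check_call([
    "sudo", "env", "PYTHONPATH=" + pythonpath,
    "nosetests", "test_LocalController.py", "test_RyuTranslateInterface.py",
])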
Current problem (which appeared once PYTHONPATH was fixed): the OVS switch instance isn't being created. The DB is being updated, so ovs-vsctl show works, but ovs-ofctl show br_ovs fails. Details below:
root@955bfc3d3cc9:/home/jenkins# ovs-ofctl show br_ovs
ovs-ofctl: br_ovs is not a bridge or a socket
root@955bfc3d3cc9:/home/jenkins# ovs-vsctl show
c6d2b3e7-94e6-44aa-b8b7-0cd045059890
Manager "ptcp:6640"
Bridge br_ovs
Port br_ovs
Interface br_ovs
type: internal
ovs_version: "2.6.2"
root@955bfc3d3cc9:/home/jenkins# ovs-ofctl show br_ovs
ovs-ofctl: br_ovs is not a bridge or a socket
root@955bfc3d3cc9:/home/jenkins#
Below are the errors from creating the OVS bridge:
root@955bfc3d3cc9:/home/jenkins# tail /var/log/openvswitch/ovs-vswitchd.log
2019-08-21T15:36:54.028Z|00008|memory|INFO|4416 kB peak resident set size after 176.7 seconds
2019-08-21T15:36:54.029Z|00009|dpif|WARN|failed to create datapath ovs-system: Operation not permitted
2019-08-21T15:36:54.029Z|00010|ofproto_dpif|ERR|failed to open datapath of type system: Operation not permitted
2019-08-21T15:36:54.029Z|00011|ofproto|ERR|failed to open datapath br_ovs: Operation not permitted
2019-08-21T15:36:54.029Z|00012|bridge|ERR|failed to create bridge br_ovs: Operation not permitted
Chasing this one down today.
Found the solution thanks to https://mail.openvswitch.org/pipermail/ovs-discuss/2015-February/036452.html - I just needed to run the container as privileged. No code changes needed.
sudo docker run --name ssh-slave --network jenkins-test-network --privileged sdonovan:ssh-slave-mod "ssh-rsa <SSHKEY> jenkins"
So yesterday's check-ins cleaned up a few of the failures I was seeing (3 of 20). All of these tests work just fine in isolation; those three issues were mostly due to the cleanup of virtual switches.
Now, the rest of the errors are all connected: once I fix whatever the cause is, all of the errors will go away. Like I said, these tests run just fine in isolation.
sudo nosetests test_LocalController.py test_RyuTranslateInterface.py
This is the minimal command that reproduces the 17 (or maybe 16 now?) failures.
sudo nosetests test_LocalController.py; sudo nosetests test_RyuTranslateInterface.py
This, strangely, works. Notice that it's just splitting the two sets of tests into two processes. Weird.
sudo nosetests test_RyuTranslateInterface.py test_LocalController.py
Notice that this is just the same tests, reversed. This also works just fine.
So, there's something fishy w/r/t timing(?) or ordering or some sort of state that I haven't figured out yet.
sudo nosetests test_LocalController.py:LocalControllerTest.test_rule_installation_4 test_RyuTranslateInterface.py:RyuTranslateTests.test_trans_match_multi
An even smaller reproduction: just one test from each module.
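Until the shared-state culprit is found, one possible workaround - it just automates the two-process split that works above, and is only a sketch - would be to drive each test module from its own process:

import subprocess
import sys

# Workaround sketch: run each nose module in its own process so any state
# that leaks between modules dies with that process. Mirrors the manual
# "two separate nosetests invocations" behavior that works.
modules = ["test_LocalController.py", "test_RyuTranslateInterface.py"]
failed = [m for m in modules if subprocess.call(["sudo", "nosetests", m]) != 0]
sys.exit(1 if failed else 0)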
Huh. So I've been beating this all day, trying different little things, trying to figure out what's going awry.
What I've found is that after the LC test is run, during the RyuTranslateTest, either Ryu doesn't seem to be running OR the virtual switch is not connecting to Ryu. That's next to find out.
Hmm... progress? Information?
root 29497 16.0 0.7 298504 56528 ? Ss 15:42 0:00 /bin/python /bin/ryu-manager --app-list /home/sdx/dev/localctlr/RyuTranslateInterface.py --log-dir . --log-file ryu.log --verbose --ofp-tcp-listen-port 6633 --atlanticwave-lcname atl --atlanticwave-conffile /home/sdx/dev/localctlr/tests/rtitest.manifest
[sdx@localhost tests]$ sudo kill -9 29497
[sdx@localhost tests]$ ps aux | grep ryu-manager | grep -v grep
root 29497 2.9 0.0 0 0 ? Zs 15:42 0:00 [ryu-manager] <defunct>
[sdx@localhost tests]$ ps aux | grep ryu-manager | grep -v grep
root 29497 2.7 0.0 0 0 ? Zs 15:42 0:00 [ryu-manager] <defunct>
[sdx@localhost tests]$ ps aux | grep ryu-manager | grep -v grep
root 29497 2.0 0.0 0 0 ? Zs 15:42 0:00 [ryu-manager] <defunct>
[sdx@localhost tests]$ ps aux | grep ryu-manager | grep -v grep
Line 1 - ps - is during the LocalController test.
Line 2 - kill -9 - is after LocalController test has finished cleanup (during a sleep() period)
Lines 3,4,5 - During the RyuTranslateInterface test
Line 6 - including the empty line - is after the test is finished.
So, defunct is described in the ps man page:
Processes marked <defunct> are dead processes (so-called "zombies")
that remain because their parent has not destroyed them properly.
These processes will be destroyed by init(8) if the parent process
exits.
So, the final question is how do I handle this part?
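Assuming ryu-manager is started via subprocess.Popen somewhere in the LC fixture (an assumption on my part; adjust for however it's actually spawned), the defunct entry sticks around because nothing ever wait()s on the child. A minimal teardown sketch:

def stop_ryu(ryu_process):
    # Sketch: ryu_process is assumed to be the subprocess.Popen handle used to
    # launch ryu-manager. Killing the child is not enough on its own: the
    # parent must also wait() on it, otherwise ps keeps showing it as
    # <defunct> until the parent exits.
    if ryu_process.poll() is None:
        ryu_process.kill()   # equivalent to the manual `kill -9`
    ryu_process.wait()       # reap the child so nothing is left as a zombie

# Hypothetical usage in a fixture teardown:
#   proc = subprocess.Popen(["ryu-manager", "--verbose", "..."])
#   ...
#   stop_ryu(proc)

If the process is launched indirectly (e.g., through a shell or sudo), the same rule applies to whichever parent actually forked it.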
Ok, down to three errors with Jenkins. My manual tests were done in the order below:
export PYTHONPATH=.:/home/sdx/mininet:/home/sdx/dev:/home/sdx/ryu; sudo pkill ryu-manager; sudo mn -c; sudo nosetests test_LocalController.py test_RyuControllerInterface.py test_RyuTranslateInterface.py
and then
export PYTHONPATH=.:/home/sdx/mininet:/home/sdx/dev:/home/sdx/ryu; sudo pkill ryu-manager; sudo mn -c; sudo nosetests
Have to look at the (huge) logs from Jenkins to find out where the issues are coming from. I'm wondering if I need to replicate the little loop I have for finding out if the switch is connected to the LC, since it takes a variable, and rather unpredictable, amount of time to reconnect to Ryu.
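For reference, a minimal version of that kind of wait-for-connection loop might look like the following; the timings are placeholders, and the check just polls the OVSDB Controller table, which may differ from how the real loop does it:

import subprocess
import time

def wait_for_controller(timeout=60, interval=1):
    # Sketch: poll OVSDB until some Controller record reports
    # is_connected=true, or give up after `timeout` seconds. Timings are
    # placeholders; the real loop may check something else entirely.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.check_output(
                ["ovs-vsctl", "--columns=is_connected", "list", "controller"])
        except subprocess.CalledProcessError:
            out = b""
        if b"true" in out:
            return True
        time.sleep(interval)
    return False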
Not sure where to put this in the codebase, but I'm attaching a pared-down Jenkins configuration script. All tests (finally) pass in my local instance: all 262 tests passing, with 67% of conditionals covered.
Need to get this running on the RNOC cloud infrastructure next.
Basic setup of Jenkins and whatever infrastructure is necessary (Docker images, configuration scripts, etc.). First, we're going to do this on a local VM, then move it to the RNOC infrastructure once I get the hang of how to use it.