facebook / fboss

Facebook Open Switching System Software for controlling network switches.
Other
866 stars 301 forks source link

Regression: crash on startup #74

Open bluecmd opened 6 years ago

bluecmd commented 6 years ago

Hi,

We just updated from e440fcd0e1abfbcea5a08b18494d54bb78c561d6 to current master. With the same configuration we now get:

$ sudo  /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json -alsologtostderr -v=10000
[ ... ]
xLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:37.147341  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
I0927 20:07:38.148428  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551589 next 18446744073709551590
I0927 20:07:38.148479  1304 EventBase.cpp:690] latest 18446744073709551589 next 18446744073709551590
I0927 20:07:38.148499  1304 EventBase.cpp:488] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) startWork_ 37729045998319
I0927 20:07:38.148531  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148555  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148582  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148607  1304 EventBase.cpp:356] EventBase 0x7ffcc88d4570 did not timeout  loop time guess: 1001350 idle time: 1001272 busy time: 78 avgLoopTime: 80.0261 maxLatencyLoopTime: 80.0261 maxLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:38.148648  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
Common SDK init completed
I0927 20:07:38.959753  1325 BcmSwitch.cpp:588]  Performing cold boot
I0927 20:07:38.959548  1325 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
I0927 20:07:39.149735  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551590 next 18446744073709551591
I0927 20:07:39.149780  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551591
I0927 20:07:39.149821  1304 EventBase.cpp:488] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) startWork_ 37730047318753
I0927 20:07:39.149868  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149914  1304 EventBase.cpp:690] latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149950  1304 EventBase.cpp:690] latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149984  1304 EventBase.cpp:356] EventBase 0x7ffcc88d4570 did not timeout  loop time guess: 1001366 idle time: 1001241 busy time: 125 avgLoopTime: 102.76 maxLatencyLoopTime: 102.76 maxLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:39.150044  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
*** Aborted at 1538078859 (unix time) try "date -d @1538078859" if you are using GNU date ***
PC: @     0x563ea0887664 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 1304 (TID 0x7fe0448f3700) from PID 16; stack trace: ***
    @     0x7fe05c0ae0c0 (unknown)
    @     0x563ea0887664 std::unique_ptr<>::get()
    @     0x563ea0887682 std::unique_ptr<>::operator->()
    @     0x563ea0882a38 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
    @     0x563ea0870a9a facebook::fboss::BcmSwitch::getColdBootSwitchState()
    @     0x563ea087305c facebook::fboss::BcmSwitch::init()
    @     0x563ea09f9fde facebook::fboss::SwSwitch::init()
    @     0x563ea092e57d facebook::fboss::Initializer::initImpl()
    @     0x563ea092de9c facebook::fboss::Initializer::initThread()
    @     0x563ea09366ac _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x563ea0936639 _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
    @     0x563ea09365ea _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
    @     0x563ea0936591 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x563ea09363cd std::_Bind_simple<>::operator()()
    @     0x563ea0936194 std::thread::_State_impl<>::_M_run()
    @     0x7fe055195e6f (unknown)
    @     0x7fe05c0a4494 start_thread
    @     0x7fe05490aacf clone
    @                0x0 (unknown)
Segmentation fault
dhtech@wedge1:~$ sudo  /usr/local

To me it looks like queueManager_ is not initialized by the time that https://github.com/facebook/fboss/commit/de9ad1ff2acc7a680eadb141c8786400c45027f0 introduced the call to getMulticastQueueSettings.

We will try to debug this, will post any findings.

bluecmd commented 6 years ago

Config: https://gist.github.com/bluecmd/82298eabea8a7a4bf7d447e263f125b5

bluecmd commented 6 years ago

My working thesis is that https://github.com/facebook/fboss/blob/master/fboss/agent/hw/bcm/oss/BcmControlPlane.cpp is missing an init of queueManager_

EDIT: We've tried initing queueManager like this: https://github.com/dhtech/fboss/commit/837bf7f0922a66fbfee13dbbe044cd8b3b7a6c04 without success. Current thesis is that controlPlane in BcmSwitch might not be correctly set. initTables for example seem to set it, but I cannot find any call for the function.

capveg commented 6 years ago

Thanks for the report - can you confirm before the update everything was working correctly? (sorry if it's a dumb question - I'm trying not to sound too surprised :-)

capveg commented 6 years ago

Also, unfortunately it's not trivial for me to test the OSS internally, so if you could compile with debugging options (similar to what's described here: https://stackoverflow.com/questions/7724569/debug-vs-release-in-cmake), we can get a stack trace with line numbers and can make a better guess as to what broke. Thanks for the interest!

bluecmd commented 6 years ago

We were running at the stated commit, or at least one around that commit, in June and it certainly worked. The same switch was unboxed from storage and the only thing we did was:

1) Modify the config to set port speed 1) Realize we have to patch fboss to support that transceiver we want to use 1) Compile a new fboss from head 1) Install libsodium23 from Debian stretch backports as that seems to be a new dependency 1) Start fboss - boom

We can certainly compile a version at the previously stated commit and run with it to verify the bisection window, and we will change the config back to the one we ran successfully in June.

bluecmd commented 6 years ago

And we will get back to you with the result of the cmake -DCMAKE_BUILD_TYPE=Debug run

bluecmd commented 6 years ago

This is the log for the startup based on:

dhtech@r1a0:~/fboss$ git status
HEAD detached at e440fcd
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   fboss/agent/ApplyThriftConfig.cpp
    modified:   getdeps.sh

no changes added to commit (use "git add" and/or "git commit -a")
dhtech@r1a0:~/fboss$ git diff
diff --git a/fboss/agent/ApplyThriftConfig.cpp b/fboss/agent/ApplyThriftConfig.cpp
index cc087e7..79f2bd8 100644
--- a/fboss/agent/ApplyThriftConfig.cpp
+++ b/fboss/agent/ApplyThriftConfig.cpp
@@ -328,7 +328,7 @@ shared_ptr<SwitchState> ThriftConfigApplier::run() {
     // Make sure there is a one-to-one map between vlan and interface
     // Remove this sanity check if multiple interfaces are allowed per vlans
     auto& entry = vlanInterfaces_[vlanInfo.first];
-    if (entry.interfaces.size() != 1) {
+    if (entry.interfaces.size() > 1) {
       auto cpu_vlan = newState->getDefaultVlan();
       if (vlanInfo.first != cpu_vlan) {
         throw FbossError("Vlan ", vlanInfo.first, " refers to ",
diff --git a/getdeps.sh b/getdeps.sh
index ec60709..19f8fc2 100755
--- a/getdeps.sh
+++ b/getdeps.sh
@@ -117,9 +117,9 @@ NPROC=$(grep -c processor /proc/cpuinfo)
     fi
     # iproute2 v4.4.0
     update https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git 7ca63aef7d1b0c808da0040c6b366ef7a61f38c1
-    update https://github.com/facebook/folly.git
-    update https://github.com/facebook/wangle.git
-    update https://github.com/facebook/fbthrift.git
+    update https://github.com/facebook/folly.git v2018.06.04.00
+    update https://github.com/facebook/wangle.git v2018.06.04.00
+    update https://github.com/facebook/fbthrift.git v2018.06.04.00
     update https://github.com/no1msd/mstch.git
     update https://github.com/facebook/zstd.git
     update https://github.com/google/googletest.git release-1.8.0

That should be the state where we ran it in June. This is the startup log:

dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 17:48:54.393641 23401 WedgeProductInfo.cpp:136] json parse error on line 0: expected json value
E0928 17:48:54.393896 23401 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
E0928 17:49:08.767557 23439 QsfpCache.cpp:166] Exception talking to qsfp_service: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0928 17:49:08.795991 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache
E0928 17:49:08.901234 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache

The transceiver error is due to I'm not having qsfp_service running, I currently have a weird QSFP in which makes qsfp_service crash, but I don't believe it's relevant here (and I'm remotely present so I cannot unplug it).

Building from master:

# docker run --name fboss_build_phase2 -v `pwd`:/tmp/code fboss/tmp /bin/sh -c "cd /tmp/code && mkdir -p build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. && make -j`grep -c processor /proc/cpuinfo`"
dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 18:20:48.462908 23573 WedgeProductInfo.cpp:140] json parse error on line 0: expected json value
E0928 18:20:48.463168 23573 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
I0928 18:20:48.507074 23573 Main.cpp:338] serving on localhost on port 5909
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
I0928 18:21:02.347016 23593 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
*** Aborted at 1538158862 (unix time) try "date -d @1538158862" if you are using GNU date ***
PC: @     0x564a3693b2d4 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 23573 (TID 0x7f70048f3700) from PID 16; stack trace: ***
    @     0x7f701be5e0c0 (unknown)
    @     0x564a3693b2d4 std::unique_ptr<>::get()
    @     0x564a3693b2f2 std::unique_ptr<>::operator->()
    @     0x564a369366a8 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
    @     0x564a3692470a facebook::fboss::BcmSwitch::getColdBootSwitchState()
    @     0x564a36926ccc facebook::fboss::BcmSwitch::init()
    @     0x564a36aae464 facebook::fboss::SwSwitch::init()
    @     0x564a369e2a03 facebook::fboss::Initializer::initImpl()
    @     0x564a369e2322 facebook::fboss::Initializer::initThread()
    @     0x564a369eab32 _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x564a369eaabf _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
    @     0x564a369eaa70 _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
    @     0x564a369eaa17 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x564a369ea853 std::_Bind_simple<>::operator()()
    @     0x564a369ea61a std::thread::_State_impl<>::_M_run()
    @     0x7f7014f45e6f (unknown)
    @     0x7f701be54494 start_thread
    @     0x7f70146baacf clone
    @                0x0 (unknown)
Segmentation fault

Not sure why the debug didn't take, do I need to do anything more than passing that to cmake?

capveg commented 6 years ago

Sorry for the slow reply to this. I'm also surprised that the debug symbols didn't pop up for this - let me figure that out in parallel. Looking at the code and the error, I think the patch below may help. I'm trying to test it locally, but our ability to emulate a non-facebook setup inside facebook is limited :-(

If you get a chance to test this before I do, please let me know if it fixes the problem.

diff --git a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
--- a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
+++ b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
@@ -20,8 +20,9 @@

 std::shared_ptr<PortQueue> BcmControlPlaneQueueManager::getCurrentQueueSettings(
     cfg::StreamType /*streamType*/,
-    opennsl_cos_queue_t /*cosQ*/) const {
-  return std::shared_ptr<PortQueue>{};
+    opennsl_cos_queue_t cosQ) const {
+  // stub implementation - depends on newer OpenNSL
+  return std::make_shared<PortQueue>(cosQ);
 }
capveg commented 6 years ago

also, if you could please create a new issue for the qsfp_service segfault. Even if it is because you have a funky optic, it still shouldn't segfault. I won't promise I can fix it promptly, but still good to track.

Thank you again!

bluecmd commented 6 years ago

Done, #77 . I'm compiling with your patch right now - I had to pull my OpenNSL 3.5.0.1 changes as well, but I guess they should be orthogonal to this issue.

bluecmd commented 6 years ago

@capveg The patch seems to not work sadly, https://gist.github.com/bluecmd/bd16185170dff642de197e34349aa14c

I wish the stack trace could be more useful. Granted, that particular build I did not run with the cmake debug settings, I can try to re-do that and spend some time to see if I can get line numbers if you feel it would be the next logical step.

cubic1271 commented 6 years ago

Quick note to say that I'll echo this behavior (and this trace) on a Wedge 100 and using the master current as of around a month ago. Stack trace is similarly not very useful. I did try to apply the patch described above with no useful results.

In our case, we also don't have a functional FBOSS version to revert to, 'cause the infrastructure for a Wedge 100 FBOSS on Open Network Linux seems to be pretty broken at the moment (for quite a few other reasons). I will try a build of e440fcd0e1abfbcea5a08b18494d54bb78c561d6 and seeing if that gets us going, though.

arnauddorgans commented 3 years ago

Still have crash on startup on EventBase::bumpHandlingTime in 2021 https://github.com/facebook/flipper/issues/2577