Open bluecmd opened 6 years ago
My working thesis is that https://github.com/facebook/fboss/blob/master/fboss/agent/hw/bcm/oss/BcmControlPlane.cpp is missing an init of queueManager_
EDIT: We've tried initing queueManager like this: https://github.com/dhtech/fboss/commit/837bf7f0922a66fbfee13dbbe044cd8b3b7a6c04 without success. Current thesis is that controlPlane in BcmSwitch might not be correctly set. initTables for example seem to set it, but I cannot find any call for the function.
Thanks for the report - can you confirm before the update everything was working correctly? (sorry if it's a dumb question - I'm trying not to sound too surprised :-)
Also, unfortunately it's not trivial for me to test the OSS internally, so if you could compile with debugging options (similar to what's described here: https://stackoverflow.com/questions/7724569/debug-vs-release-in-cmake), we can get a stack trace with line numbers and can make a better guess as to what broke. Thanks for the interest!
We were running at the stated commit, or at least one around that commit, in June and it certainly worked. The same switch was unboxed from storage and the only thing we did was:
1) Modify the config to set port speed 1) Realize we have to patch fboss to support that transceiver we want to use 1) Compile a new fboss from head 1) Install libsodium23 from Debian stretch backports as that seems to be a new dependency 1) Start fboss - boom
We can certainly compile a version at the previously stated commit and run with it to verify the bisection window, and we will change the config back to the one we ran successfully in June.
And we will get back to you with the result of the cmake -DCMAKE_BUILD_TYPE=Debug
run
This is the log for the startup based on:
dhtech@r1a0:~/fboss$ git status
HEAD detached at e440fcd
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: fboss/agent/ApplyThriftConfig.cpp
modified: getdeps.sh
no changes added to commit (use "git add" and/or "git commit -a")
dhtech@r1a0:~/fboss$ git diff
diff --git a/fboss/agent/ApplyThriftConfig.cpp b/fboss/agent/ApplyThriftConfig.cpp
index cc087e7..79f2bd8 100644
--- a/fboss/agent/ApplyThriftConfig.cpp
+++ b/fboss/agent/ApplyThriftConfig.cpp
@@ -328,7 +328,7 @@ shared_ptr<SwitchState> ThriftConfigApplier::run() {
// Make sure there is a one-to-one map between vlan and interface
// Remove this sanity check if multiple interfaces are allowed per vlans
auto& entry = vlanInterfaces_[vlanInfo.first];
- if (entry.interfaces.size() != 1) {
+ if (entry.interfaces.size() > 1) {
auto cpu_vlan = newState->getDefaultVlan();
if (vlanInfo.first != cpu_vlan) {
throw FbossError("Vlan ", vlanInfo.first, " refers to ",
diff --git a/getdeps.sh b/getdeps.sh
index ec60709..19f8fc2 100755
--- a/getdeps.sh
+++ b/getdeps.sh
@@ -117,9 +117,9 @@ NPROC=$(grep -c processor /proc/cpuinfo)
fi
# iproute2 v4.4.0
update https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git 7ca63aef7d1b0c808da0040c6b366ef7a61f38c1
- update https://github.com/facebook/folly.git
- update https://github.com/facebook/wangle.git
- update https://github.com/facebook/fbthrift.git
+ update https://github.com/facebook/folly.git v2018.06.04.00
+ update https://github.com/facebook/wangle.git v2018.06.04.00
+ update https://github.com/facebook/fbthrift.git v2018.06.04.00
update https://github.com/no1msd/mstch.git
update https://github.com/facebook/zstd.git
update https://github.com/google/googletest.git release-1.8.0
That should be the state where we ran it in June. This is the startup log:
dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 17:48:54.393641 23401 WedgeProductInfo.cpp:136] json parse error on line 0: expected json value
E0928 17:48:54.393896 23401 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0
Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
E0928 17:49:08.767557 23439 QsfpCache.cpp:166] Exception talking to qsfp_service: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0928 17:49:08.795991 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache
E0928 17:49:08.901234 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache
The transceiver error is due to I'm not having qsfp_service
running, I currently have a weird QSFP in which makes qsfp_service
crash, but I don't believe it's relevant here (and I'm remotely present so I cannot unplug it).
Building from master:
# docker run --name fboss_build_phase2 -v `pwd`:/tmp/code fboss/tmp /bin/sh -c "cd /tmp/code && mkdir -p build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. && make -j`grep -c processor /proc/cpuinfo`"
dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 18:20:48.462908 23573 WedgeProductInfo.cpp:140] json parse error on line 0: expected json value
E0928 18:20:48.463168 23573 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
I0928 18:20:48.507074 23573 Main.cpp:338] serving on localhost on port 5909
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0
Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
I0928 18:21:02.347016 23593 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
*** Aborted at 1538158862 (unix time) try "date -d @1538158862" if you are using GNU date ***
PC: @ 0x564a3693b2d4 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 23573 (TID 0x7f70048f3700) from PID 16; stack trace: ***
@ 0x7f701be5e0c0 (unknown)
@ 0x564a3693b2d4 std::unique_ptr<>::get()
@ 0x564a3693b2f2 std::unique_ptr<>::operator->()
@ 0x564a369366a8 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
@ 0x564a3692470a facebook::fboss::BcmSwitch::getColdBootSwitchState()
@ 0x564a36926ccc facebook::fboss::BcmSwitch::init()
@ 0x564a36aae464 facebook::fboss::SwSwitch::init()
@ 0x564a369e2a03 facebook::fboss::Initializer::initImpl()
@ 0x564a369e2322 facebook::fboss::Initializer::initThread()
@ 0x564a369eab32 _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
@ 0x564a369eaabf _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
@ 0x564a369eaa70 _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
@ 0x564a369eaa17 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
@ 0x564a369ea853 std::_Bind_simple<>::operator()()
@ 0x564a369ea61a std::thread::_State_impl<>::_M_run()
@ 0x7f7014f45e6f (unknown)
@ 0x7f701be54494 start_thread
@ 0x7f70146baacf clone
@ 0x0 (unknown)
Segmentation fault
Not sure why the debug didn't take, do I need to do anything more than passing that to cmake?
Sorry for the slow reply to this. I'm also surprised that the debug symbols didn't pop up for this - let me figure that out in parallel. Looking at the code and the error, I think the patch below may help. I'm trying to test it locally, but our ability to emulate a non-facebook setup inside facebook is limited :-(
If you get a chance to test this before I do, please let me know if it fixes the problem.
diff --git a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
--- a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
+++ b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
@@ -20,8 +20,9 @@
std::shared_ptr<PortQueue> BcmControlPlaneQueueManager::getCurrentQueueSettings(
cfg::StreamType /*streamType*/,
- opennsl_cos_queue_t /*cosQ*/) const {
- return std::shared_ptr<PortQueue>{};
+ opennsl_cos_queue_t cosQ) const {
+ // stub implementation - depends on newer OpenNSL
+ return std::make_shared<PortQueue>(cosQ);
}
also, if you could please create a new issue for the qsfp_service segfault. Even if it is because you have a funky optic, it still shouldn't segfault. I won't promise I can fix it promptly, but still good to track.
Thank you again!
Done, #77 . I'm compiling with your patch right now - I had to pull my OpenNSL 3.5.0.1 changes as well, but I guess they should be orthogonal to this issue.
@capveg The patch seems to not work sadly, https://gist.github.com/bluecmd/bd16185170dff642de197e34349aa14c
I wish the stack trace could be more useful. Granted, that particular build I did not run with the cmake debug settings, I can try to re-do that and spend some time to see if I can get line numbers if you feel it would be the next logical step.
Quick note to say that I'll echo this behavior (and this trace) on a Wedge 100 and using the master current as of around a month ago. Stack trace is similarly not very useful. I did try to apply the patch described above with no useful results.
In our case, we also don't have a functional FBOSS version to revert to, 'cause the infrastructure for a Wedge 100 FBOSS on Open Network Linux seems to be pretty broken at the moment (for quite a few other reasons). I will try a build of e440fcd0e1abfbcea5a08b18494d54bb78c561d6 and seeing if that gets us going, though.
Still have crash on startup on EventBase::bumpHandlingTime
in 2021 https://github.com/facebook/flipper/issues/2577
Hi,
We just updated from
e440fcd0e1abfbcea5a08b18494d54bb78c561d6
to current master. With the same configuration we now get:To me it looks like queueManager_ is not initialized by the time that https://github.com/facebook/fboss/commit/de9ad1ff2acc7a680eadb141c8786400c45027f0 introduced the call to getMulticastQueueSettings.
We will try to debug this, will post any findings.