Open bluecmd opened 6 years ago
Thank you for reporting this.
Could you please share more details about the crash dump, for example the stack trace? Also, which SDK was the FBOSS agent built against when the crash was observed? The post mentions a 'reported crash in getdeps.sh': which crash? Could you please share more details about it?
Hi,
I'm reporting the absence of a crash and recommending that you reconsider, or provide more details of, the crash referred to here: https://github.com/facebook/fboss/blob/346e30c2b84373cc8c674c5d2babd5b0876dc416/getdeps.sh#L118
Essentially "It works for us".
hi @bluecmd,
Are you sure you didn't have to do anything else to get opennsl 3.5.0.1 working? I can definitely believe that the crash in opennsl_pkt_alloc() has been fixed but there were a number of other changes that were required to get opennsl 3.5.0.1 working -- trivially, opennsl_driver_init()'s prototype changed -- see my changes here to at least get it to compile: https://github.com/facebook/fboss/pull/65
And even after it compiled, it was my experience that all of the packet forwarding was broken because the initialization process was quite different.
If you have it working, we'd definitely appreciate understanding how, because if we could update to OpenNSL 3.5.0.1, we could unlock a bunch of previously unreleased changes (e.g., ACLs) that depend on newer versions.
Please confirm and let us know - thanks as always for the interest!
Hi @capveg. I admit it's a bit sneaky, but if you click on "OpenNSL 3.5.0.1" in my report you get the diff of the patch, and you'll see the actual code changes that we did.
Since we have Wedges, graciously donated by FB, running ONL + FBOSS, we're more than happy to help you collect any data that you need to debug any issues, but as far as we've seen It Just Works(TM) with the somewhat trivial patch of essentially only changing the opennsl_driver_init call.
EDIT: Direct link to what I'm talking about here: https://github.com/dhtech/fboss/pull/4/files#diff-941e4fb204c29b957373093d97373880
EDIT x2: And we also needed to specify OPENNSL_CONFIG_FILE=/etc/config.wedge40 in the environment, of course.
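For anyone reproducing this, the environment setup can be sketched as below. This is a minimal sketch in Python, not how we actually start the agent: the helper name agent_environment is made up for illustration, the binary path in the comment is the one shown by ldd later in this thread, and no agent flags are shown because none are given here.

```python
import os


def agent_environment(config_file, base_env=None):
    """Return a copy of the environment with OPENNSL_CONFIG_FILE set.

    agent_environment is a hypothetical helper; only the variable name and
    the config path come from this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENNSL_CONFIG_FILE"] = config_file
    return env


env = agent_environment("/etc/config.wedge40", base_env={"PATH": "/usr/bin"})
print(env["OPENNSL_CONFIG_FILE"])

# Starting the agent with that environment would then look roughly like this
# (not executed here, and any required agent flags are omitted):
#   import subprocess
#   subprocess.run(["/usr/local/bin/wedge_agent"],
#                  env=agent_environment("/etc/config.wedge40"), check=True)
```

Building the environment as a copy keeps the variable scoped to the agent process instead of the whole shell session.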
Hmm... so your patch looks effectively identical to my patch... so I'm wondering why yours works. I saw in one of the comments there a "Status: not working" - can you clarify? Just because the FBOSS agent logs "sending lldp to X" doesn't necessarily mean it's happening. Are you seeing that packet received on the other side? Sorry if this seems pedantic - but we've been (admittedly, slowly) debugging this for a while...
The status: not working is my quest to downrate the serdes's to support 1G line rate (https://github.com/Broadcom-Switch/OpenNSL/issues/37).
No worries, I also wouldn't trust strangers on the internet. What I can give you in terms of proof is the neighbouring Cisco switch receiving the LLDP and accepting them:
Switch#show lldp neighbors
Capability codes:
    (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
    (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID           Local Intf     Hold-time  Capability      Port ID
wedge1              Te1/0/2        120        R               XE5

Total entries displayed: 1
Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?
EDIT:
dhtech@wedge1:~$ strings /lib/modules/4.14.48-OpenNetworkLinux/linux-kernel-bde.ko | grep OpenNSL | head -n1
/home/bluecmd/OpenNSL/sdk-6.5.12-gpl-modules/include/sal
Thanks for all the info.
The kernel API is fairly stable so I'm not surprised that the 3.5.0.1 kernel modules work for older versions of OpenNSL. I wouldn't run that way long term (it's definitely not a tested setup :-), but not surprised it works. We run a fairly new kernel internally... let me confirm some details with some other folks and see if we can come up with a theory.
In any case, glad to hear this is working for you.
Just to add more data to keep myself honest:
dhtech@wedge1:~$ ldd /usr/local/bin/wedge_agent | grep libopennsl
libopennsl.so.1 => /usr/local/lib/libopennsl.so.1 (0x00007fe4583cc000)
dhtech@wedge1:~$ sudo find / -name libopennsl.so.1 | xargs sha1sum
c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2 /usr/local/lib/libopennsl.so.1
That matches with the Dec-27 release that's current in https://github.com/Broadcom-Switch/OpenNSL/tree/master/bin/wedge. So I'm pretty sure I'm not messing up the versioning on my end.
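The same check can be scripted if you want to compare your own library against a known digest. This is a minimal sketch in Python using only hashlib; the function names are made up for illustration, and the digest in the usage comment is simply the one printed by sha1sum above.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA-1 equals the expected hex digest."""
    return sha1_of_file(path) == expected_hex.strip().lower()


# Usage on the switch (path and digest taken from the output above):
#   matches_expected("/usr/local/lib/libopennsl.so.1",
#                    "c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2")
```

This only proves that two files are byte-identical; it says nothing about which OpenNSL version the file claims to be.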
Can you provide any more information about the hacks? To compile OpenNSL 3.5.0.1 for the 4.14 kernel I have fixed pci_enable_msix, copy_to/from_user, and dev->trans_start = jiffies; but FBOSS is still having issues:
I1010 20:50:13.945568 4058 BcmSwitch.cpp:560] Initializing BcmSwitch for unit 0
Aborted at 1539204614 (unix time) try "date -d @1539204614" if you are using GNU date
PC: @ 0x560640774da2 std::unique_ptr<>::get()
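The abort banner suggests running date -d on the Unix timestamp. The same conversion in Python, as a small sketch (the function name is made up for illustration), shows how the abort lines up with the last log line:

```python
from datetime import datetime, timezone


def abort_time_utc(unix_seconds):
    """Convert the Unix timestamp from the abort banner to a UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


print(abort_time_utc(1539204614).isoformat())  # 2018-10-10T20:50:14+00:00
```

The last log line is stamped 1010 20:50:13.945568, so, assuming the switch clock is set to UTC, the abort happened within about a second of "Initializing BcmSwitch for unit 0".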
It has been a while since I hacked together the kernel modules, but https://github.com/dhtech/OpenNSL/commit/3e5a8afeaf82e4c0563b7c6f818e1c9a92dba989 + ONL 9 should be what we're running.
A notable thing is that we do not load the knet driver. I seem to recall that there was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.
Ah @sonoble, looking at the last line of your report you're probably hitting https://github.com/facebook/fboss/issues/74. Not sure without the full stack trace however.
You can try using our fork that is using FBOSS from May with some patches applied: https://github.com/dhtech/fboss if you need it up and running right now.
No one runs knet that I know of. Looks like your changes are the same as mine. I build the entire OpenNSL from source, so I just set the KERNEL_SRC and LINUX_UAPI_SPLIT="1".
I don't need FBOSS running right now, I was just trying to confirm that your patch worked for me on the 40's. I have been working on getting everything working on the 100S but in a totally different way, by removing the init from OpenNSL and having FBOSS handle it.
I will build your fboss and see if I can get it working.
Thank you!
I built your fboss + the modified OpenNSL and while everything is running, there are no interfaces at all using your config or mine. I will dig more into it later.
@bluecmd I don't see it in this thread, have you been able to confirm packets other than LLDP are passing? We have seen LLDP packets before but were unable to ping or send any different traffic between boxes.
Only LLDP so far as well as normal L2 switching.
Hi @bluecmd, I am able to confirm L2 and LLDP on the Wedge 100S but no L3 (packets are not making it to the CPU), so no routing protocols can be run. Can you check whether you can ping a port after assigning an IP to it? Thank you!
@sonoble Sure. Do you have any configuration to share to make the time commitment shorter on my part? Also, did this work on 6.4? We only use L2 stuff so I'm not very aware of the state of L3 in FBOSS.
Here is a generic one from Facebook: https://github.com/facebook/fboss/blob/master/fboss/agent/configs/sample3.json. You just need to add the IP in the correct area.
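Adding the address can be sketched as follows. This is a minimal sketch in Python, assuming the config has a top-level "interfaces" list whose entries carry "intfID" and "ipAddresses" fields, as in the stanza posted later in this thread; the helper name and the example values are illustrative and not taken from sample3.json.

```python
import ipaddress
import json


def add_interface_address(config, intf_id, cidr):
    """Append an address in CIDR form to the interface with the given intfID."""
    ipaddress.ip_interface(cidr)  # raises ValueError if the address is malformed
    for interface in config.get("interfaces", []):
        if interface.get("intfID") == intf_id:
            addresses = interface.setdefault("ipAddresses", [])
            if cidr not in addresses:
                addresses.append(cidr)
            return config
    raise KeyError(f"no interface with intfID {intf_id}")


# Illustrative config fragment; a real config has many more sections.
config = {"interfaces": [{"intfID": 10, "routerID": 0, "vlanID": 552}]}
add_interface_address(config, 10, "10.32.12.250/24")
print(json.dumps(config, indent=2))
```

Validating the address before writing it avoids chasing a typo in the config once the agent is already running.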
So these are my observations. This is with 3.5.0.1 and our FBOSS fork from May/June. We have never tried running this with the old FBOSS, so I have no idea if this is a regression, but here they are as requested by @sonoble.
I added an L3 interface like this:
"interfaces": [
  {
    "intfID": 10,
    "routerID": 0,
    "vlanID": 552,
    "ipAddresses": [
      "10.32.12.250/24"
    ]
  }
]
This configured an fboss10 interface that does see the incoming packets:
I1017 21:55:00.147141 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:00.920305 10744 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:00.920405 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:01.147235 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:01.245810 10746 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:01.246166 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:01.246147 10745 SwSwitch.cpp:787] preparing state update add pending entry 10.32.12.1
V1017 21:55:01.246336 10745 NeighborCacheImpl-defs.h:137] Adding pending entry for 10.32.12.1 on interface 10
I1017 21:55:01.246403 10745 SwSwitch.cpp:921] Updating state: old_gen=6 new_gen=7
V1017 21:55:01.246495 10745 BcmSwitch.cpp:1048] updating VLAN 552: 0 ports added, 0 ports removed
V1017 21:55:01.246592 10745 BcmHost.cpp:394] created BcmHost: 10.32.12.1@vrf0. new ref count: 1
V1017 21:55:01.246645 10745 BcmSwitch.cpp:1259] adding pending neighbor entry to 10.32.12.1
V1017 21:55:01.246701 10745 BcmHost.cpp:149] Host entry for BcmHost: 10.32.12.1@vrf0 does not have an egress, create one.
V1017 21:55:01.246842 10745 BcmEgress.cpp:145] programmed L3 egress object 100005 for to CPU on unit 0 for ip: 10.32.12.1 @ brcmif 0 flags 8392704 towards port 0
V1017 21:55:01.246900 10745 BcmHost.cpp:594] insert egress 100005 into egress map
V1017 21:55:01.246962 10745 BcmHost.cpp:131] Adding host entry for : 10.32.12.1
V1017 21:55:01.247110 10745 BcmHost.cpp:135] created L3 host object for BcmHost: 10.32.12.1@vrf0 @egress 100005
V1017 21:55:01.247167 10745 BcmHost.cpp:177] Updating egress 100005 from physical port 0 to physical port 0
V1017 21:55:01.247382 10748 QsfpCache.cpp:101] All 64 ports up to date
V1017 21:55:01.247386 10745 SwSwitch.cpp:970] Update state took 981us
I1017 21:55:02.147334 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:02.247273 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:02.256384 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:02.256596 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:03.147433 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:03.247602 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:03.280399 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:03.280599 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:04.147526 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:04.247937 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:04.281910 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:04.282104 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:05.147614 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:05.248259 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:05.296386 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:05.296606 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
V1017 21:55:05.921184 10744 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:05.921291 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:06.147709 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:06.248940 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:06.320411 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:06.320631 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
ICMP replies are also sent (seen with tcpdump on fboss10) but they never arrive at the pinger. The IP above is on the same subnet as the management network, so some ARP shortcuts can be taken there.
Using another IP address that is on its own subnet makes things break earlier. The fboss10 interface still shows some random IPv6 traffic that it captures, so packet capture works - however not much more than that.
FBOSS output:
V1017 22:09:51.115411 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 59
V1017 22:09:51.115458 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 60
V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1017 22:09:51.115864 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 63
V1017 22:09:51.115912 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 64
I1017 22:09:52.114615 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:53.114711 10992 FunctionScheduler.cpp:505] Now running updateStats
V1017 22:09:53.225328 10986 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 77.80.231.34
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
I1017 22:09:54.114811 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:55.114902 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:56.115000 10992 FunctionScheduler.cpp:505] Now running updateStats
Notice that it is sending out an ARP broadcast but never logs a "sendPacketOutOfPort" message. Following the code, that is because this path calls "sendPacketSwitched". See here.
Maybe sendPacketSwitched is broken while sendPacketOutOfPort works?
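To check this observation over a longer capture than the excerpt above, the log can be tallied with a short script. This is a rough sketch in Python: the regular expression is my reading of the line format seen in these pastes (level letter, MMDD, time, thread id, file:line, message), and the function and counter names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEFV])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<thread>\d+) (?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def summarize(lines):
    """Count ARP requests and sendPacketOutOfPort messages in agent log lines."""
    summary = Counter()
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        message = match.group("message")
        if message.startswith("sending ARP request"):
            summary["arp_request"] += 1
        elif message.startswith("sendPacketOutOfPort"):
            summary["send_out_of_port"] += 1
    return summary


sample = [
    "V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61",
    "V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552",
    "V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a",
]
counts = summarize(sample)
print(counts["arp_request"], counts["send_out_of_port"])
```

If ARP requests keep increasing while sendPacketOutOfPort only ever appears next to LLDP sends, that is consistent with ARP taking the sendPacketSwitched path, though it does not by itself show that the path is broken.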
Next steps to confirm that could be:
EDIT: I have a hypothesis that this might also be related to L1 errors; I'll debug a bit and update.
Update: Yes, it was an L1 error. Having fixed the cabling I can now see packets egressing as well. Ping doesn't work, but that is most likely FBOSS related.
I1019 12:05:26.602193 3250 FunctionScheduler.cpp:505] Now running updateStats
V1019 12:05:26.783669 3244 UnresolvedNhopsProber.cpp:53] Sending probe for unresolved next hop: 77.80.231.34
V1019 12:05:26.783776 3244 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
tcpdump on computer:
14:05:26.681010 ARP, Request who-has 77.80.231.34 tell 77.80.231.34, length 50
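One detail in that capture is that the who-has address and the tell address are the same, i.e. the switch is asking for its own interface address, matching the UnresolvedNhopsProber line in the log. A small sketch in Python for spotting such requests in tcpdump output follows; the regular expression only covers the line shape shown above, and the function name is made up for illustration.

```python
import re

# Covers lines shaped like the capture above; other tcpdump variants
# (for example with "(Broadcast)" after the target) are not handled.
ARP_REQUEST = re.compile(
    r"ARP, Request who-has (?P<target>\S+) tell (?P<sender>\S+),"
)


def is_self_probe(tcpdump_line):
    """Return True if an ARP request asks for the sender's own address."""
    match = ARP_REQUEST.search(tcpdump_line)
    return bool(match) and match.group("target") == match.group("sender")


line = "14:05:26.681010 ARP, Request who-has 77.80.231.34 tell 77.80.231.34, length 50"
print(is_self_probe(line))  # True
```

Whether this self-probe has anything to do with ping not working is not something I have verified.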
Hi,
We're currently running FBOSS with a naively updated OpenNSL 3.5.0.1.
Since the reported crash in getdeps.sh should occur in opennsl_pkt_alloc, we verified the upgrade by using LLDP: no crash was observed.
Using OpenNSL 3.5.0.1 allows using modern kernel drivers and configuring the OpenNSL BCM configuration, so upgrading to it would probably be interesting for a lot of folks.