NXP / isochron

Tool for Time Sensitive Networking testing
GNU General Public License v2.0
802.1CB 2 Board #1

Closed jman-88 closed 3 years ago

jman-88 commented 4 years ago

Will there be an update to the 8021cb-2boards branch? I am trying to do a CB test with two LS1028A-RDB boards, but I can't get it going by modifying the 8021cb-devel branch.

vladimiroltean commented 4 years ago

On Thu, Sep 03, 2020 at 12:08:39AM -0700, JMan wrote:

> Will there be an update to the 8021cb-2boards branch? I am trying to do a CB test with two LS1028A-RDB boards, but I can't get it going by modifying the 8021cb-devel branch.

What do you need to modify? You should be able to just use the json files for board1 and board2 from the 8021cb-devel branch, and disregard the third. The connection would go from board2 swp0 to board1 swp1 and from board2 swp1 to board1 swp0. What is the issue you're facing when trying to do that?

Thanks, -Vladimir

jman-88 commented 4 years ago

Hi Vladimir

Thanks for the assistance.

I tried as you suggested first, but the boards fail to ping each other. I then went and modified the configs etc. to remove references to board 3, but that also did not work. I tried the provided config again with no luck.

I have swp0 to swp1 on both boards. I can see the IPs and MACs getting assigned to eno2 as expected, but I can't ping 172.15.0.2 from board 1 to board 2 or vice versa. Should there be any output from the script when it is run?

I'm wondering if it is my image build. I used buildroot from OpenIL. The only change I made was to remove linuxptp as it was failing to compile. This was on master branch so the CB support should be in (551cd14)?

Kind regards Jan

vladimiroltean commented 4 years ago

> The only change I made was to remove linuxptp as it was failing to compile.

Wow, that doesn't sound good. Let me build the openil master branch and double-check if it works.

vladimiroltean commented 4 years ago

So first of all, the linuxptp package built successfully for me just now, with the master branch at https://github.com/openil/openil/commit/65d832a2ef1479ee5a51e6990a5fcb61ff3dc71f. Could you share an error log?

vladimiroltean commented 4 years ago

Ok, so there was indeed a recently introduced incompatibility between the return code of a tsntool command, and what this script expected. I fixed it in here: https://github.com/vladimiroltean/tsn-scripts/commit/6e0e291b1df82613a6a3e9cd358682ad2f78d8cd I could ping fine with that 1-line change. Could you please give it another go?

jman-88 commented 4 years ago

I did a fresh build of master on a different machine. I get two separate build issues. First, I hit a "multiple definition of 'yylloc'" error at several points in the build, which I managed to work around. Second, there is the PTP build issue. I suspect the PTP build is picking up header files from my host system rather than from inside buildroot, but this is just a guess. I attached some logs for you.

uboot.txt host-dtc.txt linux.txt ptp.txt

The update to the tsn-scripts worked, but when I run iperf and disconnect the swp1 -> swp0 cable, the stream dies. When I disconnect swp0 -> swp1 the stream is fine. It appears that something is still missing. Do you get the same?

vladimiroltean commented 4 years ago

Yes, you are correct. I have fixed that. Please re-update the list of kernel patches and try again, it should work fine this time. I will send these patches to OpenIL as soon as possible.

Unfortunately, I didn't have time to study your compilation issues. What host distribution are you using?

jman-88 commented 4 years ago

Great. It is working now, but I'm observing some strange behavior. As I understand it, I should be able to disconnect either swp0 or swp1 and the stream should not be interrupted as long as one cable is connected. I observe the following:

Is this expected behavior or am I misunderstanding what CB should be able to do?

I'm on Manjaro Linux.

vladimiroltean commented 4 years ago

Have you updated all Linux kernel patches? What if you use ping instead of iperf3, do you ever run into this problem? Have you looked at the patches in detail, this one especially?

Subject: [PATCH 2/2] net: dsa: felix: implement port flushing on
 .phylink_mac_link_down

Especially with flow control enabled on both the user port and the CPU
port, it may happen when a link goes down that Ethernet packets are in
flight. In flow control mode, frames are held back and not dropped. When
there is enough traffic in flight (example: iperf3 TCP), then the
ingress port might enter congestion and never exit that state. This is a
problem, because it is the egress port's link that went down, and that
has caused the inability of the ingress port to send packets to any
other port.

The solution is to follow the port flushing procedure from the reference
manual. This ensures that upon detection of link loss, the existing
packets are thrown away and congestion on the ingress port is therefore
avoided.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Is this what's going on, I wonder? If you run ethtool -S swp4 | grep pause repeatedly (I just want to see if it keeps increasing) after iperf3 freezes and you stop it, what do you see? By any chance, has the port entered congestion and it keeps sending PAUSE frames to ENETC?
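
A minimal sketch of that repeated check, assuming the pause counters appear in `ethtool -S` output with "pause" in their name (as tx_pause does). The function names `sum_pause` and `watch_pause` are hypothetical and not part of tsn-scripts:

```shell
# Sum every counter whose name contains "pause" in `ethtool -S` output read from stdin.
sum_pause() {
    awk -F: '/pause/ { total += $2 } END { printf "%.0f\n", total }'
}

# Hypothetical helper: print a timestamped total once per second.
# A total that keeps growing after iperf3 has been stopped suggests the port
# is still congested and still emitting PAUSE frames.
watch_pause() {
    port="${1:-swp4}"
    samples="${2:-10}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s\n' "$(date +%T)" "$(ethtool -S "$port" | sum_pause)"
        sleep 1
        i=$((i + 1))
    done
}
```

Running `watch_pause swp4 10` after stopping iperf3 prints ten samples; a constant total means no further PAUSE frames are being counted.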

vladimiroltean commented 4 years ago

> I'm on Manjaro Linux.

I don't think OpenIL gets build-tested on that, sorry.

jman-88 commented 4 years ago

> Have you updated all Linux kernel patches?

Yes. I copied those in, removed output-ls1028ardb/build/linux-linux-5.4.y and built again.

> What if you use ping instead of iperf3, do you ever run into this problem?

I tested with ping now and did not manage to replicate the problem.

> Have you looked at the patches in detail, this one especially?

I did not look at it before, but this is probably it. I tested now and when the setup fails and I stop iperf, the tx_pause counter keeps on increasing. When I reconnect the cable, the tx_pause counter stops.

vladimiroltean commented 4 years ago

:( So it looks like I'm missing a case when I need to flush the packets on the egress port when its link goes down. Sorry, I have some other things I need to focus on right now, I'll come back to it at some point. To work around this issue, you can edit this patch: https://github.com/vladimiroltean/tsn-scripts/blob/8021cb-devel/deps/patches/linux/0001-arm64-dts-fsl-ls1028a-rdb-enable-swp5-and-move-NPI-p.patch and remove the "pause" properties from the fixed-link node of the ENETC and of the Felix switch. This will disable flow control on the eno2 <-> swp4 port pair, which in turn will mean that the packets are no longer buffered by the switch when the link falls, but will be dropped instead. Be warned though, the iperf3 throughput without flow control will be worse.
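
One way to confirm that the workaround took effect after rebuilding is to query the pause settings on both ends of the eno2 <-> swp4 link. This is only a sketch: it assumes the drivers report pause settings for these fixed-link ports through `ethtool -a`, and the helper names `pause_enabled` and `check_flow_control` are hypothetical.

```shell
# Succeed (exit 0) if `ethtool -a` output on stdin reports RX or TX pause as "on".
pause_enabled() {
    awk '/^(RX|TX):/ && $2 == "on" { found = 1 } END { exit !found }'
}

# Hypothetical helper: report the flow control state of the eno2 <-> swp4 pair.
check_flow_control() {
    for port in eno2 swp4; do
        if ethtool -a "$port" | pause_enabled; then
            echo "$port: flow control still enabled"
        else
            echo "$port: flow control disabled"
        fi
    done
}
```

If `ethtool -a` is not supported on a port, the helper reports "disabled" for it, so check the raw `ethtool -a` output as well before trusting the result.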

jman-88 commented 4 years ago

Okay. Thanks for the assistance. As long as there is a work around it is fine for now. Should I close this issue or keep it open?

vladimiroltean commented 4 years ago

Let me know if you see the lockups any longer with flow control disabled, and if that confirms my suspicion, you can close it and I'll return to this issue in a few days anyway. I'm trying to upstream the packet flushing logic and will make sure it works properly when I do that. How many times do you need to plug/unplug the cable? For me it worked 3-4 times, haven't tested further.

jman-88 commented 4 years ago

> How many times do you need to plug/unplug the cable?

Sometimes it breaks on the first disconnect, but usually it takes longer, so it's a bit random. Once I got up to about 20 plug/unplug cycles before it happened.

I'll test the work around and let you know.

jman-88 commented 4 years ago

With the workaround I'm getting the same behavior. I double-checked the device tree changes: ethtool -S swp4 | grep pause yields 0, and I can see decreased throughput with iperf. It would be good if someone else could confirm this.

I'll get back to this at some point as I have other things to deal with. I'll keep this open in case I forget.

vladimiroltean commented 4 years ago

Ok, so it isn't completely what I thought. I'll try to come back to it and see what's going on. Keep the ticket open, sure.

vladimiroltean commented 4 years ago

What ethtool counter is increasing now, if not tx_pause, though? Where are the packets dropped?
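
A rough way to answer that is to save `ethtool -S` output for a port before and after the stream breaks and print only the counters that changed. The sketch below assumes the usual `name: value` layout of `ethtool -S`; `changed_counters` is a hypothetical helper name.

```shell
# Print the counters whose value differs between two saved `ethtool -S` snapshots.
# Usage: changed_counters BEFORE_FILE AFTER_FILE
changed_counters() {
    awk -F: '
        # Split each line into a trimmed counter name and its value.
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name); val = $2; gsub(/[ \t]/, "", val) }
        # First file: remember the old value.
        NR == FNR { before[name] = val; next }
        # Second file: report counters that changed.
        (name in before) && before[name] != val { print name ": " before[name] " -> " val }
    ' "$1" "$2"
}
```

For example, run `ethtool -S swp4 > before.txt`, reproduce the failure, run `ethtool -S swp4 > after.txt`, then `changed_counters before.txt after.txt`. The same can be repeated for each interface along the path.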

jman-88 commented 4 years ago

That is a good question because I did not see anything funny going on with the counters of swp4 when the stream breaks.

vladimiroltean commented 4 years ago

So there are actually 6 places to check.

Packets must be dropped somewhere, right?

jman-88 commented 4 years ago

The port activity LEDs stop flashing when it breaks so it probably drops locally.

I'll do some more testing when I have time.

crow1814 commented 3 years ago

Hi, I am trying to verify TSN features on the LS1021A-TSN board. Two hosts are connected through the LS1021A-TSN board, just like fig27 in and they have been synchronized through ptp4l. However, I cannot send traffic from host1 with ‘isochron send --interface eth0 --dmac 00:04:9f:05:de:06 --priority 6 --vid 0 \ ······’: Ubuntu reports that the isochron command is not found. How do I install and use isochron, or otherwise send traffic with different priorities from host1?

vladimiroltean commented 3 years ago

When did you build the OpenIL image, and with which defconfig? In all defconfigs for the LS1021A-TSN board, the BR2_PACKAGE_QORIQ_TSN_SCRIPTS package is enabled.

crow1814 commented 3 years ago

I used the default configuration on the LS1021A-TSN board. What I would like to know is how to install and use isochron on host1 (Ubuntu 16.04).

vladimiroltean commented 3 years ago

git clone https://github.com/vladimiroltean/tsn-scripts.git
cd tsn-scripts
git checkout isochron
make -C isochron
./isochron/isochron --help

vladimiroltean commented 3 years ago

Hey @jman-88, FWIW I spent some time today and figured out what the problem was. I was right about its source, but the port flushing procedure was slightly incorrect and therefore didn't work until I fixed it. I have now unplugged the cable a few dozen times and traffic kept flowing. Sorry it took so long.