Microchip-Ethernet / EVB-KSZ9477

Repository for using Microchip EVB-KSZ9477 board. Product Supported: KSZ9477, KSZ9567, KSZ9897, KSZ9896, KSZ8567, KSZ8565, KSZ9893, KSZ9563, KSZ8563, LAN9646, Phys(KSZ9031/9131, LAN8770

iperf crash is seen : "BUG : Bad page state in process.... " using ksz9477 on imx6Solox2 SOM, kernel 4.14 #77

Open parthshah3690 opened 2 years ago

parthshah3690 commented 2 years ago

Hi all, @triha2work, @micreladmin, @Ravi-Hegde @davidcai-micrel @jeghub @RaymondKim @Aryz @bnielsen1965

I need your help. I am using a KSZ9477 chip on custom hardware running Linux 4.14, connected to the FEC of an i.MX6 processor. I'm using fec_main.c and the FEC patch from this repository: https://github.com/Microchip-Ethernet/EVB-KSZ9477/tree/master/KSZ/linux-drivers/ksz9897/linux-4.14/drivers/net/ethernet/freescale

I have connected two of the custom boards on a LAN and can ping between them. [image]

To check network performance, I am using iperf3. But as soon as the client connects to the iperf3 server, I see a kernel crash. Log attached: iperfCrash.txt

When I disable the CONFIG_KSZ_PTP option, I do not see any crash. I tried TCP and UDP in different combinations, with the following results:

  1. With CONFIG_KSZ_PTP disabled, no crash is observed with either TCP or UDP. With UDP at 1000 Mbps (bandwidth set with the -b option) I observed 3.9% packet loss, and with TCP a few retransmissions.
  2. With CONFIG_KSZ_PTP enabled, no crash is observed with UDP at 1000 Mbps. With TCP, we increased the bandwidth from 1M upward and observed the crash only with 3M bandwidth. A few retransmissions are observed with 1M and 2M as well.
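To make the test matrix above repeatable, the iperf3 client invocations can be generated from a small helper. This is only a sketch: the server address `192.0.2.10`, the 10-second duration, and the helper name are assumptions for illustration, not values from the report. Only the standard iperf3 client options `-c`, `-u`, `-b` and `-t` are used.

```python
def iperf3_client_cmd(server, udp=False, bandwidth=None, seconds=10):
    """Build an iperf3 client command line as an argument list."""
    cmd = ["iperf3", "-c", server, "-t", str(seconds)]
    if udp:
        cmd.append("-u")           # UDP instead of the default TCP
    if bandwidth:
        cmd += ["-b", bandwidth]   # target bandwidth, e.g. "3M" or "1000M"
    return cmd


SERVER = "192.0.2.10"  # hypothetical address of the board running "iperf3 -s"

# TCP at 1M, 2M, 3M, then UDP at 1000M, matching the combinations described above.
RUNS = [iperf3_client_cmd(SERVER, bandwidth=b) for b in ("1M", "2M", "3M")]
RUNS.append(iperf3_client_cmd(SERVER, udp=True, bandwidth="1000M"))

if __name__ == "__main__":
    for run in RUNS:
        print(" ".join(run))
```

Running each printed command once with CONFIG_KSZ_PTP enabled and once with it disabled gives a like-for-like comparison of the two kernel builds.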

I first suspected a RAM issue, but sufficient RAM was available before and during the crash.

[ 395.007282] BUG: Bad page state in process swapper/0 pfn:86baf
[ 395.013245] page:3186c4d7 count:-1 mapcount:0 mapping: (null) index:0x0
[ 395.019970] flags: 0x0()
[ 395.022533] raw: 00000000 00000000 00000000 ffffffff ffffffff 00000000 9fb2e5f4 00000000
[ 395.030635] page dumped because: nonzero _count
[ 395.035175] Modules linked in: cywdhd(O) mxc_dcic evbug
[ 395.040450] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.14.200+g20245046a7a0 #1
[ 395.048990] Hardware name: Freescale i.MX6 SoloX (Device Tree)
[ 395.054876] [<8010f2ec>] (unwind_backtrace) from [<8010ac4c>] (show_stack+0x10/0x14)
[ 395.062650] [<8010ac4c>] (show_stack) from [<80a931a4>] (dump_stack+0x84/0x98)
[ 395.069909] [<80a931a4>] (dump_stack) from [<801cd134>] (bad_page+0x114/0x144)
[ 395.077164] [<801cd134>] (bad_page) from [<801cf6b0>] (get_page_from_freelist+0x320/0x8ec)
[ 395.085461] [<801cf6b0>] (get_page_from_freelist) from [<801d0370>] (__alloc_pages_nodemask+0xd8/0xc68)
[ 395.094881] [<801d0370>] (__alloc_pages_nodemask) from [<801d0fbc>] (page_frag_alloc+0x5c/0x150)
[ 395.103695] [<801d0fbc>] (page_frag_alloc) from [<8088c298>] (__netdev_alloc_skb+0xb8/0x118)
[ 395.112160] [<8088c298>] (__netdev_alloc_skb) from [<806289c8>] (fec_enet_rx_napi+0x284/0xcd8)
[ 395.120803] [<806289c8>] (fec_enet_rx_napi) from [<8089ffb0>] (net_rx_action+0x11c/0x314)
[ 395.129006] [<8089ffb0>] (net_rx_action) from [<801015e0>] (__do_softirq+0xd8/0x230)
[ 395.136779] [<801015e0>] (__do_softirq) from [<801307b0>] (irq_exit+0xbc/0x104)
[ 395.144127] [<801307b0>] (irq_exit) from [<8016c2f4>] (__handle_domain_irq+0x80/0xe8)
[ 395.151984] [<8016c2f4>] (__handle_domain_irq) from [<801014c4>] (gic_handle_irq+0x4c/0x90)
[ 395.160356] [<801014c4>] (gic_handle_irq) from [<8010b98c>] (__irq_svc+0x6c/0xa8)
[ 395.167854] Exception stack(0x81001f40 to 0x81001f88)
[ 395.172930] 1f40: 00000000 80e04044 1eaa8000 80118060 81000000 81003db8 81003d6c 8107a000
[ 395.181130] 1f60: 81003d40 81003d40 00000001 80f6ba30 00000001 81001f90 8010811c 80108120
[ 395.189320] 1f80: 60000013 ffffffff
[ 395.192842] [<8010b98c>] (__irq_svc) from [<80108120>] (arch_cpu_idle+0x38/0x3c)
[ 395.200275] [<80108120>] (arch_cpu_idle) from [<80160cec>] (do_idle+0xb8/0x138)
[ 395.207613] [<80160cec>] (do_idle) from [<80161014>] (cpu_startup_entry+0x18/0x1c)
[ 395.215213] [<80161014>] (cpu_startup_entry) from [<80f00c68>] (start_kernel+0x39c/0x3b0)
[ 395.223406] Disabling lock debugging due to kernel taint
[ 395.229733] BUG: Bad page state in process swapper/0 pfn:86bdf
[ 395.235691] page:041752be count:-1 mapcount:0 mapping: (null) index:0x0
[ 395.242415] flags: 0x0()
[ 395.244977] raw: 00000000 00000000 00000000 ffffffff ffffffff 9fb2ebf4 9fb2ebf4 00000000
[ 395.253079] page dumped because: nonzero _count
[ 395.257618] Modules linked in: cywdhd(O) mxc_dcic evbug
[ 395.262892] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B O 4.14.200+g20245046a7a0 #1
[ 395.271431] Hardware name: Freescale i.MX6 SoloX (Device Tree)
[ 395.277311] [<8010f2ec>] (unwind_backtrace) from [<8010ac4c>] (show_stack+0x10/0x14)
[ 395.285082] [<8010ac4c>] (show_stack) from [<80a931a4>] (dump_stack+0x84/0x98)
[ 395.292339] [<80a931a4>] (dump_stack) from [<801cd134>] (bad_page+0x114/0x144)
[ 395.299593] [<801cd134>] (bad_page) from [<801cf6b0>] (get_page_from_freelist+0x320/0x8ec)
[ 395.307889] [<801cf6b0>] (get_page_from_freelist) from [<801d0370>] (__alloc_pages_nodemask+0xd8/0xc68)
[ 395.317309] [<801d0370>] (__alloc_pages_nodemask) from [<801d0fbc>] (page_frag_alloc+0x5c/0x150)
[ 395.326124] [<801d0fbc>] (page_frag_alloc) from [<8088c298>] (__netdev_alloc_skb+0xb8/0x118)
[ 395.334590] [<8088c298>] (__netdev_alloc_skb) from [<806289c8>] (fec_enet_rx_napi+0x284/0xcd8)
[ 395.343232] [<806289c8>] (fec_enet_rx_napi) from [<8089ffb0>] (net_rx_action+0x11c/0x314)
[ 395.351436] [<8089ffb0>] (net_rx_action) from [<801015e0>] (__do_softirq+0xd8/0x230)
[ 395.359209] [<801015e0>] (__do_softirq) from [<801307b0>] (irq_exit+0xbc/0x104)
[ 395.366556] [<801307b0>] (irq_exit) from [<8016c2f4>] (__handle_domain_irq+0x80/0xe8)
[ 395.374417] [<8016c2f4>] (__handle_domain_irq) from [<801014c4>] (gic_handle_irq+0x4c/0x90)
[ 395.382789] [<801014c4>] (gic_handle_irq) from [<8010b98c>] (__irq_svc+0x6c/0xa8)
[ 395.390284] Exception stack(0x81001f40 to 0x81001f88)
[ 395.395360] 1f40: 00000000 80e04044 1eaa8000 80118060 81000000 81003db8 81003d6c 8107a000
[ 395.403558] 1f60: 81003d40 81003d40 00000001 80f6ba30 00000001 81001f90 8010811c 80108120
[ 395.411748] 1f80: 60000013 ffffffff
[ 395.415267] [<8010b98c>] (__irq_svc) from [<80108120>] (arch_cpu_idle+0x38/0x3c)
[ 395.422698] [<80108120>] (arch_cpu_idle) from [<80160cec>] (do_idle+0xb8/0x138)
[ 395.430035] [<80160cec>] (do_idle) from [<80161014>] (cpu_startup_entry+0x18/0x1c)
[ 395.437635] [<80161014>] (cpu_startup_entry) from [<80f00c68>] (start_kernel+0x39c/0x3b0)
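Both reports in the log show `count:-1` on a page that the allocator was about to hand out from the free list, reached from `fec_enet_rx_napi` while allocating a receive skb. A reference count of -1 is consistent with a page being released once more than it was acquired, although the log alone does not show where. A rough sketch for pulling the page frame number, count, and caller chain out of a saved dmesg capture follows. The function name and the returned dictionary layout are illustrative assumptions, and the regular expressions are written against the log format shown in this post only.

```python
import re

# Patterns written against the "Bad page state" report format seen in this thread.
BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
COUNT_RE = re.compile(r"page:[0-9a-f]+ count:(-?\d+)")
CALLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\)")


def summarize_bad_pages(log):
    """Return one summary dict per 'Bad page state' report found in log."""
    starts = [m.start() for m in BUG_RE.finditer(log)]
    reports = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(log)
        chunk = log[start:end]
        bug = BUG_RE.search(chunk)
        count = COUNT_RE.search(chunk)
        reports.append({
            "process": bug.group(1),
            "pfn": bug.group(2),
            "count": int(count.group(1)) if count else None,
            # Functions printed with an offset, innermost caller first.
            "callers": CALLER_RE.findall(chunk),
        })
    return reports


SAMPLE = (
    "[ 395.007282] BUG: Bad page state in process swapper/0 pfn:86baf "
    "[ 395.013245] page:3186c4d7 count:-1 mapcount:0 mapping: (null) index:0x0 "
    "[ 395.103695] [<801d0fbc>] (page_frag_alloc) from [<8088c298>] (__netdev_alloc_skb+0xb8/0x118) "
    "[ 395.112160] [<8088c298>] (__netdev_alloc_skb) from [<806289c8>] (fec_enet_rx_napi+0x284/0xcd8) "
)

if __name__ == "__main__":
    for report in summarize_bad_pages(SAMPLE):
        print(report)
```

Comparing the `pfn` values and caller chains across several crashes can show whether the bad pages always surface on the same receive path.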

Do you have a solution or any idea for the problem?

jeghub commented 2 years ago

Hi,

To answer your question on #74 and #77: I have no solution, sorry. Like you, I had crashes ONLY when enabling features like PTP, STP... so I disabled all of them.

I do not need PTP, but I will need to use STP (#77) or multi_dev = 1 mode (#70). I have not had time to find a solution. I'm now using multi_dev = 0 mode with STP disabled because it was not our priority, but I will have to find a solution soon. If you find something, please post your solution. I will do the same when I work on it again.

triha2work commented 2 years ago

Try disabling the F_SG feature in the MAC driver to verify the problem. This greatly reduces TCP transmit performance, but we want to debug the problem first. Then we probably need to use the copy mechanism in the updated 5.4 driver.

parthshah3690 commented 2 years ago

Hi @triha2work ,

I have disabled the F_SG feature (NETIF_F_SG, a hardware feature) in the FEC driver, but I am still getting the same crash.

romatou18 commented 2 years ago

Hi @parthshah3690,

Would you mind posting your dts file with the KSZ9477 SPI or I2C configuration, either on gist.com or here?

Kind Regards

parthshah3690 commented 2 years ago

Hi @romatou18

Please find the reference dts file attached.

DTS.txt

triha2work commented 2 years ago

Please try the 5.4 driver. It should be compatible with 5.3.

parthshah3690 commented 2 years ago

Hi @triha2work, I am using kernel 4.14. In the past I tried porting the drivers from 5.4 to 4.14, but I got porting errors.

jeghub commented 2 years ago

Hi @parthshah3690, have you found a way to correct this issue?

I'm running the iperf test the same way you did in your first post: the server is on my custom board and the client on a computer. With CONFIG_KSZ_PTP disabled, I have no errors with TCP and a bandwidth of about 900 Mbits/s. With UDP, however, I observed about 90% packet loss for the 1000M test (option -b1G). [image]

On eth0 I can see a lot of packet errors, all of them overruns. [image]

I'm using iperf 3.1.3 ; For udp the command is iperf3 -c 192.168.3.182 -u -b1G
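The overrun errors mentioned above can be tracked numerically by reading the per-interface counters that Linux exposes under `/sys/class/net/<iface>/statistics/` before and after an iperf3 run. The sketch below is an assumption-laden helper, not part of any driver: the function names are made up for illustration, and which of these counters the FEC driver actually increments on an overrun should be verified on the target.

```python
from pathlib import Path

# Standard receive-side counters found in /sys/class/net/<iface>/statistics/.
COUNTERS = ("rx_errors", "rx_over_errors", "rx_fifo_errors", "rx_dropped")


def read_rx_counters(iface="eth0", sysfs_root="/sys/class/net"):
    """Read the receive error counters of one interface into a dict."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def counter_delta(before, after):
    """Difference of two snapshots, for counters present in both."""
    return {name: after[name] - before[name] for name in after if name in before}


if __name__ == "__main__":
    before = read_rx_counters("eth0")
    input("Run the iperf3 test now, then press Enter...")
    print(counter_delta(before, read_rx_counters("eth0")))
```

A large delta in the overrun-related counters during the UDP run would point at frames arriving faster than the receive path drains them, which fits the high loss seen at `-b1G`.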

parthshah3690 commented 2 years ago

Hi @jeghub, no solution is available. I migrated to Linux 5.4 and took the KSZ driver from there. It still has a few problems with buffers.

jeghub commented 2 years ago

Thank you for your quick answer @parthshah3690

Do you think it was worth migrating to 5.4? Do you still have the same issues when PTP is enabled? I was thinking of migrating but am afraid of getting stuck with the same issues.

parthshah3690 commented 2 years ago

Hi @jeghub, if you migrate to Linux 5.4, you will not see the current error. But you will see warnings about the transmit queue filling up frequently. Please refer to: https://community.nxp.com/t5/i-MX-Processors/I-MX6SoloX2-Linux-5-4-70-2-3-0-transmit-queue-0-timed-out/td-p/1368528