acooks / tn40xx-driver

Linux driver for tn40xx from Tehuti Networks

Problem with nvidia Jetson TX2 board #4

Open wsarang opened 6 years ago

wsarang commented 6 years ago

I want to run an NVIDIA Jetson TX2 board with the tn40xx driver. The kernel driver builds fine, but when the network is connected, the driver crashes.

My system information is below.

Hardware information

board : NVIDIA Jetson TX2

Kernel information

Linux tegra-ubuntu 4.4.38 #47 SMP PREEMPT Fri Aug 24 16:57:06 KST 2018 aarch64 aarch64 aarch64 GNU/Linux

Git information

git remote : https://github.com/acooks/tn40xx-driver.git
branch     : release/tn40xx-001 
commit id  : c3b4acd011c749a7442c4ed1c0c4aa44cdd05a95

PCIe interface information

nvidia@tegra-ubuntu:~$ lspci -x
00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)
00: de 10 e5 10 06 00 10 00 a1 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
20: f0 ff 00 00 01 58 01 58 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 84 01 00 00

01:00.0 Ethernet controller: Tehuti Networks Ltd. TN9510 10GBase-T/NBASE-T Ethernet Adapter
00: c9 1f 25 40 00 00 10 00 00 00 00 02 00 00 00 00
10: 0c 00 00 58 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 c9 1f 15 30
30: 00 00 00 00 50 00 00 00 00 00 00 00 84 01 00 00
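
As a cross-check on the dump above, the ID fields can be read out by hand. This is only a minimal sketch in C, assuming the standard PCI type-0 header offsets (vendor at 0x00, device at 0x02, class at 0x0a, subsystem vendor at 0x2c, subsystem ID at 0x2e). The names tn9510_cfg and cfg_read16 are made up for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* First 64 bytes of config space for 01:00.0, copied from the lspci -x dump. */
static const uint8_t tn9510_cfg[64] = {
    0xc9, 0x1f, 0x25, 0x40, 0x00, 0x00, 0x10, 0x00,
    0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00,
    0x0c, 0x00, 0x00, 0x58, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0xc9, 0x1f, 0x15, 0x30,
    0x00, 0x00, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x84, 0x01, 0x00, 0x00,
};

/* Assumed standard type-0 header offsets. */
enum {
    CFG_VENDOR        = 0x00,
    CFG_DEVICE        = 0x02,
    CFG_CLASS         = 0x0a,
    CFG_SUBSYS_VENDOR = 0x2c,
    CFG_SUBSYS_ID     = 0x2e,
};

/* Hypothetical helper: read a little-endian 16-bit field from the dump. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned offset)
{
    assert(offset + 1 < 64);
    return (uint16_t)(cfg[offset] | (cfg[offset + 1] << 8));
}
```

Read this way, the four ID fields come out as 1fc9, 4025, 1fc9 and 3015, which agrees with the "1fc9:4025:1fc9:3015" line the driver prints in the kernel log.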

The kernel log at the time of the crash is below.

[ 2147.302751] Tehuti Network Driver, 0.3.6.16.1
[ 2147.307245] Supported phys : MV88X3120   QT2025 TLK10232 AQR105 MUSTANG
[ 2147.314745] tn40xx 0000:01:00.0: enabling device (0000 -> 0002)
[ 2147.320800] srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x0 mrrs 0x2
[ 2147.566598] PHY detected on port 1 ID=3A1B4A3 - AQR105 10Gbps 10GBase-T
[ 2155.234683] AQR105 FW ver: 2.b.e2
[ 2155.366908] fw 0xe
[ 2155.368973] eth1, Port A
[ 2155.370834] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 2155.377378] 1 1fc9:4025:1fc9:3015
[ 2155.380707] detected 1 cards, 1 loaded
[ 2155.484325] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 2155.490334] 8021q: adding VLAN 0 to HW filter on device eth1
[ 2159.528777] eth1 Link Up 1G
[ 2159.531729] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 2194.505184] BUG: Bad page state in process swapper/0  pfn:229f00
[ 2194.511272] page:ffffffbdc8a7c000 count:0 mapcount:0 mapping:          (null) index:0x0
[ 2194.519374] flags: 0x4000000000000200(arch_1)
[ 2194.523810] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[ 2194.530061] bad because of flags:
[ 2194.533417] flags: 0x200(arch_1)
[ 2194.536724] Modules linked in: tn40xx(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay mttcan can_dev bcmdhd pci_tegra bluedroid_pm
[ 2194.557568] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.4.38 #47
[ 2194.564812] Hardware name: quill (DT)
[ 2194.568485] Call trace:
[ 2194.570965] [<ffffffc000089398>] dump_backtrace+0x0/0xe8
[ 2194.576293] [<ffffffc000089494>] show_stack+0x14/0x20
[ 2194.581363] [<ffffffc00034c4d0>] dump_stack+0xa0/0xc8
[ 2194.586430] [<ffffffc00017b51c>] bad_page+0xcc/0x118
[ 2194.591411] [<ffffffc00017f834>] get_page_from_freelist+0xa84/0xa88
[ 2194.597691] [<ffffffc00017fb74>] __alloc_pages_nodemask+0x134/0x9d0
[ 2194.604002] [<ffffffbffcf4e40c>] bdx_rx_get_page+0x6c/0x1b0 [tn40xx]
[ 2194.610387] [<ffffffbffcf505e4>] _bdx_rx_alloc_buffers+0x234/0x4d8 [tn40xx]
[ 2194.617376] [<ffffffbffcf50d8c>] bdx_poll+0x504/0xaf8 [tn40xx]
[ 2194.623224] [<ffffffc0009f4280>] net_rx_action+0x1d0/0x340
[ 2194.628726] [<ffffffc0000a844c>] __do_softirq+0x124/0x350
[ 2194.634134] [<ffffffc0000a88f8>] irq_exit+0x88/0xe0
[ 2194.639026] [<ffffffc0000f65f0>] __handle_domain_irq+0x60/0xb8
[ 2194.644870] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[ 2194.650280] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 2194.655090] [<ffffffc00082d610>] cpuidle_enter+0x18/0x20
[ 2194.660412] [<ffffffc0000e9214>] call_cpuidle+0x24/0x50
[ 2194.665648] [<ffffffc0000e94b0>] cpu_startup_entry+0x270/0x340
[ 2194.671495] [<ffffffc000b8ead8>] rest_init+0x88/0x98
[ 2194.676477] [<ffffffc00114196c>] start_kernel+0x390/0x3a4
[ 2194.681883] [<0000000080b95000>] 0x80b95000
[ 2194.686128] Disabling lock debugging due to kernel taint
[ 2204.563044] BUG: Bad page state in process swapper/0  pfn:24f7c0
[ 2204.569134] page:ffffffbdc93df000 count:0 mapcount:0 mapping:          (null) index:0x0
[ 2204.577214] flags: 0x4000000000000200(arch_1)
[ 2204.581658] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[ 2204.587872] bad because of flags:
[ 2204.591221] flags: 0x200(arch_1)
[ 2204.594530] Modules linked in: tn40xx(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay mttcan can_dev bcmdhd pci_tegra bluedroid_pm
[ 2204.615353] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B      O    4.4.38 #47
[ 2204.622607] Hardware name: quill (DT)
[ 2204.626279] Call trace:
[ 2204.628757] [<ffffffc000089398>] dump_backtrace+0x0/0xe8
[ 2204.634087] [<ffffffc000089494>] show_stack+0x14/0x20
[ 2204.639158] [<ffffffc00034c4d0>] dump_stack+0xa0/0xc8
[ 2204.644225] [<ffffffc00017b51c>] bad_page+0xcc/0x118
[ 2204.649205] [<ffffffc00017f834>] get_page_from_freelist+0xa84/0xa88
[ 2204.655482] [<ffffffc00017fb74>] __alloc_pages_nodemask+0x134/0x9d0
[ 2204.661791] [<ffffffbffcf4e40c>] bdx_rx_get_page+0x6c/0x1b0 [tn40xx]
[ 2204.668177] [<ffffffbffcf505e4>] _bdx_rx_alloc_buffers+0x234/0x4d8 [tn40xx]
[ 2204.675167] [<ffffffbffcf50d8c>] bdx_poll+0x504/0xaf8 [tn40xx]
[ 2204.681016] [<ffffffc0009f4280>] net_rx_action+0x1d0/0x340
[ 2204.686514] [<ffffffc0000a844c>] __do_softirq+0x124/0x350
[ 2204.691922] [<ffffffc0000a88f8>] irq_exit+0x88/0xe0
[ 2204.696815] [<ffffffc0000f65f0>] __handle_domain_irq+0x60/0xb8
[ 2204.702656] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[ 2204.708065] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 2204.712874] [<ffffffc00082d610>] cpuidle_enter+0x18/0x20
[ 2204.718197] [<ffffffc0000e9214>] call_cpuidle+0x24/0x50
[ 2204.723433] [<ffffffc0000e94b0>] cpu_startup_entry+0x270/0x340
[ 2204.729279] [<ffffffc000b8ead8>] rest_init+0x88/0x98
[ 2204.734261] [<ffffffc00114196c>] start_kernel+0x390/0x3a4
[ 2204.739667] [<0000000080b95000>] 0x80b95000
[ 2216.473609] BUG: Bad page state in process swapper/0  pfn:220440
[ 2216.479712] page:ffffffbdc8811000 count:0 mapcount:0 mapping:          (null) index:0x0
[ 2216.487770] flags: 0x4000000000000200(arch_1)
[ 2216.492201] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[ 2216.498452] bad because of flags:
[ 2216.501807] flags: 0x200(arch_1)
[ 2216.505106] Modules linked in: tn40xx(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay mttcan can_dev bcmdhd pci_tegra bluedroid_pm
[ 2216.525920] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B      O    4.4.38 #47
[ 2216.533175] Hardware name: quill (DT)
[ 2216.536848] Call trace:
[ 2216.539325] [<ffffffc000089398>] dump_backtrace+0x0/0xe8
[ 2216.544659] [<ffffffc000089494>] show_stack+0x14/0x20
[ 2216.549731] [<ffffffc00034c4d0>] dump_stack+0xa0/0xc8
[ 2216.554798] [<ffffffc00017b51c>] bad_page+0xcc/0x118
[ 2216.559776] [<ffffffc00017f834>] get_page_from_freelist+0xa84/0xa88
[ 2216.566054] [<ffffffc00017fb74>] __alloc_pages_nodemask+0x134/0x9d0
[ 2216.572364] [<ffffffbffcf4e40c>] bdx_rx_get_page+0x6c/0x1b0 [tn40xx]
[ 2216.578749] [<ffffffbffcf505e4>] _bdx_rx_alloc_buffers+0x234/0x4d8 [tn40xx]
[ 2216.585739] [<ffffffbffcf50d8c>] bdx_poll+0x504/0xaf8 [tn40xx]
[ 2216.591588] [<ffffffc0009f4280>] net_rx_action+0x1d0/0x340
[ 2216.597087] [<ffffffc0000a844c>] __do_softirq+0x124/0x350
[ 2216.602496] [<ffffffc0000a88f8>] irq_exit+0x88/0xe0
[ 2216.607387] [<ffffffc0000f65f0>] __handle_domain_irq+0x60/0xb8
[ 2216.613227] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[ 2216.618638] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 2216.623447] [<ffffffc00082d610>] cpuidle_enter+0x18/0x20
[ 2216.628770] [<ffffffc0000e9214>] call_cpuidle+0x24/0x50
[ 2216.634006] [<ffffffc0000e94b0>] cpu_startup_entry+0x270/0x340
[ 2216.639852] [<ffffffc000b8ead8>] rest_init+0x88/0x98
[ 2216.644834] [<ffffffc00114196c>] start_kernel+0x390/0x3a4
[ 2216.650241] [<0000000080b95000>] 0x80b95000

pheff commented 6 years ago

I am not sure if you have already seen these postings related to the Nvidia Jetson:

https://devtalk.nvidia.com/default/topic/965204/jetson-tx1/10g-ethernet-for-jetson-tx1-using-pci-e-x4/4

https://devtalk.nvidia.com/default/topic/1032474/jetson-tx2/10g-ethernet-for-jetson-tx2-using-pci-e-x4/

https://devtalk.nvidia.com/default/topic/1017757/jetson-tx2/10g-ethernet-/

You'll probably get better answers on the nvidia devtalk forum than here regarding this issue.


acooks commented 6 years ago

Thanks for those links, Pat.

The Jetson posts all suggest that the issue is related to the SMMU and that the workaround is to disable the SMMU.

So here's a long-shot guess: if the Tehuti is somehow using the wrong requester ID in its PCI transactions, the SMMU will block its DMA. I've seen this before with Marvell's SATA controllers and worked around it with a driver quirk. If that wild guess is correct, we should be able to reproduce the issue on Intel hardware by enabling VT-d, and could then develop a workaround for it in the Intel IOMMU driver. I don't know about the Jetson SMMU.

pheff commented 6 years ago

For what it is worth, I am seeing the same "PAGE_FLAGS_CHECK_AT_PREP flag set" message on our custom Altera Cyclone V SoC board, which uses a TN4010 MAC with a Marvell 88x3310P PHY on the PCIe bus. I get the same result on the Altera Cyclone V SoC development board with a Trendnet NIC (TN4010 with 88x3310P PHY). I am running Linux 4.9.78 (the version suggested by Ley Foon Tan, the Altera Linux PCIe driver maintainer) and the latest tn40 driver version.

There are no problems while the interface is down or not linked, but as soon as it links and starts receiving packets, I start getting these messages.

I was going to try the suggestion of commenting out USE_PAGED_BUFFERS or defining RX_REUSE_PAGES to see if that makes any difference. I was also searching for something on the Altera DMA or PCIe side that might eliminate this issue, so I am looking for any suggestions here.

wsarang commented 6 years ago

Thanks all.

I am using JetPack 3.2.1 for nvidia TX2.

My solution is below:

diff --git a/tn40.h b/tn40.h
index 619234a..64dac0f 100644
--- a/tn40.h
+++ b/tn40.h
@@ -307,7 +307,7 @@ enum { IRQ_INTX, IRQ_MSI, IRQ_MSIX };
     ((coal) | ((coal_rc) << 15) | ((rxf_th) << 16) | ((pck_th) << 20))

 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 31)
-#define USE_PAGED_BUFFERS             1
+/*#define USE_PAGED_BUFFERS             1*/
 /*#define RX_REUSE_PAGES */
 #if defined(RX_REUSE_PAGES) && !defined(USE_PAGED_BUFFERS)
 #define USE_PAGED_BUFFERS

pheff commented 6 years ago

Thanks all for your comments.

I first tried turning on the RX_REUSE_PAGES define in tn40.h, and this cured the problem:

#define RX_REUSE_PAGES

You will also have to go into tn40.c and comment out the preprocessor guard around g_dbg for it to compile with RX_REUSE_PAGES, since the Tehuti driver writers don't seem to test their various compile-time defines to make sure they actually compile:

//#if defined(TN40_DEBUG)
int g_dbg=0;
//#endif

So I will try this for a while and put it through some sustained network load before declaring victory but it looks positive so far.

acooks commented 6 years ago

Ok, I was completely off the mark with the SMMU DMA idea. This issue is purely a software thing and it relates to the memory allocation in the receive path.

If you disable USE_PAGED_BUFFERS (like @wsarang did), then you'll be using the skb allocator. That is the slowest allocator, but simple and probably the most mature.

If you enable RX_REUSE_PAGES (like @pheff did), then the tn40xx driver will attempt to reuse pages, instead of asking the main Linux memory allocator for a fresh page all the time. This functionality was introduced in the 0.3.6.16 version (it's mentioned in the release_notes) and is obviously the least mature option at this stage.
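
To make the three receive-path configurations easier to compare, here is a small userspace sketch of how the two defines combine. It copies the #if from the tn40.h diff earlier in the thread, where RX_REUSE_PAGES forces USE_PAGED_BUFFERS back on; everything else (the enum, rx_alloc_mode, rx_mode_name, paged_buffers_enabled) is hypothetical and only for illustration. It models a header where USE_PAGED_BUFFERS is commented out and RX_REUSE_PAGES is defined.

```c
#include <assert.h>

/* Toggle these two lines to model each configuration from the thread. */
/* #define USE_PAGED_BUFFERS 1 */
#define RX_REUSE_PAGES

/* Same rule as in tn40.h: page reuse implies paged buffers. */
#if defined(RX_REUSE_PAGES) && !defined(USE_PAGED_BUFFERS)
#define USE_PAGED_BUFFERS
#endif

/* Hypothetical labels for the three receive allocation strategies. */
enum rx_mode { RX_MODE_SKB, RX_MODE_PAGED, RX_MODE_PAGED_REUSE };

/* Report which strategy the current set of defines selects. */
static enum rx_mode rx_alloc_mode(void)
{
#if defined(RX_REUSE_PAGES)
    return RX_MODE_PAGED_REUSE;   /* driver recycles its own pages */
#elif defined(USE_PAGED_BUFFERS)
    return RX_MODE_PAGED;         /* fresh page from the kernel each time */
#else
    return RX_MODE_SKB;           /* plain skb allocator */
#endif
}

/* Returns 1 if USE_PAGED_BUFFERS ended up defined after the #if above. */
static int paged_buffers_enabled(void)
{
#ifdef USE_PAGED_BUFFERS
    return 1;
#else
    return 0;
#endif
}

static const char *rx_mode_name(enum rx_mode mode)
{
    switch (mode) {
    case RX_MODE_SKB:         return "skb allocator";
    case RX_MODE_PAGED:       return "paged buffers, fresh pages";
    case RX_MODE_PAGED_REUSE: return "paged buffers, reused pages";
    }
    assert(!"unknown rx mode");
    return "?";
}
```

One consequence worth noting: if USE_PAGED_BUFFERS stays commented out and RX_REUSE_PAGES is then defined, the header re-enables USE_PAGED_BUFFERS, so the build ends up in the page-reuse mode rather than the skb mode.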

Whether it's a good idea for the tn40xx driver to take on the complexity of tracking and recycling pages is debatable.

The other issue is whether the page recycling is masking the problem by doing fewer allocations. The bug happens after the tn40xx driver loads, but I don't see a direct link between the tn40xx page allocations and the incorrect page flags (not yet, anyway). It looks more like the kernel's page allocator is saying, "here's a page, but while I was fetching it for you, I noticed that there's some unexpected/incorrect flags on it"
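
For anyone reading the log by hand: the kernel prints the full flags word (0x4000000000000200) and then the offending subset (0x200), and labels both as arch_1. A minimal sketch of that decoding in C follows, assuming the low-bit ordering of a 4.4 kernel's page flags. The table and the helper lowest_named_flag are illustrative only; the real layout depends on kernel version and config.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Names of the low page-flag bits, in the order assumed for a 4.4 kernel. */
static const char *const flag_names[] = {
    "locked", "error", "referenced", "uptodate", "dirty", "lru",
    "active", "slab", "owner_priv_1", "arch_1", "reserved",
};
enum { NUM_FLAG_NAMES = sizeof(flag_names) / sizeof(flag_names[0]) };

/* Hypothetical helper: name of the lowest set bit covered by the table,
 * or NULL if none of the named bits is set. */
static const char *lowest_named_flag(uint64_t flags)
{
    assert(NUM_FLAG_NAMES <= 64);
    for (int bit = 0; bit < NUM_FLAG_NAMES; bit++) {
        if (flags & (UINT64_C(1) << bit))
            return flag_names[bit];
    }
    return NULL;
}
```

With this table, 0x200 is bit 9, which is arch_1, matching the log. The high bits of the full word are not flag bits in this sketch and are ignored. As far as I know, arm and arm64 kernels of this era use PG_arch_1 as their "dcache clean" marker, which may be relevant given that both reports are on ARM boards, but nothing in this thread confirms that link.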

The backtrace is missing a piece of the call chain in the memory allocator (because those functions are static inline). The missing piece is this: get_page_from_freelist -> rmqueue -> rmqueue_pcplist -> __rmqueue_pcplist -> check_new_pcp -> check_new_page -> check_new_page_bad -> bad_page

This check is enabled by CONFIG_DEBUG_VM, so you could consider disabling that, or you could send the bug report to the upstream mm developers and see what they say.

Do you see any kernel oops or panic or crash after these "BUG: Bad page state..." messages?

pheff commented 6 years ago

Yes, it takes quite a bit of time to get to a kernel panic and hang after receiving lots of these messages but it does get to that point.

pheff commented 6 years ago

I spent a good portion of today running iperf tests (both TCP and UDP) at 1 Gb line speed to a PC on the same subnet, and got very good throughput rates on the Cyclone V ARM with no kernel issues with RX_REUSE_PAGES defined. I will follow up as I do more testing at 2.5/5/10 Gb rates.

wsarang commented 6 years ago

I also turned on RX_REUSE_PAGES today. The NVIDIA Jetson TX2 ran well for a long time (5 to 6 hours). Thanks all.

By the way, the kernel hung on shutdown:

INFO: rcu_preempt detected stalls on CPUs/tasks:
 0-...: (1 GPs behind) idle=abb/2/0 softirq=17044/17048 fqs=92
 (detected by 1, t=5464 jiffies, g=4398, c=4397, q=1)

After I simply added .shutdown = __exit_p(bdx_remove), shutdown works fine.

I hope it helps.
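
For readers unfamiliar with the .shutdown hook, here is a rough userspace mock of what the one-line change does. The struct definitions are stand-ins for the kernel's own types from <linux/pci.h>, MOCK_EXIT_P imitates __exit_p() as it behaves in a modular build (where it yields its argument), and mock_system_shutdown is a hypothetical stand-in for the PCI core. The thread does not explain why the hang goes away; presumably the device is quiesced before the system goes down, but that is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types. */
struct pci_dev {
    int quiesced;   /* mock state: 1 once the device has been torn down */
};

struct pci_driver {
    const char *name;
    void (*remove)(struct pci_dev *dev);
    void (*shutdown)(struct pci_dev *dev);
};

/* Mock of __exit_p() for a modular build, where it expands to x. */
#define MOCK_EXIT_P(x) (x)

/* Stand-in for the driver's bdx_remove(); the real one tears down the NIC. */
static void bdx_remove(struct pci_dev *dev)
{
    dev->quiesced = 1;
}

static struct pci_driver mock_tn40xx_driver = {
    .name     = "tn40xx",
    .remove   = bdx_remove,
    .shutdown = MOCK_EXIT_P(bdx_remove),   /* the one-line fix from the thread */
};

/* Mimics the core calling the hook at shutdown, if one is set.
 * Returns 1 if the mock device was quiesced, 0 otherwise. */
static int mock_system_shutdown(struct pci_driver *drv)
{
    struct pci_dev dev = { .quiesced = 0 };

    assert(drv != NULL);
    if (drv->shutdown)
        drv->shutdown(&dev);
    return dev.quiesced;
}
```

Without the .shutdown line the hook would be NULL in this mock and the device would be left untouched at shutdown, which is the situation the fix avoids.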

acooks commented 6 years ago

Version 0.3.6.17 was released on 10 October.

The .shutdown = __exit_p(bdx_remove) fix has been added, but RX_REUSE_PAGES is still disabled by default.