Kernel Oops - BUG() getting called

satva commented 5 years ago

Hi ,

mcp2210 was trying to do simple register read (triggered by spi_nor_read_id) and resulted in kcrash as shown below. MCP2210 was trying to process command - 0x42 (SPI_DATA_TRANSFER) and hit this issue at repeat-count # 3. Going by the trace, assert(0) was triggered as variable "body_size" was > 0 but "body" was NULL. (can see in the extra prints below). Any pointers please as bit new to SPI itself of which scenario lands at this case !

(Kernel version - 4.1.35-rt41)

usb 1-1.1: process_commands: running command bd2a3380 again (repeat_count = 3)
***process_commands,1587***
***process_commands,1635***
***1***
usb 1-1.1: submit_urbs:
***submit_urbs,2359***
******spi_submit_prepare,657******
******mcp2210_init_msg,1094,6,4,CMD:42******
******mcp2210_init_msg,1104,6,4******
------------[ cut here ]------------
kernel BUG at buildroot/mcp2210_driver/mcp2210.h:1105!
Internal error: Oops - BUG: 0 [#1] SMP THUMB2
Modules linked in: mcp2210
CPU: 0 PID: 18 Comm: kworker/0:1 Not tainted 4.1.35-rt41 #198
Hardware name: Freescale LS1021A
Workqueue: events mcp2210_spi_probe_async [mcp2210]
task: bf1a9300 ti: bf20a000 task.ti: bf20a000
PC is at spi_submit_prepare+0x229/0x2d4 [mcp2210]
LR is at spi_submit_prepare+0x22a/0x2d4 [mcp2210]
pc : [<7f802fde>]    lr : [<7f802fdf>]    psr: 600e01b3
sp : bf20b958  ip : 0000001c  fp : bd1a6000
r10: bd2a3380  r9 : 00000004  r8 : bccef040
r7 : 00000000  r6 : 00000006  r5 : bd1a6000  r4 : bd2a3380
r3 : 00000001  r2 : 80c06404  r1 : 800e0193  r0 : 00000025
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment kernel
Control: 70c5387d  Table: bd22f400  DAC: fffffffd
Process kworker/0:1 (pid: 18, stack limit = 0xbf20a210)
Stack: (0xbf20b958 to 0xbf20c000)
b940:                                                       00000004 00000042
b960: 00000001 bd2a3380 bd1a6000 8004a847 7f807740 bf20b994 3ebe3000 802b8ee3
b980: 7f807740 bd1a6000 00000000 7f8094c8 00000001 00000000 00000001 7f80082f
b9a0: bd2a3380 00000003 7f80864f 7f8060e3 00000001 bccef040 00000002 bd1a6000
b9c0: 00000000 bd2a3380 00000001 00000000 600e0193 7f806188 00000002 7f800e6d
b9e0: bd2a3400 01020000 00204000 7f800b79 bd2a3400 000e0193 00000000 00000000
ba00: 00000001 bd2a3400 bd21a000 801e9b75 bf02c000 bd2e4bc0 bd7fea30 8020018b
ba20: bd21a0f4 bf20ba54 00000000 8017b387 bd21a0f4 bccda000 00000001 00000000
ba40: 00000003 8001c275 bf811840 bd7ec130 00000001 00000000 00000000 bf7d9880
ba60: bf1a9348 bf02c000 bd7fea30 bf20bc28 00000031 00000000 00000000 00000000
ba80: bf20a000 802005b5 80bfa140 bf02c02c bd7fe9e0 80bf6e40 bf7d6570 bd006640
baa0: 00000031 bf20bc28 00000031 00000000 00000000 00000000 bf20a000 801e9523
bac0: 801e9505 8004b501 80bf6e00 800563d3 bf1309c0 80025140 00000006 bf1309c0
bae0: bf130a28 bf20bc28 00000031 bf00c800 00000001 00000000 bf10d549 8004b651
bb00: bf1309c0 80c16140 bf20bc28 8004d73d 8004d6d5 00000031 80bf4cb8 8004af25
bb20: 00000000 8004b119 bf20bb58 c0802000 80bfc0e0 bf20bb58 bf20bb8c 80c43ac0
bb40: 80bfa080 8000927f 80025140 400e0133 ffffffff 802bce9b 00000001 00000008
bb60: 00000000 00000000 bf20a000 80bfa140 00000002 00000011 80c43ac0 80bfa080
bb80: 00000000 bf10d549 00000000 bf20bba0 800250fb 80025140 400e0133 ffffffff
bba0: 00000006 bf7dc440 04208060 bf20bba0 0000000a 00000011 ffff95f4 00000000
bbc0: 80bfa140 80226d85 80226d61 00000000 80bf4cb8 00000000 00000011 bf00c800
bbe0: 00000001 00000000 bf10d549 8002578b 00000000 8004b11d bf20bc28 c0802000
bc00: 80bfc0e0 bf20bc28 bf20bc5c bd2e0200 00000000 8000927f 802bc916 000e0033
bc20: ffffffff 802bce9b bcd0b99c 600e0013 80c06404 00000001 00000000 bcd0b99c
bc40: 600e0013 bf20bcc0 bd2e0200 00000000 00000000 bf10d549 00000018 bf20bc70
bc60: 801c0673 802bc916 000e0033 ffffffff 00000001 bd2e0200 bf20bcc0 bcd0b800
bc80: 801c0749 801c1639 00000000 00000000 bf20bc90 bf20bc90 3ebe3000 bf11a541
bca0: bf11a540 00000006 bf20bdba bd2e0200 bf20bd97 00000000 bf10d549 801c1775
bcc0: bf20bd24 bf20bd60 bd2e0200 00000000 801c05b1 bf20bc88 0000000e 00000001
bce0: ffffff8d 00000000 00000000 00000000 bf11a540 00000000 00000001 00000000
bd00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000802
bd20: 00b71b00 bf20bd60 bf20bcc0 00000000 bf11a541 00000006 00000000 00000000
bd40: 00000000 00000000 00000000 00000000 00000000 00000000 00000810 00b71b00
bd60: bf20bcc0 bf20bd24 803db877 bf20bd97 bd2e0200 bf20bdba 00000006 00000000
bd80: 803db8a9 801b9683 00000006 802b8ee3 803db877 9f20bda4 00000387 801b9651
bda0: 802ede44 bcd0b418 bd2e0390 801beee3 bcd0b418 800d0013 80c06404 00000001
bdc0: bcd0b418 802ede44 bd2e0200 801bf8e5 00000300 801847f7 bd2e033c 80184803
bde0: 801845cb 00000000 bd2e0200 bd2e0390 00000000 bcd0b418 00000000 80c69028
be00: bf10d549 801b9905 bd2e0200 00000001 80c251f0 bd2e0200 80c6904c 80c25200
be20: 0000000c 801c04f5 801c04bd bd2e0200 00000000 80182b25 00000000 bd2e0200
be40: 80182c0d bcd0b800 00000000 80181b7f bf10eb70 bd7da5b8 bd2e0200 80c28d5c
be60: bd2e0234 80182a55 bd2e0200 80c28d5c bd2e0200 80182445 bd2e0200 00000000
be80: bd2e0208 8018135d bd2e0200 800f0013 00000029 0000002a bd2e0200 00000000
bea0: bcd0b800 bcd0a068 bd1a6390 bd2e4840 bcd0b800 801c0aff 00000001 bd2e0200
bec0: bd1a6000 00000000 7f8094c8 7f80327d 80c55c80 80055bc5 bd7f5e10 bd1a6390
bee0: bd2e0390 bf10d400 00000004 bf1f5c80 bd2e4840 bf7d93c0 00000000 bf7dc800
bf00: bf7d93d4 00000000 00000008 80031089 bf1f5c80 bd2e4840 bf7dcc58 bf1f5c80
bf20: bf7d93c0 bf7d93c0 80bfa140 bf1f5c98 bf7d93d4 00000000 00000008 800315d5
bf40: bf1a9300 bf1ff240 00000000 bf1f5c80 800313e9 00000000 00000000 00000000
bf60: 00000000 800340c1 00000000 00000000 00000000 bf1f5c80 00000000 00000000
bf80: bf20bf80 bf20bf80 00000000 00000000 bf20bf90 bf20bf90 00000000 bf1ff240
bfa0: 80034031 00000000 00000000 80012b61 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[<7f802fde>] (spi_submit_prepare [mcp2210]) from [<7f80082f>] (process_commands+0x412/0x5a8 [mcp2210])
[<7f80082f>] (process_commands [mcp2210]) from [<7f800e6d>] (complete_urb+0x2f4/0x334 [mcp2210])
[<7f800e6d>] (complete_urb [mcp2210]) from [<801e9b75>] (__usb_hcd_giveback_urb+0x45/0x90)
[<801e9b75>] (__usb_hcd_giveback_urb) from [<8020018b>] (handle_tx_event+0x80f/0x844)
[<8020018b>] (handle_tx_event) from [<802005b5>] (xhci_irq+0x3f5/0x4f0)
[<802005b5>] (xhci_irq) from [<801e9523>] (usb_hcd_irq+0x1f/0x2a)
[<801e9523>] (usb_hcd_irq) from [<8004b501>] (handle_irq_event_percpu+0x6d/0x184)
[<8004b501>] (handle_irq_event_percpu) from [<8004b651>] (handle_irq_event+0x39/0x4c)
[<8004b651>] (handle_irq_event) from [<8004d73d>] (handle_fasteoi_irq+0x69/0xbc)
[<8004d73d>] (handle_fasteoi_irq) from [<8004af25>] (generic_handle_irq+0x19/0x22)
[<8004af25>] (generic_handle_irq) from [<8004b119>] (__handle_domain_irq+0x59/0x7c)
[<8004b119>] (__handle_domain_irq) from [<8000927f>] (gic_handle_irq+0x2f/0x48)
[<8000927f>] (gic_handle_irq) from [<802bce9b>] (__irq_svc+0x3b/0x5c)
Exception stack(0xbf20bb58 to 0xbf20bba0)
bb40:                                                       00000001 00000008
bb60: 00000000 00000000 bf20a000 80bfa140 00000002 00000011 80c43ac0 80bfa080
bb80: 00000000 bf10d549 00000000 bf20bba0 800250fb 80025140 400e0133 ffffffff
[<802bce9b>] (__irq_svc) from [<80025140>] (__do_softirq+0x5c/0x218)
[<80025140>] (__do_softirq) from [<8002578b>] (irq_exit+0x53/0xa4)
[<8002578b>] (irq_exit) from [<8004b11d>] (__handle_domain_irq+0x5d/0x7c)
[<8004b11d>] (__handle_domain_irq) from [<8000927f>] (gic_handle_irq+0x2f/0x48)
[<8000927f>] (gic_handle_irq) from [<802bce9b>] (__irq_svc+0x3b/0x5c)
Exception stack(0xbf20bc28 to 0xbf20bc70)
bc20:                   bcd0b99c 600e0013 80c06404 00000001 00000000 bcd0b99c
bc40: 600e0013 bf20bcc0 bd2e0200 00000000 00000000 bf10d549 00000018 bf20bc70
bc60: 801c0673 802bc916 000e0033 ffffffff
[<802bce9b>] (__irq_svc) from [<802bc916>] (_raw_spin_unlock_irqrestore+0x1c/0x1e)
[<802bc916>] (_raw_spin_unlock_irqrestore) from [<801c0673>] (spi_async_locked+0x41/0x4e)
[<801c0673>] (spi_async_locked) from [<801c1639>] (__spi_sync+0xef/0x17e)
[<801c1639>] (__spi_sync) from [<801c1775>] (spi_write_then_read+0xa3/0xfe)
[<801c1775>] (spi_write_then_read) from [<801b9683>] (m25p80_read_reg+0x33/0x64)
[<801b9683>] (m25p80_read_reg) from [<801beee3>] (spi_nor_read_id+0x27/0xc0)
[<801beee3>] (spi_nor_read_id) from [<801bf8e5>] (spi_nor_scan+0xc9/0x4c4)
[<801bf8e5>] (spi_nor_scan) from [<801b9905>] (m25p_probe+0x8d/0xc8)
[<801b9905>] (m25p_probe) from [<801c04f5>] (spi_drv_probe+0x39/0x48)
[<801c04f5>] (spi_drv_probe) from [<80182b25>] (driver_probe_device+0x9d/0x184)
[<80182b25>] (driver_probe_device) from [<80181b7f>] (bus_for_each_drv+0x47/0x52)
[<80181b7f>] (bus_for_each_drv) from [<80182a55>] (device_attach+0x41/0x58)
[<80182a55>] (device_attach) from [<80182445>] (bus_probe_device+0x1d/0x64)
[<80182445>] (bus_probe_device) from [<8018135d>] (device_add+0x2b9/0x354)
[<8018135d>] (device_add) from [<801c0aff>] (spi_add_device+0x8f/0xcc)
[<801c0aff>] (spi_add_device) from [<7f80327d>] (mcp2210_spi_probe_async+0xf4/0x190 [mcp2210])
[<7f80327d>] (mcp2210_spi_probe_async [mcp2210]) from [<80031089>] (process_one_work+0x17d/0x2c4)
[<80031089>] (process_one_work) from [<800315d5>] (worker_thread+0x1ed/0x2dc)
[<800315d5>] (worker_thread) from [<800340c1>] (kthread+0x91/0xa0)
[<800340c1>] (kthread) from [<80012b61>] (ret_from_fork+0x11/0x30)
Code: 4929 482a f2b5 df73 (de02) f1d6
---[ end trace ad87c7587c4410f5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
---[ end Kernel panic - not syncing: Fatal exception in interrupt

daniel-santos commented 5 years ago

Hello. You're going to have to give a bit more specific information. I don't recognize the output messages with asterisks -- I presume these are printks you've added? If so, what do the numbers mean. Some other helpful things would be to configure the kernel with timestamps on printk and full debug information as well as all symbols (if you 're board has the memory). I hope this is easily reproducable and changing those doesn't "fix" it (meaning it's undefined behaviour somewhere).

You have either changed the sources or are using a commit other than the HEAD because I don't have a BUG_ON on line 1105 of mcp2210.h (I just pushed a change, but that's in the userspace tool.)

At the very least, I need to know the real line of code where it blew up and you've modified the driver. I suggest you push your modifications to a public repo somewhere if you want me to try to help.

daniel-santos commented 5 years ago

OK, so I have a sinking suspicion that you've forgotten to initialize the tx_buf of your struct spi_transfer. I just pushed a commit that better validates this -- I had validation prior, but it was only enabled if CONFIG_MCP2210_DEBUG was enabled and it was wrong anyway, because it only failed when both tx and rx buffers were NULL.

Do a pull and give it a try. You should get a BUG_ON(!xfer->tx_buf) in queue_msg this time. Please let me know.

satva commented 5 years ago

Thanks Daniel for the response.

Exact line of code triggering the K crash :

   printk(KERN_ERR "******%s,%d,%d,%d,CMD:%x******\n",__FUNCTION__,__LINE__,body_size,sizeof(body_size),cmd);
        if(__builtin_constant_p(body_size))
                BUILD_BUG_ON(body_size > sizeof(msg->body.raw));
        else
                BUG_ON(body_size > sizeof(msg->body.raw));
        if (body){
                printk(KERN_ERR "******%s,%d,%d,%d******\n",__FUNCTION__,__LINE__,body_size,sizeof(body_size));
                memcpy(msg->body.raw, body, body_size);
        }
        else if (body_size){
                printk(KERN_ERR "******%s,%d,%d,%d******\n",__FUNCTION__,__LINE__,body_size,sizeof(body_size));
                BUG();
                printk(KERN_ERR "******%s,%d******\n",__FUNCTION__,__LINE__);
        }

BUG() was getting triggered in the "else if (body_size)" condition happening in mcp2210_init_msg() This was happening even with CONFIG_MCP2210_DEBUG=y (& CONFIG_MCP2210=m)

Please let me know if I need to provide any more information !

daniel-santos commented 5 years ago

Thanks for the information. Please do a git fetch and rebase your changes on top of the new upstream master branch. I think you'll instead get a different BUG in queue_msg on line 414 of mcp2210-spi.c instead, this one: BUG_ON(!xfer->tx_buf). If I'm right then it will help you find your problem more easily, but if I'm wrong then let me know so I can look deeper.

daniel-santos commented 5 years ago

I just saw your second message. I guess you deleted it? Anyway, this is a BUG instead of returning an error code so that it gets discovered during development and not after deployment, since this shouldn't happen through any normal usage of the driver.

It's hard to always be certain how to treat each exceptional condition, but when it's certain that it can only happen due to a bug somewhere, then BUG/BUG_ON are appropriate. But if you look at queue_msg you'll see that it checks for a lot of conditions that it can't handle and returns and error code along with printing a message to the kernel log instead of a BUG_ON -- these are generally conditions that can happen through normal use, although there may be a few that I should also use BUG_ON for, like an empty transfer list, etc.

satva commented 5 years ago

Thanks Daniel for response. Reason to delete 2nd message was , thought to give myself more time to understand the context more better.

Basically, it was hitting BUG_ON(!xfer->tx_buf); at second iteration of linked-list. Trying to read , JEDEC ID of chip and hence two messages were appended to queue.

snippet from drivers/spi/spi.c (spi_write_then_read)

spi_message_init(&message);
        memset(x, 0, sizeof(x));
        if (n_tx) {
                x[0].len = n_tx;
                spi_message_add_tail(&x[0], &message);
        }
        if (n_rx) {
                x[1].len = n_rx;
                spi_message_add_tail(&x[1], &message);
        }

        memcpy(local_buf, txbuf, n_tx);
        x[0].tx_buf = local_buf;
        x[1].rx_buf = local_buf + n_tx;

Sending a read ID command of size 1 byte (addresss , here it is 0x9f) and expecting response of size 6 bytes (max size of JEDEC ID). Expecting txbuf not NULL on first iteration of linked list makes sense but on second iteration, it was NULL as expected and only rx_buf will be there. Not sure, what I am missing here !

Do the mcp2210 address spi_write_then_read() call !? Where first message was meant for transmit and second for to receive . Another point to mention was , message undergoes through __spi_validate as part of function flows & no error returns .

daniel-santos / mcp2210-linux

Kernel Oops - BUG() getting called #36