dnsmasq: Kernel unaligned instruction access

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
dnsmasq stops working after some amount of time
(it happened to me earlier too)

What is the expected output? What do you see instead?
Kernel unaligned instruction access[#1]:
Cpu 0
$ 0   : 00000000 10009c00 00000000 00800000
$ 4   : 00000000 fffffffc 00000000 00000000
$ 8   : 7f91ca74 0000bd36 001f0000 00000001
$12   : 80f8a9b8 00000002 006d6f63 00430000
$16   : 0000000c 802eb430 802eb1c0 00000008
$20   : 0000000a 802f6ec0 80981f30 802f6ec0
$24   : 00433c08 7709b630                  
$28   : 80980000 80981ef0 7f91cae0 004206ff
Hi    : ebab69dc
Lo    : 6e3a8687
epc   : 004206ff 0x4206fe     Tainted: P       
ra    : 004206ff 0x4206fe
Status: 10009c03    KERNEL EXL IE 
Cause : 00000010
BadVA : 004206ff
PrId  : 00029006
Modules linked in: sha256 aes dm_crypt dm_mod tun usb_storage sd_mod scsi_mod 
usblp uhci_hcd ehci_hcd usbcore nf_nat_ftp nf_conntrack_ftp pppol2tp pppox 
wl(P) et(P) igs(P) emf(P)
Process dnsmasq (pid: 588, threadinfo=80980000, task=80f8a7e0)
Stack : 00440988 0000000a 7f91cb74 7f91cb74 7f91cbf4 8000fcec 00000006 80f8a9b8
        00000010 8008a2e0 00000000 004409c8 00440a08 004409c8 7f91cae0 800089a0
        01200012 00000000 00000000 00000000 7f91ca74 7f91ca68 00000000 10009c00
        00441fc8 07ffffff 004206ff 004206ff 7f91cb28 00000000 00000000 80000010
        8008a238 fffffff0 00000002 00000002 006d6f63 00430000 00440a08 004409c8
        ...
Call Trace:
[<8000fcec>] do_cpu+0x198/0x3c4
[<8008a2e0>] sys_close+0xa8/0xd8
[<800089a0>] ret_from_exception+0x0/0x24
[<8008a238>] sys_close+0x0/0xd8

Code: 73656972  206f7420  206c6c61 <76726573> 2e737265  65730000  743c3a74  
2c3e6761  74706f3c 

What version of the product are you using?
r4525

Please provide any additional information below.

Device is wl500gP v1 bundled with BCM43222 card.

There are no additional options in the build, except that I removed vsftpd, libjpeg,
rcamd, samba, nfs, ipv6, snmp support, IGMPproxy, RIP/OSPF listener support,
LLTD responder, and NTFS-3G support.

Using SLUB allocator (it's default) + removed TCP Congestion.

also I have this line in my post-boot file:
echo 6500 > /proc/sys/vm/min_free_kbytes

I never experienced this issue on the older 4298 configuration. Could it be
related to the updated dnsmasq?

Original issue reported on code.google.com by spameden on 15 Aug 2012 at 2:45

GoogleCodeExporter commented 9 years ago

It looks like a compiler problem and/or a dnsmasq problem.

Does downgrading dnsmasq help?

Original comment by lly.dev on 15 Aug 2012 at 4:19

GoogleCodeExporter commented 9 years ago

I'm using hndtools-mipsel-uclibc-4.5.4-K26 from
http://code.google.com/p/wl500g/downloads/detail?name=hndtools-mipsel-uclibc-4.5.4-K26-x86_64-r4460.tar.bz2

I'm not sure - if it reproduces again I will report back with additional 
details.

Original comment by spameden on 15 Aug 2012 at 4:27

GoogleCodeExporter commented 9 years ago

Hi, it just happened again:

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc 
== 00000000, ra == 00000000
Oops[#2]:
Cpu 0
$ 0   : 00000000 10009c00 00000000 00800000
$ 4   : 00000000 fffffffc 00000000 00000000
$ 8   : 7fb80834 0000bd36 001f0000 00000001
$12   : 814b11d8 00000080 00000040 00430000
$16   : 00000000 c0b70e00 c0b70e00 00000008
$20   : 00465224 802f6ec0 80f01f30 802f6ec0
$24   : 00433c08 772ff630                  
$28   : 80f00000 80f01ef0 00000000 00000000
Hi    : 00000058
Lo    : 00465224
epc   : 00000000 0x0     Tainted: P      D
ra    : 00000000 0x0
Status: 10009c03    KERNEL EXL IE 
Cause : 00000008
BadVA : 00000000
PrId  : 00029006
Modules linked in: netconsole sha256 aes dm_crypt dm_mod tun usb_storage sd_mod 
scsi_mod usblp uhci_hcd ehci_hcd usbcore nf_nat_ftp nf_conntrack_ftp pppol2tp 
pppox wl(P) et(P) igs(P) emf(P)
Process dnsmasq (pid: 865, threadinfo=80f00000, task=814b1000)
Stack : 0000003c 00437a40 00000001 00430000 00430000 8000fcec 80f00000 814b11d8
        00430000 0040247c 00000000 00000058 00465224 00439190 00000000 800089a0
        01200012 00000000 00000000 00000000 7fb80a04 7fb809f8 00000000 10009c00
        00000000 00000000 00465224 00466932 00465224 00000000 ffffffa1 00000088
        00000005 00000001 000000c0 00000080 00000040 00430000 00465224 00439190
        ...
Call Trace:
[<8000fcec>] do_cpu+0x198/0x3c4
[<800089a0>] ret_from_exception+0x0/0x24

Code: (Bad address in epc)

Original comment by spameden on 18 Aug 2012 at 9:02

GoogleCodeExporter commented 9 years ago

Unfortunately, I can neither reproduce this bug nor find the code bytes from the
provided Oops inside the dnsmasq binary.

Try downgrading dnsmasq, and remove any router overclocking if it was done.
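
One way to read the `Code:` line of the first Oops is to treat the words as data rather than instructions. The sketch below is only an illustration in Python, not part of the firmware tooling. It assumes the words are stored little-endian, which matches the mipsel toolchain used for this build; the helper names `words_to_bytes` and `printable` are made up for the example.

```python
import struct

# Words copied from the "Code:" line of the first Oops in this thread.
# The word shown as <76726573> in the dump is the one at epc.
CODE_WORDS = (
    "73656972 206f7420 206c6c61 76726573 2e737265 "
    "65730000 743c3a74 2c3e6761 74706f3c"
)


def words_to_bytes(words: str) -> bytes:
    """Turn the printed 32-bit words back into memory order.

    ASSUMPTION: little-endian CPU (mipsel), so each word is packed with '<I'.
    """
    return b"".join(struct.pack("<I", int(w, 16)) for w in words.split())


def printable(data: bytes) -> str:
    """Render bytes as ASCII, replacing non-printable bytes with '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


text = printable(words_to_bytes(CODE_WORDS))
print(text)
```

With these inputs the words decode to printable text (`ries to all servers.` followed by `set:<tag>,<opt`), which looks like part of an option help string rather than MIPS instructions. If that reading is right, epc 004206ff was pointing into string data, which would be consistent with a corrupted return address rather than a bad instruction in the binary. It does not by itself show what corrupted it.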

Original comment by lly.dev on 19 Aug 2012 at 5:16

GoogleCodeExporter commented 9 years ago

1. I do not use overclocking
2. I will try to downgrade

Thanks.

Original comment by spameden on 19 Aug 2012 at 9:43

GoogleCodeExporter commented 9 years ago

CPU 0 Unable to handle kernel paging request at virtual address 00000080, epc 
== 80013704, ra == 800089a0
Oops[#3]:
Cpu 0
$ 0   : 00000000 10001c00 00445ec0 00000000
$ 4   : 00030001 00000000 40000080 00000000
$ 8   : 10001c00 1000001e 00fdd376 00000000
$12   : 00000000 f26c9b26 003f74dd 0001e848
$16   : 00000080 00444128 00000000 00000001
$20   : 00000000 00000000 802eb32c 802f0000
$24   : 00433c08 76fa1630                  
$28   : 00444000 00443fb0 fffffbff 800089a0
Hi    : 00000000
Lo    : 0000016c
epc   : 80013704 do_page_fault+0x54/0x370     Tainted: P      D
ra    : 800089a0 ret_from_exception+0x0/0x24
Status: 10001c02    KERNEL EXL 
Cause : 00000008
BadVA : 00000080
PrId  : 00029006
Modules linked in: netconsole sha256 aes dm_crypt dm_mod tun usb_storage sd_mod 
scsi_mod usblp uhci_hcd ehci_hcd usbcore nf_nat_ftp nf_conntrack_ftp pppol2tp 
pppox wl(P) et(P) igs(P) emf(P)
Process zonaws.com (pid: 0, threadinfo=00442000, task=00441f80)
Stack : 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000
Call Trace:
[<80013704>] do_page_fault+0x54/0x370

Code: 24840001  00c3182b  00a0a821 <8e530080> 14600066  afa40024  8f830014  
3c02efff  3442ffff 
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 5 seconds..

It's either related to swap or a real bug in the firmware. Do you
patch anything related to swap?

Original comment by spameden on 19 Aug 2012 at 6:11

GoogleCodeExporter commented 9 years ago

The last fault resulted in a reboot.

Original comment by spameden on 19 Aug 2012 at 6:12

GoogleCodeExporter commented 9 years ago

Since we have to use the ancient 2.6.22 kernel, it can contain bugs, even though we
have done many backports from upstream.

Does your self-compiled firmware use SLAB or SLUB? Are you able to run tests
with swap turned off?

Original comment by lly.dev on 20 Aug 2012 at 6:17

GoogleCodeExporter commented 9 years ago

SLUB, default.

reboot happened with swap turned on.

I just turned it off - will see if it reproduces again.

Original comment by spameden on 20 Aug 2012 at 9:41

GoogleCodeExporter commented 9 years ago

Mem: 21724K used, 7580K free, 0K shrd, 404K buff, 2920K cached
CPU:   0% usr  34% sys   0% nic   0% idle  63% io   0% irq   0% sirq
Load average: 2.57 2.30 2.42 3/42 1025

I'm seeing quite high sys / io usage.

There is no network activity, which is weird. I guess it's related to
/proc/sys/vm/min_free_kbytes.
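
A rough check of the numbers above, as a Python sketch. It uses only the figures from the `top` output and the 6500 value set in the post-boot file. The low-watermark rule (min plus a quarter of min) is an assumption about how 2.6-era kernels derive it, and it ignores per-zone rounding, so treat the result as an estimate.

```python
# Figures copied from this thread, all in KiB.
USED_KB = 21724        # "Mem: 21724K used"
FREE_KB = 7580         # "7580K free"
MIN_FREE_KB = 6500     # echo 6500 > /proc/sys/vm/min_free_kbytes

# top reports used + free, so their sum is taken as total RAM seen by the kernel.
total_kb = USED_KB + FREE_KB

# Share of RAM that min_free_kbytes asks the kernel to keep free.
reserved_share = 100.0 * MIN_FREE_KB / total_kb

# How far current free memory is above the configured minimum.
headroom_kb = FREE_KB - MIN_FREE_KB

# ASSUMPTION: low watermark is roughly min + min/4 on kernels of this era.
low_watermark_kb = MIN_FREE_KB + MIN_FREE_KB // 4
below_low_watermark = FREE_KB < low_watermark_kb

print(f"total RAM:        {total_kb} K")
print(f"reserved share:   {reserved_share:.1f} %")
print(f"headroom:         {headroom_kb} K")
print(f"low watermark:    ~{low_watermark_kb} K")
print(f"free below low:   {below_low_watermark}")
```

Under these assumptions the setting reserves about a fifth of RAM and free memory sits only about 1 MB above the minimum, below the estimated low watermark. That would keep background reclaim busy even when the router is idle, which fits the high sys / io figures, but it is an estimate and not a measurement.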

Original comment by spameden on 20 Aug 2012 at 11:03

GoogleCodeExporter commented 9 years ago

Same issue here.

1.9.2.7-rtn-r4502
WL500W
Overclocked!@300MHz/RAM 128MB.
echo "16384" > /proc/sys/vm/min_free_kbytes

Kernel unaligned instruction access[#1]:
Cpu 0
$ 0   : 00000000 00001025 00000000 00000024
$ 4   : 00000000 fffffffc 00000000 00000000
$ 8   : 7fd4ccac 0000bd36 001f0000 00000001
$12   : 825355c8 00000100 00000400 00000007
$16   : 00000000 00000000 00000000 00000008
$20   : 00000001 8032fb40 852a9f30 8032fb40
$24   : 00000002 77428730
$28   : 852a8000 852a9ef0 00436848 ffffffff
Hi    : 00000001
Lo    : 00000000
epc   : ffffffff 0xfffffffe     Tainted: P
ra    : ffffffff 0xfffffffe
Status: 1000bc03    KERNEL EXL IE
Cause : 00000010
BadVA : ffffffff
PrId  : 00029006
Modules linked in: sata_mv sg mmc_block sdhci ahci usbserial mmc_core libata 
raid_class raid1 md_mod ntfs usb_storage sd_mod scsi_mod usblp ehci_hcd usbcore 
xt_recent nf_nat_ftp nf_conntrack_ftp wl(P) et(P) igs(P) emf(P)
Process htop (pid: 1083, threadinfo=852a8000, task=825353f0)
Stack : 00440370 004417a0 7fd4ce58 00434f78 00000011 8000fcec 00000000 825355c8
        0000000b ffffffff 00000000 77497fab 00200800 00000000 00436848 800089a0
        00000014 00000000 7fd4d378 00000001 7fd4d668 00000000 00000000 00001025
        00420000 00421d8c 00412db8 00000020 00000002 00000020 00000002 774580a0
        00000807 00000800 00000200 00000100 00000400 00000007 00200800 00000000
        ...
Call Trace:
[<8000fcec>] do_cpu+0x198/0x3c4
[<800089a0>] ret_from_exception+0x0/0x24

Code: (Bad address in epc)

It happened with `htop` and `dnsmasq` as well.

Regards,
Rumen

Original comment by rumench...@gmail.com on 20 Aug 2012 at 1:00

GoogleCodeExporter commented 9 years ago

glad i'm not alone :)

Original comment by spameden on 20 Aug 2012 at 1:34

GoogleCodeExporter commented 9 years ago

All the Oopses posted above show only that there is some bug in the kernel VM
subsystem, nothing more.

Deeper analysis requires kernel debugging, so comments like "Me too" are
useless...
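
For anyone reading the dumps, the `Cause` and `BadVA` lines can be decoded mechanically. The sketch below is Python and is only an aid for reading the four dumps posted in this thread. It relies on the MIPS32 layout where the exception code is bits 6..2 of `Cause`; the table lists only the codes relevant here, and the labels in `DUMPS` are made up for the example.

```python
# Exception codes from the MIPS32 Cause register (subset relevant to this thread).
EXC_CODES = {
    2: "TLBL: TLB miss on load or instruction fetch",
    3: "TLBS: TLB miss on store",
    4: "AdEL: address error on load or instruction fetch",
    5: "AdES: address error on store",
}


def exc_code(cause: int) -> int:
    """Extract Cause.ExcCode, which occupies bits 6..2."""
    return (cause >> 2) & 0x1F


def is_word_aligned(addr: int) -> bool:
    """MIPS32 instructions must be fetched from 4-byte aligned addresses."""
    return addr % 4 == 0


def describe(cause: int, badva: int) -> str:
    """Summarise one dump from its Cause and BadVA values."""
    code = exc_code(cause)
    name = EXC_CODES.get(code, "not in this table")
    alignment = "aligned" if is_word_aligned(badva) else "unaligned"
    return f"ExcCode {code} ({name}), BadVA {badva:08x} is {alignment}"


# (label, Cause, BadVA) copied from the dumps posted above.
DUMPS = [
    ("dnsmasq, first report", 0x00000010, 0x004206FF),
    ("dnsmasq, second report", 0x00000008, 0x00000000),
    ("third report", 0x00000008, 0x00000080),
    ("htop report", 0x00000010, 0xFFFFFFFF),
]

for label, cause, badva in DUMPS:
    print(f"{label}: {describe(cause, badva)}")
```

Read this way, the two "unaligned instruction access" dumps are address errors on an odd fetch address, and the other two are TLB misses at or near address zero. In all four the faulting address looks like a garbage pointer. The decoding does not show where that pointer came from.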

Original comment by lly.dev on 20 Aug 2012 at 2:03

GoogleCodeExporter commented 9 years ago

Unfortunately "Deeper analysis requires kernel debug" is beyond my knowledge.
That's the reason why I'm here.

If/when I have any findings, I'll post them here.

Long live and prosper. _\\//

Original comment by rumench...@gmail.com on 20 Aug 2012 at 2:26

GoogleCodeExporter commented 9 years ago

If you could provide some instructions on how to set up a proper debugging
environment, I'd be grateful.

Original comment by spameden on 20 Aug 2012 at 3:13

GoogleCodeExporter commented 9 years ago

Unfortunately, I'm not experienced enough in kernel hacking to write step-by-step
instructions.

First of all, I don't see an answer to my question: is it reproducible with swap
turned off?

My suggestion for a starting point is to enable DEBUG_KERNEL and DEBUG_SLAB
(for SLAB) or SLUB_DEBUG (for SLUB).

Original comment by lly.dev on 4 Oct 2012 at 7:17

GoogleCodeExporter commented 9 years ago

Sorry, I gave up on this and switched back to r3323. It is very stable
(11:29:56 up 20 days, 13:56, load average: 0.13, 0.13, 0.14).

The only issue I have under load is pppd killed with SIGNAL 15. 
(http://code.google.com/p/wl500g/issues/detail?id=270).

I tried debugging before but didn't succeed. Not sure if it happens with SWAP 
turned off.

Original comment by spameden on 4 Oct 2012 at 7:31

GoogleCodeExporter commented 9 years ago

Unreproducible. Probably a problem with swap, but this requires a reproducible
test case.

Original comment by lly.dev on 1 Jan 2013 at 2:40

Changed state: Done

GoogleCodeExporter commented 9 years ago

Yes, it's 99% a problem with swap.

Could you please link this issue to the swap issue?

thanks and happy new year!

Original comment by spameden on 1 Jan 2013 at 3:21

dnsmasq: Kernel unaligned instruction access #345