BUG in toi_get_pageset1_load_addresses() (tuxonice_pagedir.c:282)

i336 commented 8 years ago

On an old Dell Latitude X300 subnotebook, the BUG_ON() in line 282 of tuxonice_pagedir.c is firing on resume with 100% repeatability.

280    do {
281      orig_low_pfn = memory_bm_next_pfn(pageset1_map, 0);
282 *    BUG_ON(orig_low_pfn == BM_END_OF_MAP);
283      orig_page = pfn_to_page(orig_low_pfn);
284    } while (PageHighMem(orig_page) ||
285        PagePageset1Copy(orig_page));

What can I printk, and how? Or what other debug info can I collect?

I'm using the tuxonice-4.6 branch cloned directly from GitHub, synced a few days ago.

A bit of Googling turned up one other thread from Oct 2015 with this exact same issue which didn't really go anywhere.

I wondered whether the resume code might be getting confused by my having multiple swap partitions, so I deleted all except one; no effect. I also tried using a hibernation file instead of swap; no effect there either. ACPI S5 is also a bit flaky on the machine in question, occasionally getting stuck at final poweroff with all keyboard LEDs lit. Using S4 instead produces "Resuming from S4..." at POST, but, again, no effect. (When the system froze under S5 and I hard-powered it off and back on, those resumes also crashed.)

Here's my kernel config and hibernate --bug-report output, along with the full boot log of the panic below. All are gzipped (and zcat/zless-able).

I'm very happy to try any ideas you may have - I'm capturing kernel boot/panic info via RS232, so printk() would be easy to iterate with. (GDB over serial may or may not be a bit much for me to chew; I'm not sure.)

Here are the most relevant portions of the kernel panic (for Google et al.):

[    0.000000] Linux version 4.6.0 (i336@zukhyri) (gcc version 6.1.1 20160501 (GCC) ) #15 PREEMPT Thu Jul 7 12:46:20 AEST 2016
...
[    1.909719] TuxOnIce 3.3 (http://tuxonice.net)
...
[    2.025369] TuxOnIce: Image found.
...
[    2.062629] Freezing user space processes ... (elapsed 0.000 seconds) done.
[    2.075450] ------------[ cut here ]------------
[    2.076011] kernel BUG at ../kernel/power/tuxonice_pagedir.c:282!
[    2.076011] invalid opcode: 0000 [#1] PREEMPT 
[    2.076011] Modules linked in:
[    2.076011] CPU: 0 PID: 1 Comm: swapper Not tainted 4.6.0 #15
[    2.076011] Hardware name: Dell Computer Corporation Latitude X300/PPPPPP, BIOS A10   11/07/2005
[    2.076011] task: f6850000 ti: f684a000 task.ti: f684a000
[    2.076011] EIP: 0060:[<c106d68f>] EFLAGS: 00010246 CPU: 0
[    2.076011] EIP is at toi_get_pageset1_load_addresses+0x4df/0x7a0
[    2.076011] EAX: ffffffff EBX: f6f69fa0 ECX: ffffffff EDX: f69be004
[    2.076011] ESI: f7105680 EDI: 00000000 EBP: ffffffff ESP: f684be2c
[    2.076011]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    2.076011] CR0: 80050033 CR2: 00000000 CR3: 015bd000 CR4: 000006d0
[    2.076011] Stack:
[    2.076011]  f6850034 f685002c c10615d2 c1061a9b 00000000 c153d4e0 f685002c c153d51c
[    2.076011]  c106208f 00047637 00000000 f713d000 000000ab f6000800 f6000804 00000001
[    2.076011]  00000000 00000001 00000000 f713d000 f713d020 c153e7a0 f6bc7000 f6bc4000
[    2.076011] Call Trace:
[    2.076011]  [<c10615d2>] ? set_next_entity+0x62/0xa0
[    2.076011]  [<c1061a9b>] ? put_prev_task_fair+0x3b/0x70
[    2.076011]  [<c106208f>] ? pick_next_task_fair+0x7f/0x120
[    2.076011]  [<c106cb49>] ? read_pageset1+0x669/0x8c0
[    2.076011]  [<c107d240>] ? hibernation_restore+0xf0/0xf0
[    2.076011]  [<c106932b>] ? do_load_atomic_copy+0x6b/0xb0
[    2.076011]  [<c10695f3>] ? do_check_can_resume+0x33/0x70
[    2.076011]  [<c1069cf5>] ? toi_try_resume+0x65/0x80
[    2.076011]  [<c1069d35>] ? toi_sys_power_disk_try_resume+0x25/0x40
[    2.076011]  [<c107d254>] ? software_resume+0x14/0x2a0
[    2.076011]  [<c1000403>] ? do_one_initcall+0x73/0x1b0
[    2.076011]  [<c107d240>] ? hibernation_restore+0xf0/0xf0
[    2.076011]  [<c100040e>] ? do_one_initcall+0x7e/0x1b0
[    2.076011]  [<c107d240>] ? hibernation_restore+0xf0/0xf0
[    2.076011]  [<c1058bd3>] ? parse_args+0x283/0x4c0
[    2.076011]  [<c1562a95>] ? kernel_init_freeable+0xbf/0x155
[    2.076011]  [<c1562ab2>] ? kernel_init_freeable+0xdc/0x155
[    2.076011]  [<c13fa288>] ? kernel_init+0x8/0x100
[    2.076011]  [<c105eb64>] ? schedule_tail+0x14/0x50
[    2.076011]  [<c13fecae>] ? ret_from_kernel_thread+0x6/0x34
[    2.076011]  [<c13fecc8>] ? ret_from_kernel_thread+0x20/0x34
[    2.076011]  [<c13fa280>] ? rest_init+0x70/0x70
[    2.076011] Code: d2 1a 00 c1 ff 05 89 f9 89 c2 a1 10 cc 5c c1 e8 48 05 01 00 85 c0 74 15 a1 20 cc 5c c1 31 d2 e8 a8 05 01 00 83 f8 ff 89 c5 75 91 <0f> 0b 8b 15 a8 b6 5c c1 89 c7 89 d8 89 54 24 24 e8 dc 91 07 00
[    2.076011] EIP: [<c106d68f>] toi_get_pageset1_load_addresses+0x4df/0x7a0 SS:ESP 0068:f684be2c
[    2.324706] ---[ end trace dac3a6d64d3c60b2 ]---
[    2.329413] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.329413] 
[    2.330387] Kernel Offset: disabled
[    2.330387] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.330387] 
...

i336 commented 8 years ago

Some additional details (TL;DR: no success as yet):

This bug also occurs on a Toshiba Tecra M2. This particular machine unfortunately does not have a serial port, so panics would be less trivial to gather from this unit; I may be able to get netconsole working, but the PHY may not initialize early enough to be useful. However, as far as I can see, the bug's characteristics are exactly the same as on the Dell I originally reported on.

I thought I'd follow up on the one bit of info in the other thread I found: the reporter mentioned that he experienced the issue upon upgrading to 4.1.9, and I was working with the 4.6 branch, so I tried building a copy of tuxonice-4.0(.9). Unfortunately, the bug reproduces on 4.0.9 exactly as it does on 4.6.0.

I'd like to reiterate that I'm somewhat motivated to find a fix for this, and willing to make a bit of effort.

[    1.911966] TuxOnIce 3.3 (http://tuxonice.net)
...
[    1.921843] TuxOnIce: Image found.
...
[    2.035980] Freezing user space processes ... (elapsed 0.000 seconds) done.
[    2.048516] ------------[ cut here ]------------
[    2.049012] kernel BUG at ../kernel/power/tuxonice_pagedir.c:282!
[    2.049012] invalid opcode: 0000 [#1] PREEMPT 
[    2.049012] Modules linked in:
[    2.049012] CPU: 0 PID: 1 Comm: swapper Not tainted 4.0.9 #1
[    2.049012] Hardware name: Dell Computer Corporation Latitude X300/PPPPPP, BIOS A10   11/07/2005
[    2.049012] task: f6848000 ti: f684c000 task.ti: f684c000
[    2.049012] EIP: 0060:[<c1064ac8>] EFLAGS: 00010246 CPU: 0
[    2.049012] EIP is at toi_get_pageset1_load_addresses+0x538/0x7f0
[    2.049012] EAX: ffffffff EBX: 00000000 ECX: f6ae7014 EDX: f6ae7004
[    2.049012] ESI: ffffffff EDI: f710d9e0 EBP: f667d000 ESP: f684de2c
[    2.049012]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    2.049012] CR0: 8005003b CR2: 00000000 CR3: 0157e000 CR4: 000006d0
[    2.049012] Stack:
[    2.049012]  f684821c f6848000 f684802c c15065dc c15065a0 f684802c c15065dc f684802c
[    2.049012]  c1059ebc 00000000 00000000 f71547a0 000000f1 f6bbdb4c f6bbdb48 00000001
[    2.049012]  00000000 00000001 f71547a0 f713d000 00000000 f6bb9000 f6bb6000 f684dea8
[    2.049012] Call Trace:
[    2.049012]  [<c1059ebc>] ? pick_next_task_fair+0xfc/0x170
[    2.049012]  [<c1063f0b>] ? read_pageset1+0x66b/0x8c0
[    2.049012]  [<c1074660>] ? hibernation_restore+0xf0/0xf0
[    2.049012]  [<c106072b>] ? do_load_atomic_copy+0x6b/0xb0
[    2.049012]  [<c10609f3>] ? do_check_can_resume+0x33/0x70
[    2.049012]  [<c10610a5>] ? toi_try_resume+0x65/0x80
[    2.049012]  [<c10610e5>] ? toi_sys_power_disk_try_resume+0x25/0x40
[    2.049012]  [<c105e0d2>] ? try_tuxonice_resume+0x22/0x60
[    2.049012]  [<c11ef879>] ? kvasprintf+0x49/0x60
[    2.049012]  [<c1074674>] ? software_resume+0x14/0x2a0
[    2.049012]  [<c1000403>] ? do_one_initcall+0x73/0x1b0
[    2.049012]  [<c1074660>] ? hibernation_restore+0xf0/0xf0
[    2.049012]  [<c100040e>] ? do_one_initcall+0x7e/0x1b0
[    2.049012]  [<c1074660>] ? hibernation_restore+0xf0/0xf0
[    2.049012]  [<c152648a>] ? repair_env_string+0xf/0x50
[    2.049012]  [<c1050912>] ? parse_args+0x242/0x400
[    2.049012]  [<c1526ac3>] ? kernel_init_freeable+0xd3/0x14b
[    2.049012]  [<c13cfee8>] ? kernel_init+0x8/0xe0
[    2.049012]  [<c105654e>] ? schedule_tail+0x1e/0x60
[    2.049012]  [<c13d5d26>] ? ret_from_kernel_thread+0x6/0x30
[    2.049012]  [<c13d5d40>] ? ret_from_kernel_thread+0x20/0x30
[    2.049012]  [<c13cfee0>] ? rest_init+0x70/0x70
[    2.049012] Code: e9 c1 f9 05 89 c2 a1 90 cc 58 c1 e8 33 05 01 00 85 c0 74 19 a1 a0 cc 58 c1 31 d2 e8 93 05 01 00 83 f8 ff 89 c6 0f 85 78 ff ff ff <0f> 0b 89 d8 8b 2d 48 b7 58 c1 e8 29 f6 06 00 8b 1d 48 b7 58 c1
[    2.049012] EIP: [<c1064ac8>] toi_get_pageset1_load_addresses+0x538/0x7f0 SS:ESP 0068:f684de2c
[    2.293895] ---[ end trace 7a5278bb80335815 ]---
[    2.298557] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.298557] 
[    2.299514] Kernel Offset: 0x0 from 0xc1000000 (relocation range: 0xc0000000-0xf7ffdfff)
[    2.299514] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

i336 commented 8 years ago

More news! This time I've made a tiny bit of progress.

The tuxonice-3.9 branch suspends and resumes successfully, at least on the Dell I originally reported on. So far I've just done one cycle, but I'm confident I'm out of the red zone, considering I was getting 100% consistent failure before.

The only problem is that I don't really want to use 3.9; I want to play with 4.6 and beyond. Admittedly, I don't need any 4.x-specific functionality or drivers, but being able to upgrade would be nice.

Getting 3.9 compiled was quite a challenge; let's just say I needed those Slackware chroots anyway... (Arch, my base OS, is on GCC 6.1, while linux-4.0 used GCC 5, and -3.9 used GCC 4. Yes, I could have just copied compiler-gccN.h from the relevant mailing-list patch, but I wanted to keep things as stock/vanilla/simple as possible considering the circumstances.)

Before I went chroot-building, I stumbled on the Ubuntu TuxOnIce PPA, and gave those kernels a shot. I tested 3.0.0, 3.2.0, 4.2.0 and 4.4.0; the 3.x series worked, the 4.x series crashed exactly as described in my previous messages.

If I think of anything else or anything new happens I'll update this thread; if there are no updates that means I'm still stuck on 3.9.

Once again, I'd like to say that I'm still very interested to figure out why this is happening and see if it can be fixed.

In terms of reproducing this:

- I would suggest targeting an old laptop from 2005-2007 with 1-2GB of RAM.
- Kernel configuration doesn't seem to matter, considering that the PPA kernels fail exactly like my own built versions do (using the .config attached to my 1st post).
- Again, 3.9(.11) works; 4.0(.9) fails. Considering that you don't have any intermediary branches between those two points, I don't think I can go much deeper in terms of bisecting this.

If you can't reproduce this, suggestions about debug info I can supply (or requests to test on more machines, which I can make the effort to do if it would be helpful) are welcome.

uorol commented 7 years ago

Dear @i336, I have hit the same issue on my Android board. From tracing the code, I think it might be because the pages we allocate alternate between high and low memory.

1.) The new pages we allocate are recorded in pageset1_copy_map. There are many more low-mem pages than high-mem pages, and they are not allocated contiguously. (We might get HIGH - LOW - HIGH - LOW - HIGH pages...)

2.) We then go through pageset1_copy_map in this for loop:

for (pfn = memory_bm_next_pfn(pageset1_copy_map, 0); pfn != BM_END_OF_MAP; pfn = memory_bm_next_pfn(pageset1_copy_map, 0))

What it does is:

- Skip pages that already match pageset1_map.
- If this is a high-mem page, find the next high-mem page in pageset1_map and place it on a high-mem page from pageset1_copy_map (or on a low-mem page while low_pages_for_highmem is not yet zero).
- If this is a low-mem page, find the next low-mem page in pageset1_map (same as below).

However, memory_bm_next_pfn increments bm->cur[index].node_bit and returns that pfn, so the position in pageset1_map only ever moves forward. I think that is why we always hit the BUG.

For now I have split the for loop into two parts: search for high-mem pages first, then call memory_bm_position_reset(pageset1_map); memory_bm_position_reset(pageset1_copy_map); and search for low-mem pages.

This avoids hitting that BUG. (However, I now hit a different ENOSPC error, which is still under investigation...)

NigelCunningham / tuxonice-kernel-old

BUG in toi_get_pageset1_load_addresses() (tuxonice_pagedir.c:282) #17