free_pages NULL pointer dereference during mkfs.xfs

venkrishr commented 7 years ago

mkfs.xfs crashes dmdedup and we have been able to locate the BUG to free_pages() function call in the my_endio() function in dm-dedup-rw.c file... You can see the call trace below:

[ 499.446133] Call Trace: [ 499.446652] [ 499.447070] [] ? free_pages.part.66+0x40/0x50 [ 499.448394] [] free_pages+0x13/0x20 [ 499.449450] [] my_endio+0x4d/0x70 [dm_dedup] [ 499.450658] [] bio_endio+0x5b/0xa0 [ 499.451700] [] blk_update_request+0x90/0x360 [ 499.452915] [] scsi_end_request+0x34/0x1e0 [ 499.454107] [] scsi_io_completion+0x119/0x6c0 [ 499.455359] [] scsi_finish_command+0xcf/0x130 [ 499.456577] [] scsi_softirq_done+0x137/0x160 [ 499.457743] [] blk_done_softirq+0x90/0xc0 [ 499.458859] [] __do_softirq+0xf4/0x2d0 [ 499.459992] [] irq_exit+0x125/0x130 [ 499.461021] [] do_IRQ+0x5a/0xf0 [ 499.462002] [] common_interrupt+0x6d/0x6d

The following is the code corresponding to the crash:

if (rw == WRITE || rw == READ) { bv = bio_iovec(clone); if (bv.bv_page) { free_pages((unsigned long)page_address(bv.bv_page),0); bv.bv_page = NULL; } }

I tried to print the address of clone and its corresponding page for better clarity and it seems clone seems to be allocated in the same address and its page alternates between two addresses:

[ Time ] READ or WRITE clone clone_address page page_address

[ 584.827534] WRITE clone ffff88007b0d86c0 page ffffea0001de5f80 [ 584.830866] WRITE clone ffff88007b0d86c0 page ffffea0001dbd940 [ 584.834205] WRITE clone ffff88007b0d86c0 page ffffea0001de5f80 [ 584.837604] WRITE clone ffff88007b0d86c0 page ffffea0001dbd940 [ 584.840956] WRITE clone ffff88007b0d86c0 page ffffea0001de5f80 [ 584.844387] WRITE clone ffff880077888d80 page ffffea0001de0480 [ 584.847775] WRITE clone ffff88007b0d86c0 page ffffea0001dbd940 [ 584.851202] WRITE clone ffff88007b0d86c0 page ffffea0001de0480 [ 584.854625] WRITE clone ffff88007b0d86c0 page ffffea0001dbd940 [ 584.857990] WRITE clone ffff88007b0d86c0 page ffffea0001de0480 [ 584.862667] WRITE clone ffff88007b0d86c0 page (null) [ 584.865625] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c

This seems to be a RACE condition and I can avoid the BUG if I comment out this part of the code(but obviously leaking memory), locking is another solution but I'm trying to understand how this situation becomes possible in the first place.

We construct only one clone per bio in dm-dedup-rw.c and it had only one page(since total size of chunk will not exceed 4KB) and it is destroyed in the my_endio function. How is it possible that two threads are trying to free the same page? Any help would be much appreciated.

sectorsize512 commented 7 years ago

Could you paste the lines in the crash dump preceding the call trace?

venkrishr commented 7 years ago

I have put a BUG_ON statement like this:

BUG_ON(bio_page(clone) == NULL);

This is the dump before the Call Trace

[ 1511.639626] ------------[ cut here ]------------ [ 1511.642782] kernel BUG at drivers/md/dm-dedup-rw.c:110! [ 1511.646158] invalid opcode: 0000 [#1] SMP [ 1511.649201] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon serio_raw vmw_vmci sg parport_pc shpchp i2c_piix4 pata_acpi parport ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic vmwgfx drm_kms_helper ttm drm ata_piix crc32c_intel mptspi libata e1000 scsi_transport_spi mptscsih mptbase i2c_core floppy [ 1511.676299] CPU: 2 PID: 158 Comm: kworker/2:2 Tainted: G E 3.19.3+ #6 [ 1511.680782] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015 [ 1511.687932] Workqueue: events fb_deferred_io_work [ 1511.691361] task: ffff880035dc0d70 ti: ffff880035dc8000 task.ti: ffff880035dc8000 [ 1511.695906] RIP: 0010:[] [] my_endio+0xfd/0x100 [dm_dedup] [ 1511.701087] RSP: 0018:ffff88007fd03cf8 EFLAGS: 00010246 [ 1511.704741] RAX: ffff8800779a1bd8 RBX: ffff8800779a1b40 RCX: 0000000000000000 [ 1511.709759] RDX: 0000000000000000 RSI: ffff88007fd0e6d8 RDI: ffff88007fd0e6d8 [ 1511.714223] RBP: ffff88007fd03d08 R08: 0000000000000000 R09: ffff8800779e3c20 [ 1511.718709] R10: 00000000000005b1 R11: ffffc90008f107f8 R12: 0000000000000000 [ 1511.723130] R13: 0000000000000000 R14: 0000000000001000 R15: ffff880035c9ce60 [ 1511.727574] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 [ 1511.732394] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1511.736243] CR2: 00007ff836d67000 CR3: 000000007775e000 CR4: 00000000000407e0 [ 1511.740703] Stack: [ 1511.743004] ffff8800779a1b40 00000000fffffffb ffff88007fd03d38 ffffffff81329c8b [ 1511.747906] 0000000000000001 ffff8800779a1b40 ffff8800779a1b40 0000000000001000 [ 1511.752830] ffff88007fd03d88 ffffffff81331060 000000807fd03d68 0000000000000000 [ 1511.757700] Call Trace:

sectorsize512 commented 7 years ago

1) So, the call trace you posted in the beginning is with BUG_ON()?

2) Where exactly did you put this BUG_ON? (It seems that the code actually checks for bv.bv_page not being NULL). In general, please, paste all changes you made to the code when you collected corresponding crash dumps in a form of a patch. This is a conventional and very conventional practice that helps to avoid confusions.

Thanks!

venkrishr commented 7 years ago

The call trace at the beginning is without any changes to the code, though both call traces before and after the changes are the same as they crash at the same place.

Following is the dump without any changes to the code:

[71862.062151] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[71862.065902] IP: [<ffffffff81187aa5>] __free_pages+0x5/0x30
[71862.068507] PGD 7790e067 PUD 7bea0067 PMD 0 
[71862.070657] Oops: 0002 [#1] SMP 
[71862.072308] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper ppdev cryptd vmw_balloon sg serio_raw parport_pc parport shpchp vmw_vmci i2c_piix4 pata_acpi ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic vmwgfx drm_kms_helper ttm drm mptspi crc32c_intel scsi_transport_spi ata_piix mptscsih mptbase libata e1000 i2c_core floppy
[71862.093987] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E  3.19.3+ #6
[71862.097175] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[71862.101881] task: ffff88007cd3b5c0 ti: ffff88007cd4c000 task.ti: ffff88007cd4c000
[71862.105200] RIP: 0010:[<ffffffff81187aa5>]  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[71862.108800] RSP: 0018:ffff88007fd03cb0  EFLAGS: 00010247
[71862.111169] RAX: ffffea0000000000 RBX: ffff880077ff6600 RCX: ffff880077ff6688
[71862.114338] RDX: 0000000080000000 RSI: 0000000000000000 RDI: 0000000000000000
[71862.117487] RBP: ffff88007fd03cb8 R08: 0000000000000001 R09: 0000000000000000
[71862.120702] R10: ffff8800355c2c80 R11: 0000000000001000 R12: 00000000fffffffb
[71862.123868] R13: 0000000000000000 R14: 0000000000001000 R15: ffff8800356d6000
[71862.127016] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[71862.130554] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[71862.133134] CR2: 000000000000001c CR3: 000000007744e000 CR4: 00000000000407e0
[71862.136527] Stack:
[71862.137819]  ffffffff81187b10 ffff88007fd03cc8 ffffffff81187b33 ffff88007fd03ce8
[71862.141995]  ffffffffa036478d 0000000000000000 ffff880077ff6600 ffff88007fd03d18
[71862.145600]  ffffffff81329c8b ffff88007fd03d08 ffff880077ff6600 ffff880077ff6600
[71862.149182] Call Trace:
[71862.151970]  <IRQ> 
[71862.152906]  [<ffffffff81187b10>] ? free_pages.part.66+0x40/0x50
[71862.158685]  [<ffffffff81187b33>] free_pages+0x13/0x20
[71862.162436]  [<ffffffffa036478d>] my_endio+0x4d/0x70 [dm_dedup]
[71862.166561]  [<ffffffff81329c8b>] bio_endio+0x5b/0xa0
[71862.170266]  [<ffffffff81331060>] blk_update_request+0x90/0x360
[71862.174329]  [<ffffffff814b3574>] scsi_end_request+0x34/0x1e0
[71862.178239]  [<ffffffff814b5729>] scsi_io_completion+0x119/0x6c0
[71862.182289]  [<ffffffff810e7ff6>] ? hrtimer_get_next_event+0xb6/0xc0
[71862.186440]  [<ffffffff814aab7f>] scsi_finish_command+0xcf/0x130
[71862.190400]  [<ffffffff814b4db7>] scsi_softirq_done+0x137/0x160
[71862.194334]  [<ffffffff813381f0>] blk_done_softirq+0x90/0xc0
[71862.198146]  [<ffffffff81083ae4>] __do_softirq+0xf4/0x2d0
[71862.201808]  [<ffffffff81083fb5>] irq_exit+0x125/0x130
[71862.205366]  [<ffffffff8165814a>] do_IRQ+0x5a/0xf0
[71862.208740]  [<ffffffff81655fad>] common_interrupt+0x6d/0x6d
[71862.212410]  <EOI> 
[71862.213321]  [<ffffffff8105e9c6>] ? native_safe_halt+0x6/0x10
[71862.218344]  [<ffffffff810ded48>] ? rcu_eqs_enter+0x68/0x90
[71862.221938]  [<ffffffff8101f68e>] default_idle+0x1e/0xc0
[71862.225345]  [<ffffffff8102004f>] arch_cpu_idle+0xf/0x20
[71862.228739]  [<ffffffff810c24e8>] cpu_startup_entry+0x378/0x3b0
[71862.232416]  [<ffffffff810f404c>] ? clockevents_register_device+0xbc/0x120
[71862.236446]  [<ffffffff8104a9cc>] start_secondary+0x18c/0x200
[71862.239989] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a9 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 63 
[71862.255984] RIP  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[71862.259618]  RSP <ffff88007fd03cb0>
[71862.262258] CR2: 000000000000001c
[71862.266966] ---[ end trace 4e0c7f4f3a78ee78 ]---
[71862.270068] Kernel panic - not syncing: Fatal exception in interrupt
[71862.274057] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[71862.314774] Rebooting in 5 seconds..

I made the following changes including BUG_ON:

if (rw == WRITE || rw == READ) {
        struct page *pg = bio_page(clone);
        printk("WRITE clone %p page %p\n", clone, pg);
        BUG_ON(bio_page(clone) == NULL);
        free_pages((unsigned long) page_address(pg), 0);
}

The dump after the changes:

[  721.236142] ------------[ cut here ]------------
[  721.239254] kernel BUG at drivers/md/dm-dedup-rw.c:110!
[  721.242639] invalid opcode: 0000 [#1] SMP
[  721.245693] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev serio_raw vmw_balloon sg vmw_vmci shpchp parport_pc parport i2c_piix4 pata_acpi ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic vmwgfx drm_kms_helper ttm drm e1000 mptspi scsi_transport_spi ata_piix mptscsih crc32c_intel mptbase libata i2c_core floppy
[  721.271404] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E  3.19.3+ #6
[  721.275506] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[  721.282065] task: ffff88007cd3b5c0 ti: ffff88007cd4c000 task.ti: ffff88007cd4c000
[  721.286277] RIP: 0010:[<ffffffffa032283d>]  [<ffffffffa032283d>] my_endio+0xfd/0x100 [dm_dedup]
[  721.291072] RSP: 0018:ffff88007fd03cd8  EFLAGS: 00010246
[  721.294424] RAX: ffff8800790e9158 RBX: ffff8800790e90c0 RCX: 0000000000000000
[  721.298513] RDX: 0000000000000000 RSI: ffff88007fd0e6d8 RDI: 0000000000000300
[  721.302604] RBP: ffff88007fd03ce8 R08: 0000000000000000 R09: ffff880077a9dc20
[  721.306702] R10: 0000000000002e33 R11: ffffc90008a107f8 R12: 0000000000000000
[  721.310789] R13: 0000000000000000 R14: 0000000000001000 R15: ffff8800774e02e0
[  721.314861] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  721.319337] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  721.322930] CR2: 00007f5da7765000 CR3: 0000000034cae000 CR4: 00000000000407e0
[  721.327044] Stack:
[  721.329157]  ffff8800790e90c0 00000000fffffffb ffff88007fd03d18 ffffffff81329c8b
[  721.333616]  ffff88007fd03d08 ffff8800790e90c0 ffff8800790e90c0 0000000000001000
[  721.338111]  ffff88007fd03d68 ffffffff81331060 ffff88007fd03da8 0000000000000000
[  721.342576] Call Trace:
[  721.344883]  <IRQ>
[  721.345700]  [<ffffffff81329c8b>] bio_endio+0x5b/0xa0
[  721.350515]  [<ffffffff81331060>] blk_update_request+0x90/0x360
[  721.354185]  [<ffffffff814b3574>] scsi_end_request+0x34/0x1e0
[  721.357792]  [<ffffffff814b5729>] scsi_io_completion+0x119/0x6c0
[  721.361484]  [<ffffffff810e7ff6>] ? hrtimer_get_next_event+0xb6/0xc0
[  721.365299]  [<ffffffff814aab7f>] scsi_finish_command+0xcf/0x130
[  721.368917]  [<ffffffff814b4db7>] scsi_softirq_done+0x137/0x160
[  721.372538]  [<ffffffff813381f0>] blk_done_softirq+0x90/0xc0
[  721.376076]  [<ffffffff81083ae4>] __do_softirq+0xf4/0x2d0
[  721.379439]  [<ffffffff81083fb5>] irq_exit+0x125/0x130
[  721.382769]  [<ffffffff8165814a>] do_IRQ+0x5a/0xf0

sectorsize512 commented 7 years ago

Hmm, but it seems one check is missing in your patched code (compared to the version in git):

https://github.com/dmdedup/dmdedup/blob/master/dm-dedup-rw.c#L107

Am I misunderstanding something?...

Vasily

venkrishr commented 7 years ago

Yes indeed, I do need to check whether it is NULL, this patch is just for debugging purposes... But why is it crashing in the unmodified code even with the check in place? That was my question... Also why is the page being explicitly set to NULL after free_pages?

sectorsize512 commented 7 years ago

Can I ask you to reverify (by rerunning experiment on original code) that there is in fact a bug? I'm concerned because your BUG_ON seems like a replacement for the earlier check which you removed. So, it is hard to trust the results. Therefore, just to make sure we are all on the same page, could you rerun experiment form scratch and paste the crash dump here again? Thanks!

venkrishr commented 7 years ago

Sure. I have rerun the experiment on the original code and there IS a bug. It crashes, the crash dump is as follows:

[ 9752.519252] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 9752.522705] IP: [<ffffffff81187aa5>] __free_pages+0x5/0x30
[ 9752.525173] PGD 77be9067 PUD 762e7067 PMD 7747b067 PTE 0
[ 9752.527763] Oops: 0002 [#1] SMP 
[ 9752.529315] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper ppdev cryptd sg serio_raw vmw_balloon parport_pc shpchp parport vmw_vmci i2c_piix4 pata_acpi ip_tables ext4 mbcache jbd2 sr_mod cdrom vmwgfx sd_mod ata_generic drm_kms_helper ttm drm mptspi crc32c_intel ata_piix scsi_transport_spi mptscsih libata i2c_core e1000 mptbase floppy
[ 9752.550221] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E  3.19.3+ #6
[ 9752.553298] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[ 9752.557881] task: ffff88007cd3b5c0 ti: ffff88007cd4c000 task.ti: ffff88007cd4c000
[ 9752.561121] RIP: 0010:[<ffffffff81187aa5>]  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[ 9752.564604] RSP: 0018:ffff88007fd03cb0  EFLAGS: 00010203
[ 9752.566895] RAX: ffffea0000000000 RBX: ffff880035c69300 RCX: ffff880035c69388
[ 9752.569946] RDX: 0000000080000000 RSI: 0000000000000000 RDI: 00000000008d5f80
[ 9752.573004] RBP: ffff88007fd03cb8 R08: 0000000000000001 R09: 0000000000000000
[ 9752.576090] R10: ffff880035702d40 R11: 0000000000001000 R12: 00000000fffffffb
[ 9752.579144] R13: 0000000000000000 R14: 0000000000001000 R15: ffff880075da7700
[ 9752.582231] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 9752.585686] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 9752.588155] CR2: 00000000008d5f9c CR3: 0000000035483000 CR4: 00000000000407e0
[ 9752.591233] Stack:
[ 9752.592142]  ffffffff81187b10 ffff88007fd03cc8 ffffffff81187b33 ffff88007fd03ce8
[ 9752.595627]  ffffffffa037278d 0000000000000000 ffff880035c69300 ffff88007fd03d18
[ 9752.599066]  ffffffff81329c8b ffff88007fd03d08 ffff880035c69300 ffff880035c69300
[ 9752.602516] Call Trace:
[ 9752.605235]  <IRQ> 
[ 9752.606143]  [<ffffffff81187b10>] ? free_pages.part.66+0x40/0x50
[ 9752.611840]  [<ffffffff81187b33>] free_pages+0x13/0x20
[ 9752.615551]  [<ffffffffa037278d>] my_endio+0x4d/0x70 [dm_dedup]
[ 9752.619573]  [<ffffffff81329c8b>] bio_endio+0x5b/0xa0
[ 9752.623168]  [<ffffffff81331060>] blk_update_request+0x90/0x360
[ 9752.627096]  [<ffffffff814b3574>] scsi_end_request+0x34/0x1e0
[ 9752.630930]  [<ffffffff814b5729>] scsi_io_completion+0x119/0x6c0
[ 9752.634876]  [<ffffffff810e7ff6>] ? hrtimer_get_next_event+0xb6/0xc0
[ 9752.638928]  [<ffffffff814aab7f>] scsi_finish_command+0xcf/0x130
[ 9752.642846]  [<ffffffff814b4db7>] scsi_softirq_done+0x137/0x160
[ 9752.646717]  [<ffffffff813381f0>] blk_done_softirq+0x90/0xc0
[ 9752.650481]  [<ffffffff81083ae4>] __do_softirq+0xf4/0x2d0
[ 9752.654123]  [<ffffffff81083fb5>] irq_exit+0x125/0x130
[ 9752.657580]  [<ffffffff8165814a>] do_IRQ+0x5a/0xf0
[ 9752.660858]  [<ffffffff81655fad>] common_interrupt+0x6d/0x6d
[ 9752.664467]  <EOI> 
[ 9752.665354]  [<ffffffff8105e9c6>] ? native_safe_halt+0x6/0x10
[ 9752.670340]  [<ffffffff810ded48>] ? rcu_eqs_enter+0x68/0x90
[ 9752.673849]  [<ffffffff8101f68e>] default_idle+0x1e/0xc0
[ 9752.677191]  [<ffffffff8102004f>] arch_cpu_idle+0xf/0x20
[ 9752.680545]  [<ffffffff810c24e8>] cpu_startup_entry+0x378/0x3b0
[ 9752.684101]  [<ffffffff810f404c>] ? clockevents_register_device+0xbc/0x120
[ 9752.688054]  [<ffffffff8104a9cc>] start_secondary+0x18c/0x200
[ 9752.691523] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a9 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 63 
[ 9752.707103] RIP  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[ 9752.710672]  RSP <ffff88007fd03cb0>
[ 9752.713271] CR2: 00000000008d5f9c
[ 9752.717922] ---[ end trace 228fa87d732a2388 ]---
[ 9752.720974] Kernel panic - not syncing: Fatal exception in interrupt
[ 9752.724952] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 9752.736812] Rebooting in 5 seconds..

sectorsize512 commented 7 years ago

Thanks! So, it seems the process is trying to reference some field in a structure at offset 1c, and the pointer to the structure turns out to be NULL. Could you figure out which pointer is equal to NULL (by looking through the code, not running experiments)?

venkrishr commented 7 years ago

Yes, the pointer to the page is NULL i.e the struct page pointer in the cloned bio is NULL. The pointer is inside struct bio_vec(vector list) inside the struct bio.

struct bio_vec {
         struct page *bv_page;
         unsigned int bv_len;
         unsigned int bv_offset;
};

We are trying to free the struct page which we allocated while creating the clone. The pointer which can be obtained from "bv.bv_page" or returned from "bio_page(clone)" turns out to be NULL.

This should not happen because we are allocating a new page every time a new clone is being created.

sectorsize512 commented 7 years ago

OK, let's not jump to the conclusions so fast :)

Where exactly does the "1c" number in the crash dump come from? "1c" is 28 bytes in decimal. So, the offset of the field within the dereferenced structure should be 28 bytes. (Only in this case if the pointer to that structure is NULL, then the total dereferenced address is equal to NULL + 28 = 1c.)

So, if I understand you correctly, you claim that bv_page pointer in struct bio_vec is NULL. Right? In our code, this pointer is first passed to page_address() function (line 108). How come we don't crash in that function?

sectorsize512 commented 7 years ago

Also, to simplify the code (not to fix the problem) let's not translate page to virtual address, but let's use __free_pages() function directly. I.e.,

instead of

free_pages((unsigned long)page_address(bv.bv_page), 0);

let's have:

__free_pages(bv.bv_page, 0)

Thanks.

venkrishr commented 7 years ago

Sure. I made the changes to directly call __free_pages without translating to virtual address. This is the new crash dump:

[38436.122665] BUG: unable to handle kernel NULL pointer dereference at 0000000000000032
[38436.126673] IP: [<ffffffff81187aa5>] __free_pages+0x5/0x30
[38436.129591] PGD 0
[38436.130995] Oops: 0002 [#1] SMP
[38436.133029] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper ppdev cryptd vmw_balloon serio_raw sg parport_pc parport shpchp vmw_vmci pata_acpi i2c_piix4 ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih ata_piix e1000 crc32c_intel libata mptbase i2c_core floppy [last unloaded: dm_dedup]
[38436.158936] CPU: 2 PID: 12847 Comm: kworker/2:0 Tainted: G            E  3.19.3+ #6
[38436.162972] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[38436.169306] Workqueue: events fb_deferred_io_work
[38436.172289] task: ffff880035755e10 ti: ffff88004c4ac000 task.ti: ffff88004c4ac000
[38436.176505] RIP: 0010:[<ffffffff81187aa5>]  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[38436.180897] RSP: 0018:ffff88007fd03cd0  EFLAGS: 00010246
[38436.184185] RAX: 0000000000000032 RBX: ffff880035ca3d80 RCX: 0000000000000000
[38436.188344] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000016
[38436.192481] RBP: ffff88007fd03ce8 R08: 0000000000000000 R09: ffff880077a5fc20
[38436.196653] R10: 0000000000002e4e R11: ffffc90008e827f8 R12: 0000000000000016
[38436.200770] R13: 0000000000000000 R14: 0000000000001000 R15: ffff8800351fb590
[38436.204955] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[38436.209538] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[38436.213164] CR2: 0000000000000032 CR3: 000000003570a000 CR4: 00000000000407e0
[38436.217647] Stack:
[38436.219747]  ffffffffa036b780 ffff880035ca3d80 00000000fffffffb ffff88007fd03d18
[38436.224519]  ffffffff81329c8b ffff88007fd03d08 ffff880035ca3d80 ffff880035ca3d80
[38436.229265]  0000000000001000 ffff88007fd03d68 ffffffff81331060 ffff88007fd03da8
[38436.234024] Call Trace:
[38436.236472]  <IRQ>
[38436.237381]  [<ffffffffa036b780>] ? my_endio+0x40/0x60 [dm_dedup]
[38436.242882]  [<ffffffff81329c8b>] bio_endio+0x5b/0xa0
[38436.246536]  [<ffffffff81331060>] blk_update_request+0x90/0x360
[38436.250534]  [<ffffffff814b3574>] scsi_end_request+0x34/0x1e0
[38436.254437]  [<ffffffff814b5729>] scsi_io_completion+0x119/0x6c0
[38436.258503]  [<ffffffff814aab7f>] scsi_finish_command+0xcf/0x130
[38436.262478]  [<ffffffff814b4db7>] scsi_softirq_done+0x137/0x160
[38436.266420]  [<ffffffff813381f0>] blk_done_softirq+0x90/0xc0
[38436.270264]  [<ffffffff81083ae4>] __do_softirq+0xf4/0x2d0
[38436.273974]  [<ffffffff81083fb5>] irq_exit+0x125/0x130
[38436.277582]  [<ffffffff8165814a>] do_IRQ+0x5a/0xf0
[38436.280960]  [<ffffffff81655fad>] common_interrupt+0x6d/0x6d
[38436.284638]  <EOI> 
[38436.285552]  [<ffffffff810b72ea>] ? dequeue_entity+0x11a/0x550
[38436.290776]  [<ffffffff81328aa8>] ? bio_put+0x78/0x80
[38436.294164]  [<ffffffffa01dec74>] ? vmw_deferred_io+0x204/0x2b0 [vmwgfx]
[38436.298232]  [<ffffffffa01dec7d>] ? vmw_deferred_io+0x20d/0x2b0 [vmwgfx]
[38436.302273]  [<ffffffff813d037b>] fb_deferred_io_work+0x7b/0xd0
[38436.305929]  [<ffffffff810989ad>] process_one_work+0x14d/0x400
[38436.309530]  [<ffffffff81099152>] worker_thread+0x112/0x510
[38436.312987]  [<ffffffff81099040>] ? rescuer_thread+0x3e0/0x3e0
[38436.316543]  [<ffffffff8109e548>] kthread+0xd8/0xf0
[38436.319681]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[38436.323539]  [<ffffffff81655398>] ret_from_fork+0x58/0x90
[38436.326938]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[38436.330785] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a9 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 63 
[38436.346647] RIP  [<ffffffff81187aa5>] __free_pages+0x5/0x30
[38436.350192]  RSP <ffff88007fd03cd0>
[38436.352804] CR2: 0000000000000032
[38436.357491] ---[ end trace d65575a7e7e415af ]---
[38436.360540] Kernel panic - not syncing: Fatal exception in interrupt
[38436.364471] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[38436.379714] Rebooting in 5 seconds..

sectorsize512 commented 7 years ago

Thanks. Interestingly, now 1c changed to 32. Anyway, if our current theory is that bv.page is NULL then it becomes NULL right between the if (bv.bv_page) check and __free_pages(...) call. (Because the pointer is passed to _free_pages by value).

Can we insert printks to my_endio() to print every bio that is passed to it? Specifically, let's start with printing every bio structure's in-memory address, error argument, bio->bi_flags, and bio->bi_rw. And see if we still can reproduce the bug.

Also, to always be on the same page, let's adopt the following procedure. In your github repository you create a branch for debugging this issue. The exact code for every experiment you run you put in that branch as a separate commit. Then, when you put the results here (in the issues) you refer to the commit that contained the code. This allows me to checkout the commit and look at the exact code you were running.

Thanks!

venkrishr commented 7 years ago

Sometimes the mkfs.xfs doesn't crash but the kernel crashes immediately after mkfs.xfs completes. The following are some of the crash dumps that I get:

[13894.948716] Kernel panic - not syncing: CRED: put_cred_rcu() sees ffff880034ec59c0 with usage 0
[13894.948716] 
[13894.954441] CPU: 0 PID: 18 Comm: rcuos/1 Tainted: G            E  3.19.3+ #6
[13894.958125] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[13894.964306]  0000000000000000 00000000aa006e5a ffff88007cd83d78 ffffffff8164dcad
[13894.968558]  0000000000000000 ffffffff81881dc8 ffff88007cd83df8 ffffffff816488ed
[13894.972736]  0000000000000018 ffff88007cd83e08 ffff88007cd83da8 00000000aa006e5a
[13894.977029] Call Trace:
[13894.978830]  [<ffffffff8164dcad>] dump_stack+0x45/0x57
[13894.981800]  [<ffffffff816488ed>] panic+0xd0/0x204
[13894.984633]  [<ffffffff810a0362>] put_cred_rcu+0xc2/0xd0
[13894.987750]  [<ffffffff810df37f>] rcu_nocb_kthread+0x1af/0x5b0
[13894.991059]  [<ffffffff810c1eb0>] ? prepare_to_wait_event+0xf0/0xf0
[13894.994629]  [<ffffffff810df1d0>] ? rcu_spawn_all_nocb_kthreads.part.47+0x160/0x160
[13894.998832]  [<ffffffff8109e548>] kthread+0xd8/0xf0
[13895.001820]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[13895.005537]  [<ffffffff81655398>] ret_from_fork+0x58/0x90
[13895.008804]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[13895.012678] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[13895.021220] Rebooting in 5 seconds..

[  646.585622] BUG: unable to handle kernel paging request at 00000000002396bc
[  646.589418] IP: [<ffffffff810a0295>] override_creds+0x25/0x30
[  646.592246] PGD 0 
[  646.593247] Oops: 0000 [#1] SMP 
[  646.594823] Modules linked in: dm_dedup(E) dm_persistent_data dm_bufio libcrc32c dm_mod coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw ppdev gf128mul glue_helper ablk_helper cryptd vmw_balloon serio_raw parport_pc sg vmw_vmci shpchp parport i2c_piix4 pata_acpi ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic vmwgfx drm_kms_helper ttm drm ata_piix libata mptspi scsi_transport_spi crc32c_intel mptscsih e1000 mptbase i2c_core floppy
[  646.616179] CPU: 0 PID: 32 Comm: rcuos/3 Tainted: G            E  3.19.3+ #6
[  646.619406] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[  646.623969] task: ffff88007cdd5e10 ti: ffff88007ce18000 task.ti: ffff88007ce18000
[  646.627249] RIP: 0010:[<ffffffff810a0295>]  [<ffffffff810a0295>] override_creds+0x25/0x30
[  646.630944] RSP: 0018:ffff88007ce1be08  EFLAGS: 00010202
[  646.633280] RAX: 00000000002396bc RBX: ffff880035ef98d0 RCX: dead000000200200
[  646.636405] RDX: ffffffff810a02a0 RSI: 0000000000000297 RDI: ffff880035ef98d0
[  646.639500] RBP: ffff88007ce1be18 R08: ffff88007ce1be70 R09: 0000000000000000
[  646.642584] R10: ffffffff819cccc0 R11: 0000000000000400 R12: ffff880035ef9840
[  646.645700] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007cdd5e10
[  646.648794] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  646.652325] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  646.654909] CR2: 00000000002396bc CR3: 0000000035e75000 CR4: 00000000000407f0
[  646.658275] Stack:
[  646.659220]  ffff880035ef98d0 ffff88007fd94e00 ffff88007ce1beb8 ffffffff810df37f
[  646.662700]  ffff88007cdd5e10 ffff88007ce1be70 000000007ce1bea8 ffff880035ef98d0
[  646.666218]  ffff88007cdd5e10 00000000000140c0 ffff880000000000 ffff88007cdd5e10
[  646.669843] Call Trace:
[  646.672498]  [<ffffffff810df37f>] rcu_nocb_kthread+0x1af/0x5b0
[  646.676515]  [<ffffffff810c1eb0>] ? prepare_to_wait_event+0xf0/0xf0
[  646.680711]  [<ffffffff810df1d0>] ? rcu_spawn_all_nocb_kthreads.part.47+0x160/0x160
[  646.685560]  [<ffffffff8109e548>] kthread+0xd8/0xf0
[  646.689190]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[  646.693556]  [<ffffffff81655398>] ret_from_fork+0x58/0x90
[  646.697404]  [<ffffffff8109e470>] ? kthread_create_on_node+0x1b0/0x1b0
[  646.701674] Code: 00 00 00 0f 1f 00 66 66 66 66 90 55 8b 0f 65 48 8b 14 25 c0 b8 00 00 48 89 e5 48 8b 82 d0 08 00 00 f0 ff 07 48 89 ba d0 08 00 00 <8b> 10 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 
[  646.718643] RIP  [<ffffffff810a0295>] override_creds+0x25/0x30
[  646.722700]  RSP <ffff88007ce1be08>
[  646.726042] CR2: 00000000002396bc
[  646.731227] ---[ end trace 2c02424ff36d6440 ]---
[  646.734709] Kernel panic - not syncing: Fatal exception in interrupt
[  646.738968] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  646.746156] Rebooting in 5 seconds..

The corresponding commit:

https://github.com/krishven/dmdedup/commit/cc69ddc2f9276ab2aa9fde7acab0d1ff9d7b315a

venkrishr commented 7 years ago

I have a question:

We don't do bio_get() for our clone anywhere explicitly but call bio_put() explicitly when there is an error or during my_endio().

I checked bio_kmalloc() and bio_add_page() that are called in create_bio(). There is no bio_get() or manual increment of bio->__bi_cnt anywhere.

Is what we are doing the correct approach?

venkrishr commented 7 years ago

Just a clarification.

https://github.com/dmdedup/dmdedup/blob/master/dm-dedup-target.c#L334

The bi_size is checked in the above condition is used to branch into the dm-dedup-rw.c code. The bi_size refers to the total size of the bio i.e a sum total of all the lengths of pages in the bio_vecs.

So mkfs.xfs generates a lot of bios with size lesser than 4096. That's the reason dm-dedup-rw.c code gets called.

I stuck printks to see how the data in these bios are distributed. As far as I've seen, usually they all are present in just a single bio_vec.

sectorsize512 commented 7 years ago

Fixed with 2b2db5fd9cc296520f93eb166b58820f6349cc40

dmdedup / dmdedup3.19

free_pages NULL pointer dereference during mkfs.xfs #18