akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0
120 stars 18 forks source link

attempt to access beyond end of device #197

Closed jvinolas closed 5 years ago

jvinolas commented 5 years ago

Hi,

Suddenly our NAS had a problem and we reboot it. After that, we tried to bring up writeboost again an this is the output in dmesg. Raids and disks seem to be ok. Any thoughts on what is happening? It is a hardware issue? Could we try to create the cache again (and will we lose the data?)?

Thanks.

[ 813.419918] attempt to access beyond end of device [ 813.419921] unknown-block(9,1): rw=2048, want=34783398920, limit=468599936 [ 813.419924] device-mapper: writeboost: read_segment_header() I/O error(-5), bits(18446744073709551615), dev(9:1), sector(34783398912), read [ 813.419992] device-mapper: writeboost: find_max_id failed [ 813.420050] device-mapper: writeboost: replay_log_on_cache failed [ 813.420107] device-mapper: writeboost: recover_cache failed [ 813.457709] device-mapper: table: 253:0: writeboost: resume_cache failed [ 813.457767] device-mapper: ioctl: error adding target to table [ 813.464278] kauditd_printk_skb: 10 callbacks suppressed [ 813.464279] audit: type=1130 audit(1539193677.061:150): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=writeboost comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 843.500560] audit: type=1131 audit(1539193707.097:151): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=writeboost comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 900.560494] audit: type=1130 audit(1539193764.157:152): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 900.560498] audit: type=1131 audit(1539193764.157:153): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 923.437283] BUG: unable to handle kernel paging request at ffffab74ae12d058 [ 923.437352] IP: ht_register+0x14/0x60 [dm_writeboost] [ 923.437410] PGD 48e124067 P4D 48e124067 PUD 0 [ 923.437469] Oops: 0002 [#1] SMP [ 923.437526] Modules linked in: dm_writeboost(OE) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl raid456 x86_pkg_temp_thermal intel_powerclamp async_raid6_recov async_memcpy async_pq coretemp async_xor async_tx xor kvm_intel kvm irqbypass raid6_pq crct10dif_pclmul crc32_pclmul libcrc32c crc32c_intel ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf wmi_bmof snd_hda_intel e1000e wdat_wdt raid1 snd_hda_codec ixgbe hci_uart snd_hda_core ptp btbcm i2c_i801 snd_hwdep btqca snd_pcm pps_core btintel mei_me mdio dca mei snd_timer bluetooth snd soundcore shpchp ecdh_generic tpm_infineon intel_lpss_acpi rfkill tpm_tis acpi_als intel_lpss tpm_tis_core kfifo_buf tpm industrialio acpi_pad sunrpc nouveau mxm_wmi i2c_algo_bit drm_kms_helper ttm drm nvme nvme_core wmi video [ 923.439679] i2c_hid [ 923.439983] CPU: 0 PID: 1216 Comm: dmsetup Tainted: G OE 4.14.5-300.fc27.x86_64 #1 [ 923.440537] Hardware name: Gigabyte Technology Co., Ltd. X299 AORUS Gaming 3/X299 AORUS Gaming 3-CF, BIOS F4 06/04/2017 [ 923.441094] task: ffff8c3c38d00000 task.stack: ffffab7081e80000 [ 923.441401] RIP: 0010:ht_register+0x14/0x60 [dm_writeboost] [ 923.441706] RSP: 0018:ffffab7081e83a20 EFLAGS: 00010282 [ 923.442011] RAX: ffffab70add57920 RBX: ffffab7139e367a0 RCX: ffffab7081e83a58 [ 923.442317] RDX: ffffab70ae12d020 RSI: ffffab7139e367a0 RDI: ffffab70ae12d030 [ 923.442623] RBP: ffffab7081e83ae8 R08: ffffab74ae12d058 R09: 0000000044512817 [ 923.442930] R10: 000000000b0d489c R11: ffffab7088d29000 R12: ffff8c3b8b5c1800 [ 923.443235] R13: ffffab70ae12d020 R14: 0000000000000008 R15: 000000000000002f [ 923.443543] FS: 00007f35a58c9840(0000) GS:ffff8c3c5ec00000(0000) knlGS:0000000000000000 [ 923.444097] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 923.444402] CR2: ffffab74ae12d058 CR3: 0000000478d52001 CR4: 00000000003606f0 [ 923.444709] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 923.445015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 923.445322] Call Trace: [ 923.445626] ? do_apply_valid_segments+0x398/0x4b0 [dm_writeboost] [ 923.445935] resume_cache+0x95f/0xdd0 [dm_writeboost] [ 923.446242] writeboost_ctr+0x452/0x7a0 [dm_writeboost] [ 923.446549] dm_table_add_target+0x187/0x370 [ 923.446854] table_load+0x12a/0x350 [ 923.447158] ? retrieve_status+0x1b0/0x1b0 [ 923.447462] ctl_ioctl+0x1ed/0x570 [ 923.447767] dm_ctl_ioctl+0xe/0x20 [ 923.448072] do_vfs_ioctl+0xa5/0x600 [ 923.448377] ? __audit_syscall_entry+0xbf/0x110 [ 923.448683] ? syscall_trace_enter+0x1dd/0x2d0 [ 923.448988] SyS_ioctl+0x79/0x90 [ 923.449292] do_syscall_64+0x67/0x180 [ 923.449597] entry_SYSCALL64_slow_path+0x25/0x25 [ 923.449901] RIP: 0033:0x7f35a5185877 [ 923.450204] RSP: 002b:00007fff73a6f818 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 923.450759] RAX: ffffffffffffffda RBX: 00007f35a5468ec0 RCX: 00007f35a5185877 [ 923.451066] RDX: 000055cceed35bc0 RSI: 00000000c138fd09 RDI: 0000000000000003 [ 923.451372] RBP: 0000000000000000 R08: 00007f35a54a41c8 R09: 00007fff73a6f680 [ 923.451678] R10: 00007f35a54a3683 R11: 0000000000000246 R12: 0000000000000000 [ 923.451984] R13: 000055cceed35bf0 R14: 000055cceed35bc0 R15: 000055cceed359b0 [ 923.452291] Code: 10 74 04 48 89 4a 08 48 89 08 48 89 46 18 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 42 10 4c 8b 42 18 48 8d 7a 10 48 85 c0 <49> 89 00 74 04 4c 89 40 08 48 b8 00 01 00 00 00 00 ad de 48 89 [ 923.453138] RIP: ht_register+0x14/0x60 [dm_writeboost] RSP: ffffab7081e83a20 [ 923.453444] CR2: ffffab74ae12d058 [ 923.453747] ---[ end trace 888ff79e45c2b81b ]---

We did another reboot and this showed in dmesg when trying to start writeboost:

[ 281.947382] attempt to access beyond end of device [ 281.947385] unknown-block(9,1): rw=2048, want=137872120840, limit=468599936 [ 281.947389] device-mapper: writeboost: read_segment_header() I/O error(-5), bits(18446744073709551615), dev(9:1), sector(137872120832), read [ 281.947458] device-mapper: writeboost: find_max_id failed [ 281.947516] device-mapper: writeboost: replay_log_on_cache failed [ 281.947574] device-mapper: writeboost: recover_cache failed [ 281.984883] device-mapper: table: 253:0: writeboost: resume_cache failed [ 281.984941] device-mapper: ioctl: error adding target to table

akiradeveloper commented 5 years ago

Suddenly our NAS had a problem

What was the problem? What was the dmesg? Since the dmesg shared implies accessing rubbish data. I am afraid that the system (CPU or memory) is broken. Please check them first.

Raids and disks seem to be ok.

How did you test this? Please describe the system configuration. Especially, how big the cache device is? See lsblk. Kernel version, Writeboost version, ... everything is missing to understand the situation.

If you want to independently check if the data on cache device is not broken, you can use wbcheck in https://github.com/akiradeveloper/dm-writeboost-tools. Loop over the segments and see every results return success.

jvinolas commented 5 years ago

We did a throughfull test on all hardware and DDR4 memory failed. We replaced memory and writeboost was able to create wbdev again with all the data. That's great, thanks! img-20181011-wa0005

akiradeveloper commented 5 years ago

Nice. I am relieved. Will close.