Closed by aparcar 6 years ago
valdi74:
I can confirm this bug on a Xiaomi Mi Router 3G with a USB SATA HDD. Large downloaded files (10 GB) are sometimes broken (20-30% of files): the md5sums don't match. There was no log entry when the error occurred. Tested with:
neheb:
Let's see. Not a SATA issue. Not a pcie issue (USB is not connected through pcie). Sounds like a bug introduced in the port to 4.9. Maybe a CPU issue?
HeadLessHUN:
Hi there!
I also hit this bug on a Xiaomi Mi Router 3G, with several different HDDs using the ext4 filesystem.
I'm running OpenWrt SNAPSHOT r5629-23bba9c, which ships with the 4.9.72 kernel.
I haven't seen any relevant kernel log entries, except when mysql tries to access a block and can't read it:
[28372.317828] EXT4-fs warning: 10 callbacks suppressed
[28372.317846] EXT4-fs warning (device sdb3): htree_dirblock_to_tree:962: inode #872: lblock 0: comm mysqld: error -5 reading directory block
[28372.341743] EXT4-fs warning (device sdb3): htree_dirblock_to_tree:962: inode #872: lblock 0: comm mysqld: error -5 reading directory block
[28372.365581] EXT4-fs warning (device sdb3): dx_probe:742: inode #4312: lblock 0: comm mysqld: error -5 reading directory block
It is a very annoying bug; I hope it will be fixed ASAP.
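For reference, error -5 in the warnings above is EIO (I/O error). A minimal sketch for tallying such warnings from a saved dmesg dump follows. The regex is modelled only on the three lines quoted above, so other EXT4 message formats are not covered, and the function name is made up for illustration.

```python
import errno
import re
from collections import Counter

# Pattern modelled on the EXT4 warning lines quoted in this thread;
# it is an assumption that other warnings follow the same layout.
EXT4_WARNING = re.compile(
    r"EXT4-fs warning \(device (?P<device>\w+)\): "
    r"(?P<func>\w+):(?P<line>\d+): inode #(?P<inode>\d+): "
    r"lblock (?P<lblock>\d+): comm (?P<comm>\S+): "
    r"error (?P<error>-?\d+)"
)


def summarize_ext4_errors(log_text):
    """Count EXT4 warnings per (device, errno name) in a dmesg dump."""
    counts = Counter()
    for match in EXT4_WARNING.finditer(log_text):
        code = abs(int(match.group("error")))
        # Map the numeric code to its symbolic name, e.g. 5 -> "EIO".
        name = errno.errorcode.get(code, str(code))
        counts[(match.group("device"), name)] += 1
    return counts
```

Feeding it the output of `dmesg` saved to a file gives one counter entry per device and error type, which makes it easier to compare runs across firmware builds.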
neheb:
Kernel 4.14 should be coming soon. Hopefully it fixes this issue. For all I know, the kernel config could be the issue. Testing is needed...
neheb:
Can you guys test http://lists.infradead.org/pipermail/lede-dev/2018-January/010795.html ?
HeadLessHUN:
I'll try it out, but it shouldn't have any impact because it was added to the generic config in June: [[https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=b47fd7656336162360ebf66147326763ddae3f8d;hp=415c47de79ada7496c39f435df0b0523472aee58|External Link]]. Did you change anything else on the master branch?
neheb:
Yeah I did a diff between config-4.4 and config-4.9 and removed newly introduced CONFIGs. It worked. I have firmware on 4.9 that does not show this issue. Unfortunately, I lost the exact config.
I'm currently testing a new one but unfortunately, this testing of bad kernels destroyed my btrfs array. Now I need to rebuild it...
diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index 0ea6798..b3c8afc 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -67,7 +67,6 @@ CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_IO=y
CONFIG_GENERIC_IRQ_CHIP=y
-CONFIG_GENERIC_IRQ_IPI=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_SCHED_CLOCK=y
@@ -105,7 +104,6 @@ CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_HAVE_IDE=y
-CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
@@ -127,7 +125,6 @@ CONFIG_I2C_MT7621=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_IRQCHIP=y
CONFIG_IRQ_DOMAIN=y
-CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_IRQ_MIPS_CPU=y
CONFIG_IRQ_WORK=y
and yes, I attributed the error to the wrong CONFIG.
HeadLessHUN:
I commented out these lines
CONFIG_GENERIC_IRQ_IPI=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
and inserted this line into target/linux/ramips/mt7621/config-4.9:
CONFIG_SCHED_HRTICK=y
I built it and the problem was not solved. There is still a lot of corruption within a few hours of uptime.
neheb:
I removed a bunch of CONFIG settings from config-4.9, but after inspecting the actual generated .config file in the build directory, there's no difference. So it seems this is a dead end...
In other news, I seem not to have these issues anymore. I don't know why. The only answer I have is that it was fixed upstream. I can't see what would have done that though... I have working firmware from 4.9.75. I need to do more testing, but this seems to be gone.
Even if placebo, try this patch. It may work, may not...
diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index f9765ed..37c2e19 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -12,7 +12,6 @@ CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_ARCH_WANT_IPC_PARSE_VERSION=y
-CONFIG_BLK_MQ_PCI=y
CONFIG_BOARD_SCACHE=y
CONFIG_BOUNCE=y
CONFIG_CEVT_R4K=y
@@ -28,7 +27,6 @@ CONFIG_CMDLINE_BOOL=y
CONFIG_COMMON_CLK=y
CONFIG_CPU_GENERIC_DUMP_TLB=y
CONFIG_CPU_HAS_PREFETCH=y
-CONFIG_CPU_HAS_RIXI=y
CONFIG_CPU_HAS_SYNC=y
CONFIG_CPU_LITTLE_ENDIAN=y
CONFIG_CPU_MIPS32=y
@@ -45,14 +43,8 @@ CONFIG_CPU_SUPPORTS_32BIT_KERNEL=y
CONFIG_CPU_SUPPORTS_HIGHMEM=y
CONFIG_CPU_SUPPORTS_MSA=y
CONFIG_CRC16=y
-CONFIG_CRYPTO_AEAD=y
-CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_DEFLATE=y
-CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_LZO=y
-CONFIG_CRYPTO_MANAGER=y
-CONFIG_CRYPTO_MANAGER2=y
-CONFIG_CRYPTO_NULL2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CSRC_R4K=y
@@ -61,13 +53,11 @@ CONFIG_DMA_NONCOHERENT=y
CONFIG_DTB_RT_NONE=y
CONFIG_DTC=y
CONFIG_EARLY_PRINTK=y
-CONFIG_FIXED_PHY=y
CONFIG_GENERIC_ATOMIC64=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_IO=y
CONFIG_GENERIC_IRQ_CHIP=y
-CONFIG_GENERIC_IRQ_IPI=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_SCHED_CLOCK=y
@@ -77,7 +67,6 @@ CONFIG_GPIOLIB=y
CONFIG_GPIO_MT7621=y
CONFIG_GPIO_SYSFS=y
-CONFIG_HANDLE_DOMAIN_IRQ=y
CONFIG_HARDWARE_WATCHPOINTS=y
CONFIG_HAS_DMA=y
CONFIG_HAS_IOMEM=y
@@ -89,7 +78,6 @@ CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
-CONFIG_HAVE_CBPF_JIT=y
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
@@ -105,7 +93,6 @@ CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_HAVE_IDE=y
-CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
@@ -115,19 +102,16 @@ CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_HAVE_NET_DSA=y
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_PERF_EVENTS=y
-CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HIGHMEM=y
CONFIG_HW_HAS_PCI=y
CONFIG_HZ_PERIODIC=y
CONFIG_I2C=y
-CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_MT7621=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_IRQCHIP=y
CONFIG_IRQ_DOMAIN=y
-CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_IRQ_MIPS_CPU=y
CONFIG_IRQ_WORK=y
@@ -136,8 +120,6 @@ CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_MDIO_BOARDINFO=y
CONFIG_MIPS=y
-CONFIG_MIPS_ASID_BITS=8
-CONFIG_MIPS_ASID_SHIFT=0
CONFIG_MIPS_CLOCK_VSYSCALL=y
CONFIG_MIPS_CM=y
@@ -204,11 +186,9 @@ CONFIG_OF_MDIO=y
CONFIG_OF_NET=y
CONFIG_OF_PCI=y
CONFIG_OF_PCI_IRQ=y
-CONFIG_PADATA=y
CONFIG_PCI=y
CONFIG_PCI_DISABLE_COMMON_QUIRKS=y
CONFIG_PCI_DOMAINS=y
-CONFIG_PCI_DRIVERS_LEGACY=y
CONFIG_PERF_USE_VMALLOC=y
CONFIG_PGTABLE_LEVELS=2
CONFIG_PHYLIB=y
@@ -223,16 +203,11 @@ CONFIG_RALINK=y
CONFIG_RATIONAL=y
CONFIG_RCU_STALL_COMMON=y
-CONFIG_REGMAP=y
-CONFIG_REGMAP_I2C=y
-CONFIG_REGMAP_SPI=y
CONFIG_RESET_CONTROLLER=y
CONFIG_RFS_ACCEL=y
CONFIG_RPS=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_DRV_PCF8563=y
-CONFIG_RTC_I2C_AND_SPI=y
-CONFIG_RTC_MC146818_LIB=y
CONFIG_SCHED_SMT=y
@@ -254,7 +229,6 @@ CONFIG_SPI_MT7621=y
CONFIG_SRCU=y
CONFIG_SWCONFIG_LEDS=y
CONFIG_SWCONFIG=y
-CONFIG_SWPHY=y
CONFIG_SYNC_R4K=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_SYS_HAS_CPU_MIPS32_R1=y
HeadLessHUN:
I'm on 4.9.77 r5917-36f1978 and the issue is still there...
Aren't the removed config options needed by anything? OpenVPN, for example, needs crypto support.
neheb:
Like I said, this is no-op as all of those options end up in the resulting kernel .config anyway. But I tried it on one of my builds and it seems to have worked? If something breaks you'll instantly know.
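One way to check the no-op claim above is to compare the generated kernel .config from two builds option by option. The following is a minimal sketch, not part of the OpenWrt build system: the function names are hypothetical, and it assumes that an option missing from a file is equivalent to "is not set".

```python
import re

SET_LINE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_LINE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Parse kernel .config text into {option: value}; unset options map to "n"."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_LINE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_kconfig(old, new):
    """Return {option: (old_value, new_value)} for every option that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        # Assumption: an option absent from a file is treated as unset.
        before = old.get(name, "n")
        after = new.get(name, "n")
        if before != after:
            changed[name] = (before, after)
    return changed
```

Reading both .config files with `open(path).read()`, passing them through `parse_kconfig`, and calling `diff_kconfig` on the results should return an empty dict if removing lines from config-4.9 really changes nothing in the final kernel configuration.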
neheb:
I gave up. What I did was probably placebo. Just gonna keep ramips at 4.4 in my tree.
Hoping 4.14 (which should come soon) fixes it but I wouldn't hold my breath. If you can, run a ramips unit for several days and compare "md5sum /dev/mtdblock[0123456] " to see if they change. I bet they do. Unfortunately, I don't think anyone cares even though this is a potentially huge issue.
HeadLessHUN:
I'll run it for several days, save the md5 of every mtdblock, and share the results with you. The priority of this bug should be raised, though...
Some checksums will change anyway, for example because of the overlayfs, so it should be tested on drives whose contents don't change...
HeadLessHUN:
No, it's still getting corrupted (my HDDs, I mean). Is it possible to build a snapshot image with the 4.4 kernel? Or is my only option to backport the device to LEDE 17.01-stable?
neheb:
I'm using 4.4 with trunk. Just copy patches-4.4 and config-4.4 from 17.01 and change the Makefile to use 4.4.
neheb:
@HeadLessHUN a little birdie told me that disabling CONFIG_HIGHMEM fixes this. Could be good to try out.
diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index f9765ed..7732443 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -118,7 +118,7 @@ CONFIG_HAVE_PERF_EVENTS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
-CONFIG_HIGHMEM=y
+# CONFIG_HIGHMEM is not set
CONFIG_HW_HAS_PCI=y
CONFIG_HZ_PERIODIC=y
CONFIG_I2C=y
easyteacher:
@neheb Does disabling CONFIG_HIGHMEM really work? Have you tested it?
I found a new config introduced in kernel 4.5
[[https://cateee.net/lkddb/web-lkddb/IO_STRICT_DEVMEM.html|CONFIG_IO_STRICT_DEVMEM: Filter I/O access to /dev/mem]]
And will enabling CONFIG_DM_VERITY help?
neheb:
No idea. I've tried it on the 4.4 kernel and it seems to work well. I'm using it for the sd card though (the mmc driver breaks when using the HighMem zone). Could also help here since the issue for me happens after 15+ hours. Maybe when something else tries using the HighMem zone.
I don't think those two options have any impact.
easyteacher:
[[https://events.static.linuxfound.org/sites/events/files/slides/Shuah_Khan_dma_map_error.pdf|Detecting silent data corruptions and memory leaks using DMA Debug API]]
I found a document possibly related to the bug. To debug, set CONFIG_DMA_API_DEBUG=y. Currently I have no idea how to use it.
neheb:
It seems drivers must be manually modified to use it.
valdi74:
Maybe [[https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=79126770868995faa8656f6687a88d385802e34b|this]] is the solution to our problem?
neheb:
Yes.
neheb:
Supply the following if possible:
Basically, with kernel 4.9 there's some weird issue where after several hours (around 18), the SATA controller starts returning bad data. On 4.4, this is not a problem.
I've avoided reporting this problem to kernel.org since ramips is quite LEDE specific. Could be a pcie issue for all I know.
The data on the actual hard drive is fine. It's just bad data that's being returned. Maybe bit errors or something.
The way I test this is by using transmission with its Verify feature. Last I tested with adm + ext4, a torrent that verified at 100% verified at 91% 3 days later.
btrfs is more vocal since it reports silent data corruption and throws checksum mismatch errors in dmesg quite frequently after a few hours.
I currently work around the issue by running kernel 4.4, but this is not a long term solution.
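The Transmission Verify test described above can be approximated without a torrent client. The sketch below hashes a file in fixed-size pieces and later reports what fraction of pieces still match, similar in spirit to the "100% then 91%" observation. The piece size and function names are arbitrary choices for illustration, not taken from Transmission.

```python
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # arbitrary piece size chosen for this sketch


def piece_hashes(path, piece_size=PIECE_SIZE):
    """Return the SHA-1 hex digest of each fixed-size piece of the file."""
    hashes = []
    with open(path, "rb") as handle:
        for piece in iter(lambda: handle.read(piece_size), b""):
            hashes.append(hashlib.sha1(piece).hexdigest())
    return hashes


def verified_fraction(baseline, current):
    """Fraction of baseline pieces whose hash is unchanged in the current read."""
    if not baseline:
        return 1.0
    good = sum(1 for a, b in zip(baseline, current) if a == b)
    return good / len(baseline)
```

Record `piece_hashes(path)` once on a known-good kernel, then recompute it after the device has been up for many hours and compare with `verified_fraction`. Repeated reads may be served from the page cache, so a file larger than RAM, or dropping caches between passes, gives a more honest result.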