Open jsarenik opened 5 years ago
@jsarenik I have only built on arm32, but never tried it myself. BTCPayServer does not support clightning on arm32 yet, because we need lightning charge and lightning spark to support it as well. (This will be the case in the next release.)
This test does not fail on `aarch64` (ARM64) Alpine Linux (musl libc).
$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 1 msec (1576306 nanoseconds per route)
Length 5: 1
$ echo $?
0
$ uname -a
Linux linaro-developer 4.14.0-qcomlt-arm64 #1 SMP PREEMPT Wed Jan 30 04:14:16 UTC 2019 aarch64 Linux
The commit itself is large, and it is hard to determine which part introduced the issue. Is it possible to run it in `gdb` and get a backtrace?
Sure, will do.
First things first: I was able to reproduce the issue on 32-bit ARM running musl libc as well. I did the gdb debugging on this Alpine Linux because it has no issue with debugging symbols, unlike Ubuntu (which hardwires `/lib/ld-linux-armhf.so.3` into binaries at compile time, though this file is a symlink to `arm-linux-gnueabihf/ld-2.29.so`, and the debugging symbols are of course in `/usr/lib/debug/lib/arm-linux-gnueabihf/ld-2.29.so`, which gdb does not find, even though I have tried some magic).
So, here we go:
localhost:~/lightning-auto-test/lightning# uname -a
Linux localhost 3.4.0-lineageos-gb263a89 #1 SMP PREEMPT Wed Oct 24 09:09:32 UTC 2018 armv7l Linux
localhost:~/lightning-auto-test/lightning# git rev-parse --short HEAD
0ae20399
localhost:~/lightning-auto-test/lightning# ldd gossipd/test/run-bench-find_route
/lib/ld-musl-armhf.so.1 (0xb6f46000)
libc.musl-armhf.so.1 => /lib/ld-musl-armhf.so.1 (0xb6f46000)
localhost:~/lightning-auto-test/lightning# gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Bus error
localhost:~/lightning-auto-test/lightning# echo $?
135
localhost:~/lightning-auto-test/lightning# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)
at ccan/ccan/crypto/siphash24/siphash24.c:86
86 add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) bt
#0 0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)
at ccan/ccan/crypto/siphash24/siphash24.c:86
#1 0x2a015610 in siphash24_update (ctx=0xbefffaa0, p=0x2a0c5045, size=33)
at ccan/ccan/crypto/siphash24/siphash24.c:116
#2 0x2a015ee8 in siphash24 (seed=0x2a0be228 <siphashseed>, p=0x2a0c5045,
size=33) at ccan/ccan/crypto/siphash24/siphash24.c:169
#3 0x2a032eac in node_map_hash_key (pc=0x2a0c5045)
at gossipd/test/../routing.c:214
#4 0x2a031f8c in node_map_get (ht=0x2a0c4904, k=0x2a0c5045)
at gossipd/test/../routing.h:130
#5 0x2a032fa8 in get_node (rstate=0x2a0c4804, id=0x2a0c5045)
at gossipd/test/../routing.c:241
#6 0x2a0336d4 in new_chan (rstate=0x2a0c4804, scid=0xbefffb80,
id1=0x2a0c5045, id2=0x2a0c5024, satoshis=...)
at gossipd/test/../routing.c:413
#7 0x2a03bfac in add_connection (rstate=0x2a0c4804, nodes=0x2a0c5024, from=1,
to=0, base_fee=436, proportional_fee=944, delay=113)
at gossipd/test/run-bench-find_route.c:119
#8 0x2a03c228 in populate_random_node (rstate=0x2a0c4804, nodes=0x2a0c5024,
n=1) at gossipd/test/run-bench-find_route.c:158
#9 0x2a03c638 in main (argc=1, argv=0xbefffd54)
at gossipd/test/run-bench-find_route.c:226
(gdb) c
Continuing.
Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
localhost:~/lightning-auto-test/lightning#
All this is on current master (0ae20399).
More thorough debug in the attachment. The file was created by running gdb -batch -n -ex 'set pagination off' -ex 'set logging on' -ex 'echo >> Running the program...\n' -ex run -ex 'echo >> bt\n' -ex bt -ex 'echo >> bt full\n' -ex 'bt full' -ex 'echo >> thread apply all bt full\n' -ex 'thread apply all bt full' -ex 'echo >> c' -ex c --args gossipd/test/run-bench-find_route
Could it be just caused by the funny setup I use (i.e. running chroots on top of Android)?
Can you do `disp data` at the crash point? It might be a "bus error" due to an alignment problem: the device you are running on might not be able to access a `u64` at an address that is not a multiple of 4 or 8. The `p=0x2a0c5045` means the input address is not aligned, so it might be a misaligned access that the CPU does not support.
https://en.wikipedia.org/wiki/Bus_error#Unaligned_access
Do you know the exact chipset you are running on?
As for the chipset, I hope this helps; if not, please give me a hint about what to run.
# cat /proc/cpuinfo
Processor : ARMv7 Processor rev 1 (v7l)
processor : 0
BogoMIPS : 38.40
processor : 1
BogoMIPS : 38.40
processor : 2
BogoMIPS : 38.40
processor : 3
BogoMIPS : 38.40
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x06f
CPU revision : 1
Hardware : Qualcomm MSM8974PRO-AC
Revision : 0000
Serial : 0000000000000000
Some more hardware hints from the host system shell:
cancro:/ # cat /system/build.prop | grep -i MI
ro.product.model=MI Cancro
ro.product.brand=Xiaomi
ro.product.manufacturer=Xiaomi
ro.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys
# from device/xiaomi/cancro/system.prop
rild.libpath=/vendor/lib/libril-qc-qmi-1.so
mm.enable.smoothstreaming=true
ro.fm.transmitter=false
persist.data.qmi.adb_logmask=0
persist.demo.hdmirotationlock=false
ro.hdmi.enable=true
ro.com.google.clientidbase=android-xiaomi
dalvik.vm.heapgrowthlimit=192m
dalvik.vm.heapminfree=2m
ro.bootimage.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys
@ZmnSCPxj `disp data` in an interactive `gdb` session:
# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffab0, p=0x2a0c5045, len=25)
at ccan/ccan/crypto/siphash24/siphash24.c:86
86 add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) disp data
1: data = (const unsigned char *) 0x2a0c504d "1\362$\036\335|\035֏0 \263\004\030\064\205\341\351\374\070}K\261\224K\003\017\rcIt|Ŷ\005\230k\342\237\371\206\375\342\364o\030<ۭ\023\222\313\036\251\350\377t\a\003\001\263\263y\220\060,D|r\003\323\342db(\374\201\255\341\366\233\020\201<\223-\305\357\202\061n\003\322\320%\200\200\340K\234\257V\227\371܄\034\364\330J\370\n\303\345\267-\365\363h\210\311Ti/\003\315\350m\363\370\270ʬ\340VC\333K\307\001\177\321\363/;\002\243uE\206\067֛\232\024[\244\003\360\001\232\243+r\241\341≢)\321$^]$ \363\214ۿu\366\224\353\201\376\260\372\264\260\002\347\374\330\b\312\071V", <incomplete sequence \334>...
(gdb) c
Continuing.
Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
#
OK, it might be the chip. I have verified that it works well on i.MX6 (which is also 32-bit):
me@mail:~/lightning-auto-test/lightning$ git rev-parse --short HEAD
0ae20399
me@mail:~/lightning-auto-test/lightning$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 3 msec (3197234 nanoseconds per route)
Length 8: 1
me@mail:~/lightning-auto-test/lightning$ cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 3.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
Hardware : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision : 61013
Serial : 0000000000000000
me@mail:~/lightning-auto-test/lightning$ uname -a
Linux mail 4.9.150-imx6-sr #1 SMP Sun Jun 9 06:05:39 UTC 2019 armv7l GNU/Linux
Closing it for now. In case someone else faces the same issue, they can add comments, but now I do not think that this is a general issue.
@ZmnSCPxj maybe add a label like `wontfix` or `hw-issue`?
Yes, but ccan "should" work even on CPUs that bus error on unaligned access; that is the intent of ccan. What do you think, @rustyrussell? Or move this to https://github.com/rustyrussell/ccan/ ?
@ZmnSCPxj any idea how I can reproduce straight on ccan?
I will try to run `make check` on ccan...
Not sure. You might need a boutique test on ccan that specifically performs `siphash` on an array of `char`, with the important tweak that you specifically pass in a misaligned pointer, e.g. you have:
char buffer[1000];
(void) siphash24(&buffer[1], sizeof(buffer) - 1);
Or maybe `malloc` it, since a `char` array might be placed by the compiler at an unaligned address, in which case `&buffer[1]` might accidentally realign. You would have to probe with gdb, set a breakpoint on the `siphash24` function, and look at the actual pointer. However, `malloc` is assured to return aligned addresses, so deliberately offsetting a pointer returned by `malloc` reliably gives you a misaligned pointer.
> Yes, but ccan "should" work even on CPUs that bus error on unaligned access; that is the intent of ccan. What do you think, @rustyrussell? Or move this to https://github.com/rustyrussell/ccan/ ?
I have made https://github.com/rustyrussell/ccan/issues/84
Hi @ZmnSCPxj! Please have a look at https://github.com/jsarenik/siphash24-repro and see if it makes sense. After compilation it currently ends with `Segmentation fault` on the CPU which has the alignment issue, but ends successfully on i.MX6. In the meantime I spoke to another person who noticed this alignment issue on Qualcomm chips years ago, and he says it has something to do with the fact that it is Krait.
I think that the following issue may also be related: https://github.com/tensorflow/tensorflow/issues/19158
This should be reopened until ccan has merged your fix and clightning has been updated.
OK. Reopening. Thanks for the feedback, @NicolasDorier!
Just a ping. The bug is still present in current master (ede5f5be3cc6544bdef39db51b8a39f1821bfccc).
@ZmnSCPxj please have a look at whether the code in https://github.com/jsarenik/siphash24-repro makes any sense.
Just an update. I do not have this hardware anymore. It died at the beginning of this year. But I am not closing the issue (I tried that in the past :)
https://github.com/ElementsProject/lightning/issues/2818#issuecomment-521938432
Issue and Steps to Reproduce
On `armv7l`, up-to-date Ubuntu 19.04, I get the following error (both when `DEVELOPER` equals `1` and `0`) on versions starting at a2fa699 up to current master:
`git bisect` led me to this:
First I created a bisect script, then identified that this issue is not present in `v0.7.0` but is present in `v0.7.1`. Here is how I ran this bisect:
@rustyrussell please have a look, as your commit seems to have caused the failing test. @NicolasDorier have you noticed this on any other ARM? I can try it with Alpine on the same ARM (in a chroot).
Have a peaceful weekend!