ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Qualcomm MSM8974PRO-AC ARM: gossipd/test/run-bench-find_route broken by a2fa699 #2818

Open jsarenik opened 5 years ago

jsarenik commented 5 years ago

Issue and Steps to Reproduce

On armv7l with up-to-date Ubuntu 19.04, I get the following error (with DEVELOPER set to either 1 or 0) on every version from a2fa699 up to current master:

# gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Bus error
# echo $?
135

git bisect led me to this:

a2fa699e0ea00d77d755248685bc7cfeec522e2f is the first bad commit
commit a2fa699e0ea00d77d755248685bc7cfeec522e2f
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Mon Apr 8 19:28:32 2019 +0930

I first created a bisect script, then identified that this issue is not present in v0.7.0 but is present in v0.7.1. Here is how I ran the bisect:

cat > ~/bisect-gossipd-test-run-bench-find_route.sh <<EOF
#!/bin/sh

{
git clean -xfd
git submodule deinit --all -f
export DEVELOPER=0
./configure || true
make -j4 gossipd/test/run-bench-find_route \
  && gossipd/test/run-bench-find_route
} && echo Success || { echo FAIL; exit 1; }
EOF
chmod a+x ~/bisect*.sh
git bisect start v0.7.1 v0.7.0 --
git bisect run ~/bisect-gossipd-test-run-bench-find_route.sh
git bisect reset

@rustyrussell please have a look, as your commit seems to have caused the failing test. @NicolasDorier have you noticed this on any other ARM device? I can try it with Alpine on the same ARM device (in a chroot).

Have a peaceful weekend!

NicolasDorier commented 5 years ago

@jsarenik I only built on arm32, but never tried myself. BTCPayServer does not support clightning on arm32 yet, because we need lightning charge and lightning spark to also support it. (this will be the case in next release)

jsarenik commented 5 years ago

This test does not fail on aarch64 (ARM64) Alpine Linux (musl libc).

$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 1 msec (1576306 nanoseconds per route)
 Length 5: 1
$ echo $?
0
$ uname -a
Linux linaro-developer 4.14.0-qcomlt-arm64 #1 SMP PREEMPT Wed Jan 30 04:14:16 UTC 2019 aarch64 Linux

ZmnSCPxj commented 5 years ago

The commit is large, so it is hard to determine which part introduced the issue. Is it possible to run the test in gdb and get a backtrace?

jsarenik commented 5 years ago

Sure, will do.

jsarenik commented 5 years ago

First things first: I was also able to reproduce the issue on 32-bit ARM running musl libc. I did the gdb debugging on this Alpine Linux because it has no issue with debugging symbols, unlike Ubuntu. (Ubuntu hardwires /lib/ld-linux-armhf.so.3 into binaries at compile time; this file is a symlink to arm-linux-gnueabihf/ld-2.29.so, and the debugging symbols are in /usr/lib/debug/lib/arm-linux-gnueabihf/ld-2.29.so, which gdb does not find, even after I tried some magic.)

So, here we go:

localhost:~/lightning-auto-test/lightning# uname -a
Linux localhost 3.4.0-lineageos-gb263a89 #1 SMP PREEMPT Wed Oct 24 09:09:32 UTC 2018 armv7l Linux
localhost:~/lightning-auto-test/lightning# git rev-parse --short HEAD          
0ae20399
localhost:~/lightning-auto-test/lightning# ldd gossipd/test/run-bench-find_route
        /lib/ld-musl-armhf.so.1 (0xb6f46000)
        libc.musl-armhf.so.1 => /lib/ld-musl-armhf.so.1 (0xb6f46000)
localhost:~/lightning-auto-test/lightning# gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Bus error
localhost:~/lightning-auto-test/lightning# echo $?
135
localhost:~/lightning-auto-test/lightning# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route 
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...

Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)
    at ccan/ccan/crypto/siphash24/siphash24.c:86
86                              add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) bt                                                                        
#0  0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)                    
    at ccan/ccan/crypto/siphash24/siphash24.c:86                               
#1  0x2a015610 in siphash24_update (ctx=0xbefffaa0, p=0x2a0c5045, size=33)     
    at ccan/ccan/crypto/siphash24/siphash24.c:116                              
#2  0x2a015ee8 in siphash24 (seed=0x2a0be228 <siphashseed>, p=0x2a0c5045,      
    size=33) at ccan/ccan/crypto/siphash24/siphash24.c:169                     
#3  0x2a032eac in node_map_hash_key (pc=0x2a0c5045)                            
    at gossipd/test/../routing.c:214
#4  0x2a031f8c in node_map_get (ht=0x2a0c4904, k=0x2a0c5045)                   
    at gossipd/test/../routing.h:130
#5  0x2a032fa8 in get_node (rstate=0x2a0c4804, id=0x2a0c5045)                  
    at gossipd/test/../routing.c:241
#6  0x2a0336d4 in new_chan (rstate=0x2a0c4804, scid=0xbefffb80,                
    id1=0x2a0c5045, id2=0x2a0c5024, satoshis=...)                              
    at gossipd/test/../routing.c:413
#7  0x2a03bfac in add_connection (rstate=0x2a0c4804, nodes=0x2a0c5024, from=1, 
    to=0, base_fee=436, proportional_fee=944, delay=113)
    at gossipd/test/run-bench-find_route.c:119                                 
#8  0x2a03c228 in populate_random_node (rstate=0x2a0c4804, nodes=0x2a0c5024,   
    n=1) at gossipd/test/run-bench-find_route.c:158
#9  0x2a03c638 in main (argc=1, argv=0xbefffd54)                               
    at gossipd/test/run-bench-find_route.c:226
(gdb) c
Continuing.

Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
localhost:~/lightning-auto-test/lightning# 

All this is on current master (0ae20399).

A more thorough debug log is in the attachment. The file was created by running:

gdb -batch -n -ex 'set pagination off' -ex 'set logging on' -ex 'echo >> Running the program...\n' -ex run -ex 'echo >> bt\n' -ex bt -ex 'echo >> bt full\n' -ex 'bt full' -ex 'echo >> thread apply all bt full\n' -ex 'thread apply all bt full' -ex 'echo >> c' -ex c --args gossipd/test/run-bench-find_route

gdb.txt

jsarenik commented 5 years ago

Could it be just caused by the funny setup I use (i.e. running chroots on top of Android)?

ZmnSCPxj commented 5 years ago

Can you do disp data at the crash point? It might be a "bus error" due to an alignment problem: the device you are running on might not be able to access a u64 at an address that is not a multiple of 4 or 8. The p=0x2a0c5045 means the input address is not aligned, so this might be a misaligned access that the CPU does not support.

https://en.wikipedia.org/wiki/Bus_error#Unaligned_access

Do you know the exact chipset you are running on?
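The arithmetic behind that observation can be sketched in C. This is only an illustration, not code from ccan or lightning, and the helper names misalignment and is_aligned are hypothetical. For p=0x2a0c5045 from the backtrace, the remainder is 5 modulo 8 and 1 modulo 4, so the address is aligned for neither a 4-byte nor an 8-byte access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes `addr` sits past the previous
 * multiple of `align`. 0 means aligned. `align` must be a power of two. */
size_t misalignment(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size_t)(addr & (align - 1));
}

/* Hypothetical convenience wrapper for real pointers. */
int is_aligned(const void *p, size_t align)
{
	return misalignment((uintptr_t)p, align) == 0;
}
```

Calling misalignment(0x2a0c5045, 8) gives 5, which is why casting that pointer to const uint64_t * and dereferencing it can fault on CPUs or kernels that do not fix up unaligned accesses.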

jsarenik commented 5 years ago

As for the chipset, I hope this helps; if not, please tell me what to run.

# cat /proc/cpuinfo 
Processor   : ARMv7 Processor rev 1 (v7l)
processor   : 0
BogoMIPS    : 38.40

processor   : 1
BogoMIPS    : 38.40

processor   : 2
BogoMIPS    : 38.40

processor   : 3
BogoMIPS    : 38.40

Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0x06f
CPU revision    : 1

Hardware    : Qualcomm MSM8974PRO-AC
Revision    : 0000
Serial      : 0000000000000000

jsarenik commented 5 years ago

Some more hardware hints from the host system shell:

cancro:/ # cat /system/build.prop | grep -i MI                                 
ro.product.model=MI Cancro
ro.product.brand=Xiaomi
ro.product.manufacturer=Xiaomi
ro.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys
# from device/xiaomi/cancro/system.prop
rild.libpath=/vendor/lib/libril-qc-qmi-1.so
mm.enable.smoothstreaming=true
ro.fm.transmitter=false
persist.data.qmi.adb_logmask=0
persist.demo.hdmirotationlock=false
ro.hdmi.enable=true
ro.com.google.clientidbase=android-xiaomi
dalvik.vm.heapgrowthlimit=192m
dalvik.vm.heapminfree=2m
ro.bootimage.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys

jsarenik commented 5 years ago

@ZmnSCPxj disp data:

In interactive gdb session:

# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route 
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...

Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffab0, p=0x2a0c5045, len=25)
    at ccan/ccan/crypto/siphash24/siphash24.c:86
86              add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) disp data
1: data = (const unsigned char *) 0x2a0c504d "1\362$\036\335|\035֏0 \263\004\030\064\205\341\351\374\070}K\261\224K\003\017\rcIt|Ŷ\005\230k\342\237\371\206\375\342\364o\030<ۭ\023\222\313\036\251\350\377t\a\003\001\263\263y\220\060,D|r\003\323\342db(\374\201\255\341\366\233\020\201<\223-\305\357\202\061n\003\322\320%\200\200\340K\234\257V\227\371܄\034\364\330J\370\n\303\345\267-\365\363h\210\311Ti/\003\315\350m\363\370\270ʬ\340VC\333K\307\001\177\321\363/;\002\243uE\206\067֛\232\024[\244\003\360\001\232\243+r\241\341≢)\321$^]$ \363\214ۿu\366\224\353\201\376\260\372\264\260\002\347\374\330\b\312\071V", <incomplete sequence \334>...
(gdb) c
Continuing.

Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
# 

jsarenik commented 5 years ago

OK, it might be an issue with the chip. I have verified that the test works well on i.MX6 (which is also 32-bit):

me@mail:~/lightning-auto-test/lightning$ git rev-parse --short HEAD
0ae20399
me@mail:~/lightning-auto-test/lightning$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 3 msec (3197234 nanoseconds per route)
 Length 8: 1
me@mail:~/lightning-auto-test/lightning$ cat /proc/cpuinfo 
processor   : 0
model name  : ARMv7 Processor rev 10 (v7l)
BogoMIPS    : 3.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc09
CPU revision    : 10

Hardware    : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision    : 61013
Serial      : 0000000000000000
me@mail:~/lightning-auto-test/lightning$ uname -a
Linux mail 4.9.150-imx6-sr #1 SMP Sun Jun 9 06:05:39 UTC 2019 armv7l GNU/Linux

jsarenik commented 5 years ago

Closing it for now. If someone else faces the same issue, they can add comments, but I no longer think that this is a general issue.

jsarenik commented 5 years ago

@ZmnSCPxj maybe add a label like wontfix or hw-issue?

ZmnSCPxj commented 5 years ago

Yes, but ccan "should" work even on CPUs that bus error on unaligned access; that is the intent of ccan. What do you think @rustyrussell ? Or move this to https://github.com/rustyrussell/ccan/ ?

jsarenik commented 5 years ago

@ZmnSCPxj any idea how I can reproduce straight on ccan?

jsarenik commented 5 years ago

I will try to run make check on ccan...

ZmnSCPxj commented 5 years ago

Not sure. You might need a boutique test on ccan that specifically performs siphash on an array of char, with the important tweak that you specifically pass in a misaligned pointer e.g. you have:

char buffer[1000];

(void) siphash24(&buffer[1], sizeof(buffer) - 1);

Or maybe malloc it, since a char array might be placed by the compiler at an unaligned address, in which case &buffer[1] might accidentally end up aligned. You would have to break on the siphash24 function in gdb and check the actual pointer. However, malloc is guaranteed to return aligned addresses, so offsetting a pointer returned by malloc by one byte reliably gives you a misaligned pointer.
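The malloc variant of that idea could look roughly like the sketch below. It is an assumption-laden illustration rather than a ccan test: misaligned_copy is a hypothetical helper name, and the hash call itself is left out so that the block compiles without ccan.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes into a fresh heap block so that
 * the copy starts one byte past the malloc'ed base. malloc() returns
 * storage suitably aligned for any ordinary object, so base + 1 is an
 * odd address and cannot be aligned for uint64_t.
 * The caller must free(*base_out), not the returned pointer. */
unsigned char *misaligned_copy(const void *src, size_t len, void **base_out)
{
	unsigned char *base = malloc(len + 1);

	if (!base)
		return NULL;
	/* Sanity check on the assumption stated above. */
	assert(((uintptr_t)(base + 1) & 1) == 1);
	memcpy(base + 1, src, len);
	*base_out = base;
	return base + 1;
}
```

A reproduction would then pass the returned pointer and len to the function under test, for example siphash24 with the (seed, p, size) argument order visible in the backtrace, and watch for SIGBUS on the affected hardware.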

jsarenik commented 5 years ago

Yes, but ccan "should" work even on CPUs that bus error on unaligned access; that is the intent of ccan. What do you think @rustyrussell ? Or move this to https://github.com/rustyrussell/ccan/ ?

I have made https://github.com/rustyrussell/ccan/issues/84

jsarenik commented 5 years ago

Not sure. You might need a boutique test on ccan that specifically performs siphash on an array of char, with the important tweak that you specifically pass in a misaligned pointer e.g. you have:

char buffer[1000];

(void) siphash24(&buffer[1], sizeof(buffer) - 1);

Or maybe malloc it, since a char array might be placed by the compiler at an unaligned address, in which case &buffer[1] might accidentally end up aligned. You would have to break on the siphash24 function in gdb and check the actual pointer. However, malloc is guaranteed to return aligned addresses, so offsetting a pointer returned by malloc by one byte reliably gives you a misaligned pointer.

Hi @ZmnSCPxj! Please have a look at https://github.com/jsarenik/siphash24-repro and see if it makes sense. After compilation it currently ends with a Segmentation fault on the CPU that has the alignment issue, but ends successfully on i.MX6. In the meantime I spoke to someone who noticed this alignment issue on Qualcomm chips years ago, and he says it has something to do with the fact that it is a Krait CPU.

I think the following issue may also be related: https://github.com/tensorflow/tensorflow/issues/19158

jsarenik commented 5 years ago

For reference: https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt
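In portable userspace C, a common way to avoid this class of fault is to read each word through memcpy instead of casting the byte pointer to const uint64_t *. The sketch below shows that pattern; it is not the ccan code or a proposed patch, and load_u64 and xor_words are hypothetical names, with xor_words standing in for a word-at-a-time hash loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 64-bit word from any address, aligned or not. The value is in
 * host byte order, exactly as a direct dereference would give. */
uint64_t load_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Stand-in for a hash inner loop: XOR together every full 64-bit word
 * in the buffer. Trailing bytes (len % 8) are ignored in this sketch. */
uint64_t xor_words(const void *p, size_t len)
{
	const unsigned char *data = p;
	uint64_t acc = 0;

	assert(p != NULL || len == 0);
	while (len >= sizeof(uint64_t)) {
		acc ^= load_u64(data);
		data += sizeof(uint64_t);
		len -= sizeof(uint64_t);
	}
	return acc;
}
```

Because no uint64_t pointer is ever formed from the byte address, the result does not depend on where the buffer starts, and compilers usually reduce the memcpy to a plain load on targets where unaligned loads are allowed.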

NicolasDorier commented 5 years ago

This should stay open until the ccan lib has merged your fix and clightning has been updated.

jsarenik commented 5 years ago

OK. Reopening. Thanks for the feedback @NicolasDorier !

jsarenik commented 4 years ago

Just a ping. The bug is still present in current master (ede5f5be3cc6544bdef39db51b8a39f1821bfccc).

jsarenik commented 4 years ago

@ZmnSCPxj please have a look and see whether the code in https://github.com/jsarenik/siphash24-repro makes sense.

jsarenik commented 1 year ago

Just an update. I do not have this hardware anymore. It died at the beginning of this year. But I am not closing the issue (I tried that in the past :)

https://github.com/ElementsProject/lightning/issues/2818#issuecomment-521938432