Closed jmastr closed 1 year ago
Hi, thanks for reporting!
If you run make check V=1
it should tell you which check is causing the bus error. If possible could you run that check under gdb and report the stack trace when it occurs?
Also could you try with the current master or at least v0.8.9?
Sure.
not working setup:
$ dpkg -l|grep 1.20230405-1
ii linux-libc-dev:armhf 1:1.20230405-1 armhf Linux support headers for userspace development
ii raspberrypi-bootloader 1:1.20230405-1 armhf Raspberry Pi bootloader
ii raspberrypi-kernel 1:1.20230405-1 armhf Raspberry Pi bootloader
ii vcdbg 1:1.20230405-1 armhf Tool for debugging VideoCore
$ uname -a
Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
release_0.8.9
:
...
PASS: test_bech32
PASS: test_psbt
PASS: test_psbt_limits
PASS: test_clear
PASS: test_tx
../tools/build-aux/test-driver: line 109: 2918 Bus error "$@" > $log_file 2>&1
FAIL: test_descriptor
============================================================================
Testsuite summary for libwallycore 0.8.9
============================================================================
# TOTAL: 6
# PASS: 5
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See src/test-suite.log
============================================================================
make[4]: *** [Makefile:1802: test-suite.log] Error 1
make[4]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[3]: *** [Makefile:1910: check-TESTS] Error 2
make[3]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[2]: *** [Makefile:2051: check-am] Error 2
make[2]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[1]: *** [Makefile:1694: check-recursive] Error 1
make[1]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make: *** [Makefile:433: check-recursive] Error 1
master
...
PASS: test_bech32
PASS: test_psbt
PASS: test_psbt_limits
PASS: test_clear
PASS: test_coinselection
PASS: test_tx
../tools/build-aux/test-driver: line 109: 10511 Bus error "$@" > $log_file 2>&1
FAIL: test_descriptor
============================================================================
Testsuite summary for libwallycore 0.8.9
============================================================================
# TOTAL: 7
# PASS: 6
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See src/test-suite.log
============================================================================
make[4]: *** [Makefile:1867: test-suite.log] Error 1
make[4]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[3]: *** [Makefile:1975: check-TESTS] Error 2
make[3]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[2]: *** [Makefile:2123: check-am] Error 2
make[2]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[1]: *** [Makefile:1759: check-recursive] Error 1
make[1]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make: *** [Makefile:433: check-recursive] Error 1
Great, thanks! looks like the descriptor test is failing with some kind of alignment issue. From your wally source directory, you can run the test under gdb using LD_LIBRARY_PATH=./src/.libs/ gdb src/.libs/test_descriptor
. If you run it under gdb, can you paste the where
output when it crashes?
(gdb) run
...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Program received signal SIGBUS, Bus error.
0xf7dfb518 in ripemd160_done (ctx=0xfffef170, ctx@entry=0xf7dfb580 <ripemd160+60>, res=res@entry=0x24193) at ./ccan/ccan/endian/endian.h:237
237 return CPU_TO_LE32(native);
(gdb) where
#0 0xf7dfb518 in ripemd160_done (ctx=0xfffef170, ctx@entry=0xf7dfb580 <ripemd160+60>, res=res@entry=0x24193) at ./ccan/ccan/endian/endian.h:237
#1 0xf7dfb580 in ripemd160 (ripemd=ripemd@entry=0x24193, p=0xfffef170, p@entry=0xfffef1d0, size=2096, size@entry=32) at ccan/ccan/crypto/ripemd160/ripemd160.c:346
#2 0xf7de0c40 in wally_hash160 (bytes=bytes@entry=0xfffef215 "D\002", bytes_len=<optimized out>, bytes_out=bytes_out@entry=0x24193 "\367\330\327\331", <incomplete sequence \367>, len=len@entry=20) at internal.c:305
#3 0xf7ddedd4 in generate_pk_h (written=0xfffef338, script_len=520, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, node=0x24430, ctx=<optimized out>) at descriptor.c:1076
#4 generate_pkh (ctx=<optimized out>, node=0x24430, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, script_len=520, written=0xfffef338, written@entry=0xfffef330) at descriptor.c:1138
#5 0xf7ddc130 in generate_script (ctx=ctx@entry=0xfffef40c, node=node@entry=0x24430, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, script@entry=0x0, script_len=520, script_len@entry=0, written=written@entry=0xfffef490) at descriptor.c:1796
#6 0xf7ddf78c in node_generate_script (written=0xfffef490, len=0, bytes_out=0x0, index=0, depth=<optimized out>, ctx=0xfffef404) at descriptor.c:2426
#7 wally_descriptor_to_script (descriptor=0x243a0, depth=<optimized out>, index=0, variant=0, multi_index=<optimized out>, multi_index@entry=0, child_num=<optimized out>, child_num@entry=0, flags=<optimized out>, flags@entry=0, bytes_out=<optimized out>, bytes_out@entry=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, len=<optimized out>,
len@entry=520, written=<optimized out>, written@entry=0xfffef488) at descriptor.c:2518
#8 0x0001085c in check_descriptor_to_script (test=<optimized out>) at ctest/test_descriptor.c:1739
#9 main () at ctest/test_descriptor.c:1831
Awesome, thanks. Can you confirm https://github.com/ElementsProject/libwally-core/pull/397 fixes this?
No, unfortunately same error:
(gdb) run
...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Program received signal SIGBUS, Bus error.
0xf7dfb510 in ripemd160_done (ctx=0xfffef178, ctx@entry=0xf7dfb578 <ripemd160+60>, res=res@entry=0x24193) at ./ccan/ccan/endian/endian.h:237
237 return CPU_TO_LE32(native);
(gdb) where
#0 0xf7dfb510 in ripemd160_done (ctx=0xfffef178, ctx@entry=0xf7dfb578 <ripemd160+60>, res=res@entry=0x24193) at ./ccan/ccan/endian/endian.h:237
#1 0xf7dfb578 in ripemd160 (ripemd=ripemd@entry=0x24193, p=0xfffef178, p@entry=0xfffef1d8, size=2096, size@entry=32) at ccan/ccan/crypto/ripemd160/ripemd160.c:346
#2 0xf7de0c3c in wally_hash160 (
bytes=bytes@entry=0xfffef214 "\317)\361\270w\377\064!\002\306\004\177\224A\355}m0E@n\225\300|\330\\w\216K\214\357<\247\253\254\t\271\\p\236\345\276VEhc\t\306\346sm\275\223\224\a\ȃC\323\317)\361\270w\377\064\016,\262\322Y\317d\374\367\060D\002",
bytes_len=<optimized out>, bytes_out=bytes_out@entry=0x24193 "\367\330\327\331", <incomplete sequence \367>, len=len@entry=20) at internal.c:305
#3 0xf7ddedd0 in generate_pk_h (written=0xfffef338, script_len=520, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, node=0x24430, ctx=<optimized out>) at descriptor.c:1077
#4 generate_pkh (ctx=<optimized out>, node=0x24430, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, script_len=520, written=0xfffef338, written@entry=0xfffef330) at descriptor.c:1139
#5 0xf7ddc130 in generate_script (ctx=ctx@entry=0xfffef40c, node=node@entry=0x24430, script=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, script@entry=0x0, script_len=520, script_len@entry=0, written=written@entry=0xfffef490)
at descriptor.c:1797
#6 0xf7ddf788 in node_generate_script (written=0xfffef490, len=0, bytes_out=0x0, index=0, depth=<optimized out>, ctx=0xfffef404) at descriptor.c:2427
#7 wally_descriptor_to_script (descriptor=0x243a0, depth=<optimized out>, index=0, variant=0, multi_index=<optimized out>, multi_index@entry=0, child_num=<optimized out>, child_num@entry=0, flags=<optimized out>, flags@entry=0,
bytes_out=<optimized out>, bytes_out@entry=0x24190 "v\251\024\367\330\327\331", <incomplete sequence \367>, len=<optimized out>, len@entry=520, written=<optimized out>, written@entry=0xfffef488) at descriptor.c:2519
#8 0x0001085c in check_descriptor_to_script (test=<optimized out>) at ctest/test_descriptor.c:1739
#9 main () at ctest/test_descriptor.c:1831
$ git status
On branch descriptor_align
Your branch is up to date with 'origin/descriptor_align'.
nothing to commit, working tree clean
hmm, what does grep HAVE_UNALIGNED_ACCESS src/config.h
show you?
$ grep HAVE_UNALIGNED_ACCESS src/config.h
#define HAVE_UNALIGNED_ACCESS 1
Thanks. seems like configure has incorrectly assumed you can access unaligned memory when in fact you cant. The configure check for that is:
int main(void){static int a[2];return *((int*)(((char*)a)+1)) != 0;}
If you compile and run that: gcc <cflags> foo.c && ./a.out
does it crash?
which <cflags>
shall I set? A plain:
$ gcc foo.c && ./a.out
$ echo $?
0
generated no output, so no crash.
I mean, with kernel 5.15.x it works on the Pi, with 6.1.y it does not. Maybe a bug in the kernel? Just a wild guess...
Hmm, whether or not the CPU can access words on unaligned boundaries is a CPU feature, and shouldn't be affected by the kernel (unless its configurable on arm64?).
edit: It turns out is is configurable on later arm arches
Can you try running c-lightnings configurator check for unaligned access?
#include <string.h>
int main(int argc, char *argv[]) {
char pad[sizeof(int *) * 1];
memcpy(pad, argv[0], sizeof(pad));
int *x = (int *)pad, *y = (int *)(pad + 1);
return *x == *y;
}
The other interesting question is whether HAVE_UNALIGNED_ACCESS
gets set correctly on the other kernel version, and whether the compiled a.out crashes on the other kernel too.
5.15.84:
$ cat foo.c
#include <string.h>
int main(int argc, char *argv[]) {
char pad[sizeof(int *) * 1];
memcpy(pad, argv[0], sizeof(pad));
int *x = (int *)pad, *y = (int *)(pad + 1);
return *x == *y;
}
$ gcc foo.c && ./a.out
$ echo $?
0
6.1.21:
$ uname -a
Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
$ cat foo.c
#include <string.h>
int main(int argc, char *argv[]) {
char pad[sizeof(int *) * 1];
memcpy(pad, argv[0], sizeof(pad));
int *x = (int *)pad, *y = (int *)(pad + 1);
return *x == *y;
}
$ gcc foo.c && ./a.out
$ echo $?
0
5.15.84:
$ grep HAVE_UNALIGNED_ACCESS src/config.h
#define HAVE_UNALIGNED_ACCESS 1
$ cat foo.c
int main(void){static int a[2];return *((int*)(((char*)a)+1)) != 0;}
$ gcc foo.c && ./a.out
$ echo $?
0
Thanks, I'll take this away for a bit and do some local investigation. It seems you do have unaligned access yet you are still able to get a bus error out of this code, very strange.
Hi @jmastr some more things to try:
1) Do the tests pass if you pass --enable-debug
when configuring?
2) The following diff will disable aligned access, I'm interested to know if this fixes your crash. Please apply it on top of the updated https://github.com/ElementsProject/libwally-core/pull/397.
diff --git a/src/ccan_config.h b/src/ccan_config.h
index d74aa5d5..86c0a6f6 100644
--- a/src/ccan_config.h
+++ b/src/ccan_config.h
@@ -29,7 +29,7 @@
#endif
#if HAVE_UNALIGNED_ACCESS
-#define alignment_ok(p, n) 1
+#define alignment_ok(p, n) 0
#else
#define alignment_ok(p, n) ((size_t)(p) % (n) == 0)
#endif
Also I'm wondering if your compiler version is different between the 2 kernel versions?
I am assuming that whats happening is the architecture supports unaligned loads and stores e.g. moving a value to/from unaligned addresses ands register. In hash160, the load that is crashing is used to convert the endianness of the result, using a macro which a smart compiler could easily convert to a byteswap cpu instruction.
However https://developer.arm.com/documentation/ddi0290/g/unaligned-and-mixed-endian-data-access-support/unaligned-access-support/armv6-unaligned-data-access-restrictions states "Swap and synchronization primitives, multiple-word or coprocessor access produce an alignment fault regardless of the setting of the A bit". If the compiler has swapped the instruction and its called on an unaligned address then this would explain the fault.
If that is the cause, then the fix will be to undefine HAVE_UNALIGNED_ACCESS for arm I think.
Hey @jgriffiths
Please have a look at my first comment. I currently use for my builds:
$ ./configure --enable-debug --enable-export-all --enable-swig-python
so no they don't. They do if I remove --enable-swig-python
though...
Change of src/ccan_config.h
on top of origin/descriptor_align
:
$ git diff origin/master
diff --git a/src/ccan_config.h b/src/ccan_config.h
index d74aa5d..86c0a6f 100644
--- a/src/ccan_config.h
+++ b/src/ccan_config.h
@@ -29,7 +29,7 @@
#endif
#if HAVE_UNALIGNED_ACCESS
-#define alignment_ok(p, n) 1
+#define alignment_ok(p, n) 0
#else
#define alignment_ok(p, n) ((size_t)(p) % (n) == 0)
#endif
diff --git a/src/descriptor.c b/src/descriptor.c
index a79c5c4..3a1a2f0 100644
--- a/src/descriptor.c
+++ b/src/descriptor.c
@@ -1062,18 +1062,19 @@ static int generate_pk_k(ms_ctx *ctx, ms_node *node,
static int generate_pk_h(ms_ctx *ctx, ms_node *node,
unsigned char *script, size_t script_len, size_t *written)
{
- unsigned char buff[1 + EC_PUBLIC_KEY_UNCOMPRESSED_LEN];
+ /* Note 4 instead of 1 here to align the data to hash to 32 bits */
+ unsigned char buff[4 + EC_PUBLIC_KEY_UNCOMPRESSED_LEN];
int ret = WALLY_OK;
if (script_len >= WALLY_SCRIPTPUBKEY_P2PKH_LEN - 1) {
- ret = generate_pk_k(ctx, node, buff, sizeof(buff), written);
+ ret = generate_pk_k(ctx, node, buff+3, sizeof(buff)-3, written);
if (ret == WALLY_OK) {
if (node->child->flags & NF_IS_XONLY)
return WALLY_EINVAL;
script[0] = OP_DUP;
script[1] = OP_HASH160;
script[2] = HASH160_LEN;
- ret = wally_hash160(&buff[1], *written - 1, script + 3, HASH160_LEN);
+ ret = wally_hash160(&buff[4], *written - 1, script + 3, HASH160_LEN);
script[3 + HASH160_LEN] = OP_EQUALVERIFY;
}
}
$ uname -a
Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
$ make check
...
OK
.....
----------------------------------------------------------------------
Ran 5 tests in 0.002s
OK make[3]: Leaving directory '/home/rddl/tmp/libwally-core/src' make[2]: Leaving directory '/home/rddl/tmp/libwally-core/src' make[1]: Leaving directory '/home/rddl/tmp/libwally-core/src' make[1]: Entering directory '/home/rddl/tmp/libwally-core' make[1]: Nothing to be done for 'check-am'. make[1]: Leaving directory '/home/rddl/tmp/libwally-core'
That worked! :heavy_check_mark:
3. Compiler versions:
$ uname -a Linux raspberrypi 5.15.84-v7l+ #1613 SMP Thu Jan 5 12:01:26 GMT 2023 armv7l GNU/Linux $ gcc --version gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110 Copyright (C) 2020 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
and:
$ uname -a Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux $ gcc --version gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110 Copyright (C) 2020 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The thing that I just noticed. The `5.15.84-v7l+` recognizes the processor as `armv7l`, which is 32 bit. After the upgrade to `6.1.21-v8+` it gets recognized as `aarch64`, which is 64 bit. It is the same physical cpu.
The tests also pass by directly applying either:
#if HAVE_UNALIGNED_ACCESS
#define alignment_ok(p, n) 0
#endif
or:
#if HAVE_UNALIGNED_ACCESS
#define alignment_ok(p, n) ((size_t)(p) % (n) == 0)
#endif
to master
. So from my side there is no need for:
commit 7510fe33cc403ec3b950558cdd00b48305b27710
Author: Jon Griffiths <jon_p_griffiths@yahoo.com>
Date: Fri May 19 23:36:48 2023 +1200
descriptor: align pk_h data for hash160
Fixes a crash on platforms which dont support unaligned access. Thanks
@jmastr for the bug report.
But I also don't know if it is necessary somewhere else.
@jmastr Many thanks for helping to debug this.
As suspected, arm64 unaligned access support is more limited than wally requires given the compiler is free to substitute simple loads and stores with more complex instructions. I've updated https://github.com/ElementsProject/libwally-core/pull/397 to cater for this, would you mind testing it?
For coverage purposes could you add --enable-elements
to your build/test run, and try with and without --enable-debug
as well? Thanks!
The thing that I just noticed. The 5.15.84-v7l+ recognizes the processor as armv7l, which is 32 bit. After the upgrade to 6.1.21-v8+ it gets recognized as aarch64, which is 64 bit. It is the same physical cpu.
This seems like an odd change to happen just by a kernel upgrade, given that the processor ends up configured differently. If wally built on the old kernel runs without crashing, and then crashes after the kernel is upgraded, I don't see how that can be seen as anything other than a (typically verboten) user space regression.
@jgriffiths Almost, with:
$ ./configure --enable-debug --enable-export-all --enable-swig-python --enable-elements
$ make check
...
...
PASS: test_bech32
PASS: test_psbt
PASS: test_psbt_limits
PASS: test_clear
PASS: test_coinselection
PASS: test_tx
../tools/build-aux/test-driver: line 109: 8305 Bus error "$@" > $log_file 2>&1
FAIL: test_descriptor
PASS: test_elements_tx
============================================================================
Testsuite summary for libwallycore 0.8.9
============================================================================
# TOTAL: 8
# PASS: 7
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See src/test-suite.log
============================================================================
make[4]: *** [Makefile:1867: test-suite.log] Error 1
make[4]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[3]: *** [Makefile:1975: check-TESTS] Error 2
make[3]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[2]: *** [Makefile:2123: check-am] Error 2
make[2]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make[1]: *** [Makefile:1759: check-recursive] Error 1
make[1]: Leaving directory '/home/rddl/tmp/libwally-core/src'
make: *** [Makefile:433: check-recursive] Error 1
Now in test-driver
with $ ./configure --enable-export-all --enable-swig-python --enable-elements
it looks basically as above.
@jmastr Thanks, thats progress. Do you mind using gdb on test_tx to determine where the bus error is coming from there? Same procedure as test_descriptor previously.
FAIL: test_descriptor
It's in test_descriptor again
It's simpler than that. On a fresh checkout with configure and everything I get:
$ grep HAVE_UNALIGNED_ACCESS src/config.h
#define HAVE_UNALIGNED_ACCESS 1
So the:
#if defined(HAVE_UNALIGNED_ACCESS) && defined(__aarch64__)
#undef HAVE_UNALIGNED_ACCESS
#define HAVE_UNALIGNED_ACCESS 0
#endif
does not get called properly.
hmm, __aarch64__
can't be the correct define for you.
Can you run echo | gcc -dM -E -
and paste the output here? If you have clang then running the same command for clang would also be helpful, thanks!
$ echo | gcc -dM -E - | sort > gcc.dMe.txt
Thanks, your compiler is building for 32 bit arm rather than 64, so I've swapped the check to 32 bit. Can you please try again with the updated PR?
@jgriffiths works with and without enable-debug
and on both 5.15.84-v7l+
and 6.1.21-v8+
:+1:
Thanks for your effort!
@jmastr Excellent! Many thank for reporting and helping to debug this issue. I've merged the fix and will be making a new release shortly.
Rather than backporting the fix I suggest you stay on master/use the latest release once its made, as there are a large number of bug fixes and some security fixes made since the 0.8.6 version that you were initially testing.
Hi there,
after upgrading my Rasbian Bullseye to latest stable, I get a reproducible Bus error, when running:
on a fresh checkout/autogen.sh/configure:
Took me two days to figure out that this happened due to upgrading
linux-libc-dev
.I downgraded:
To version
1.20230106-1_armhf
andmake check
works again.