Closed nathanchance closed 2 years ago
LLVM has some known issues with weak aliases and CFI, which we have in the past worked around by changing the code to use a weak function instead.
Something like this appears to work for me. This is more consistent with the rest of the kernel, I do not see any other uses of __weak __alias(...)
.
diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..6ced67378d1d 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -37,9 +37,11 @@ static inline void blake2s_increment_counter(struct blake2s_state *state,
state->t[1] += (state->t[0] < inc);
}
-void blake2s_compress(struct blake2s_state *state, const u8 *block,
- size_t nblocks, const u32 inc)
- __weak __alias(blake2s_compress_generic);
+void __weak blake2s_compress(struct blake2s_state *state, const u8 *block,
+ size_t nblocks, const u32 inc)
+{
+ return blake2s_compress_generic(state, block, nblocks, inc);
+}
void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
size_t nblocks, const u32 inc)
Pending patch: https://git.kernel.org/nathan/c/9fc1d6db9fa21034b927fdd5d16f1c887fdad2a6
@samitolvanen Do you have any reference to the issues with weak aliases and CFI? I see #538 (and its upstream report https://github.com/llvm/llvm-project/issues/41887) but that was with Thin LTO and it seems like https://github.com/llvm/llvm-project/commit/ea314fd476166405905103a29dd8578ff6589160 was the fix for that (and the LLVM issue notes that full LTO was fine). I am just worried upstream is going to respond with "fix the compiler".
Alternative approach: https://lore.kernel.org/linux-crypto/20220119135450.564115-1-Jason@zx2c4.com/
Seems like we'd be much better off just fixing Clang's broken CFI though?
Seems like we'd be much better off just fixing Clang's broken CFI though?
I agree, but I suspect that might take a while. Is this issue trivially reproducible in a stand-alone program? If so, Nathan, would you mind filing a bug with upstream LLVM?
I agree, but I suspect that might take a while.
If you'd like to have an impact on this, please reply to my email on LKML:
https://lore.kernel.org/lkml/CAHmME9ow7TxCaYYayRn9rdJJSdQ48tWQgdrW00g7mHaWvVJ+Zw@mail.gmail.com/
I agree, but I suspect that might take a while. Is this issue trivially reproducible in a stand-alone program? If so, Nathan, would you mind filing a bug with upstream LLVM?
I tried extracting the problematic functions and creating a small reproducer on Wednesday morning but I did not have any luck (there was no CFI failure). I am working through other things in my backlog today, I will try and double back around to this on Monday.
Dunno how to reproduce standalone either, but here's a place to start:
alias.c
int a(void *v) __attribute__((weak)) __attribute__((__alias__("b")));
int b(void *v)
{
return 8 + (unsigned long)v;
}
call.c:
int call(int(*f)(void*), void *v)
{
return f(v);
}
main.c
int call(int(*f)(void*), void *v);
int a(void *);
int main(int argc, char *argv[])
{
return call(a, argv);
}
@nathanchance Any luck making a minimal non-kernel reproducer for this? Once my patch goes upstream, it may sometime after becoming annoying to revert, or hard to remember where to revert to, making this bug harder to fix properly eventually in llvm. So I'd recommend we get that worked out sooner rather than later.
I futzed a bit with the code in https://github.com/ClangBuiltLinux/linux/issues/1567#issuecomment-1018884115 to no avail. I tried this using -flto=full -fvisibility={?} -fuse-ld=lld -fsanitize=cfi
, with various options for {?}
. I also tried a slight variation on this to more closely mirror what's happening in the kernel code:
main.c
#include "call.h"
int a(void *);
int main(int argc, char *argv[])
{
return call(a, argv);
}
call.h
static inline int call(int(*f)(void*), void *v)
{
return f(v);
}
alias.c
int a(void *v) __attribute__((weak)) __attribute__((__alias__("b")));
int b(void *v)
{
return 8 + (unsigned long)v;
}
But this too was not the magic bullet.
Any suggestions?
I have not been able to come up with anything yet. I tried taking the kernel code almost verbatim (minus the actual content of functions) and that did not work.
I did notice that with Jason's example, there is no symbol for the weak alias in the symbol table but there is with the kernel:
$ clang -flto -fsanitize=cfi -fsanitize-cfi-cross-dso -fno-sanitize-cfi-canonical-jump-tables -fno-sanitize-trap=cfi -fno-sanitize-blacklist -fuse-ld=lld -fvisibility=hidden -o test main.c alias.c
$ llvm-readelf -s test &| rg " [a|b](.cfi_jt)*\$"
4519: 0000000000057034 24 FUNC LOCAL DEFAULT 14 b
4525: 000000000005704c 4 FUNC LOCAL DEFAULT 14 b.cfi_jt
$ llvm-readelf -s $CBL_SRC/linux/vmlinux &| rg " blake2s_compress"
753424: ffff80000980a310 8 FUNC LOCAL DEFAULT 2 blake2s_compress_generic.cfi_jt
801095: ffff8000087b0bec 5196 FUNC LOCAL HIDDEN 2 blake2s_compress
801098: ffff8000087b0bec 5196 FUNC LOCAL HIDDEN 2 blake2s_compress_generic
I'll have to see if I can get the reproducer to have that same behavior, as that will likely trigger the bug.
minus the actual content of functions
I wonder if this is related somehow. If the functions are too small or only called from one place, shouldn't LTO in theory attempt to inline it? On the other hand, I'd expect the presence of an indirect call to prevent the inlining?
I wonder if this is related somehow.
It likely is.
We do limit the amount of inlining that happens across translation units with LTO in the main Makefile
: https://git.kernel.org/linus/22d429e75f24d114d99223389d6ba7047e952e32
Unfortunately, removing that does nothing.
I will see if I can just extract all parts of the call chain tomorrow.
I took a whack at trying to get all pieces of the call chain into a standalone reproducer but I got lost trying to detangle drivers/char/random.c
. I tried playing around with the simpler reproducer but could not get it to reproduce the same issue. This issue likely only shows up with multiple translation units and certain function sizes that affects inlining.
For what it's worth, if maintainability of this workaround is a concern, my workaround patch is way simpler, which would make it easier to revert. Additionally, it appears that Sami is working on a kernel-specific CFI implementation, which presumably will not have this issue, but I assume there is not a timeline for when that will be ready.
my workaround patch is way simpler
But it introduces an additional jump in the non LTO case, I think. So I'll move ahead with my patch, and hopefully "5.17-rc1" is a good enough bookmark for this whenever the LTO people want to try to fix it.
Low priority?
Low priority?
I removed the "low priority" from the label.
Can we close this now that upstream no longer uses Clang CFI?
Yes, I think it can be.
Initially reported by CI on
android-mainline
: https://github.com/ClangBuiltLinux/continuous-integration2/runs/4824153007?check_suite_focus=trueThis happens after commit 9f9eff85a008 ("random: use BLAKE2s instead of SHA1 in extraction") (and I assume its prerequisite, commit 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in"), has something to do with this as well).
It does not happen with ThinLTO, which is peculiar (and explains how our CI did not see it in the regular
mainline
builds, as we test CFI with ThinLTO).blake2s_update()
is a wrapper for__blake2s_update()
, which callsblake2s_compress()
as the fourth parameter, which should be of typeblake2s_compress_t
.lib/crypto/blake2s.c
:lib/crypto/blake2s-generic.c
include/crypto/internal/blake2s.h
:I tried changing the fourth parameter of
blake2s_compress_t
fromu32 inc
toconst u32 inc
but that did not make a difference:Could it be something going wrong with the
__weak __alias(blake2s_compress_generic)
part ofblake2s_compress()
?cc @samitolvanen