dynup / kpatch

kpatch - live kernel patching
GNU General Public License v2.0

HAVE_RELIABLE_STACKTRACE vs stack unwinder in 4.15.x #799

Closed: KernelKyupCom closed this issue 1 year ago

KernelKyupCom commented 6 years ago

Hi there. It seems that the only way to get usable live patching on x86 with 4.15.x is to have UNWINDER_FRAME_POINTER, since HAVE_RELIABLE_STACKTRACE is selected by: X86 [=y] && X86_64 [=y] && UNWINDER_FRAME_POINTER [=y] && STACK_VALIDATION [=y]

Within my current production config I have the following setup:

CONFIG_X86_64=y
CONFIG_STACK_VALIDATION=y
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set

For this reason there is no CONFIG_HAVE_RELIABLE_STACKTRACE. Building a livepatch for that kernel generates a livepatch-style module (not an old-style kpatch one). Trying to load it results in a refusal with the following reason:

livepatch: This architecture doesn't have support for the livepatch consistency model.

which comes from the check in kernel/livepatch/core.c:839.

KernelKyupCom commented 6 years ago

From what I've tried, it seems that the only way to compile and load the livepatch is to forcibly set patch->immediate. What could the side effects be? This way we are skipping the consistency model, but would it work in the old kpatch manner?

jpoimboe commented 6 years ago

Yes, unfortunately ORC still doesn't support the consistency model. That support will be coming soon. In the meantime, frame pointers are recommended.

If you have to use ORC, then setting patch->immediate is the only way to do it. Then you have to be more careful with the patch analysis, since old functions can coexist with new functions, even on the same task. (But for most patches, that's not a problem.)

KernelKyupCom commented 6 years ago

I've changed my config, but currently I have a bunch of servers running the already-compiled ORC kernel, and it would be desirable to be able to livepatch them somehow. Isn't it a good idea to have this defined by kpatch-build for x86 when there is no CONFIG_HAVE_RELIABLE_STACKTRACE, similarly to the powerpc #ifdef?

In the last days I had other urgent tasks, so I've been away from the livepatching issues for a while, but I am turning back to them now. I'm still fighting the other problem: patching functions containing static variables. It turns out that from build to build the symbol names (specifically the static-variable suffixes) can change. This causes a mismatch of the symbols when looking for locals. I am still trying to understand how we never get such differences between two sequential builds of the original and patched kernels, yet there are differences between the production and original/patched symbols.

I've double-checked the toolchain version, there is no difference, and the environment is set the same way when building the production kernel and the livepatch. Even building the kernel and immediately building a livepatch for it can result in such a difference; in this case the environment is definitely the same... I've tried to hack around it using objcopy --redefine-syms on vmlinux and the patched objects before calling create-diff-object, but then I fail on .bss, which I haven't been able to look at so far. Do you think that correlating production and original symbols before we look up the locals would resolve it?

jpoimboe commented 6 years ago

Isn't it a good idea to have this defined by kpatch-build for x86 when there is no CONFIG_HAVE_RELIABLE_STACKTRACE, similarly to the powerpc #ifdef?

Yeah, that's probably a good idea.

I've double-checked the toolchain version, there is no difference, and the environment is set the same way when building the production kernel and the livepatch. Even building the kernel and immediately building a livepatch for it can result in such a difference; in this case the environment is definitely the same... I've tried to hack around it using objcopy --redefine-syms on vmlinux and the patched objects before calling create-diff-object, but then I fail on .bss, which I haven't been able to look at so far. Do you think that correlating production and original symbols before we look up the locals would resolve it?

No. If there's a mismatch in the symbols between the production compiled kernel and the kpatch original compiled kernel, this violates some basic kpatch assumptions. So it wouldn't be wise to hack a fix together without first understanding why there's a mismatch.

As I suggested in my email, save the "production" ip_tables.o file somewhere, and then after you do kpatch-build with -d, save the ~/.kpatch/tmp/orig/.../ip_tables.o file. Copy them somewhere or attach them to this issue so I can compare them.

KernelKyupCom commented 6 years ago

Here they are. I am attaching the diff with all the changes I have on the source code, plus I added "-x" to the kpatch-build script, so the build log contains the executed script commands too. The archive also contains the production, original, and patched ip_tables.o files. kpatch-debug.zip

jpoimboe commented 6 years ago

There seems to be some mismatch, either in your compiler, config, or kernel source.

cleanup_match() in production ip_tables.o:

0000000000000020 <cleanup_match>:
      20:       e8 00 00 00 00          callq  25 <cleanup_match+0x5>
                        21: R_X86_64_PC32       __fentry__-0x4
      25:       55                      push   %rbp
      26:       48 83 c7 20             add    $0x20,%rdi
      2a:       48 89 e5                mov    %rsp,%rbp
      2d:       48 83 ec 20             sub    $0x20,%rsp
      31:       48 8b 47 e8             mov    -0x18(%rdi),%rax
      35:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
      39:       48 89 7d f0             mov    %rdi,-0x10(%rbp)
      3d:       c6 45 f8 02             movb   $0x2,-0x8(%rbp)
      41:       48 8b 50 40             mov    0x40(%rax),%rdx
      45:       48 89 45 e8             mov    %rax,-0x18(%rbp)
      49:       48 85 d2                test   %rdx,%rdx
      4c:       74 0a                   je     58 <cleanup_match+0x38>
      4e:       48 8d 7d e0             lea    -0x20(%rbp),%rdi
      52:       ff d2                   callq  *%rdx
      54:       48 8b 45 e8             mov    -0x18(%rbp),%rax
      58:       48 8b 78 58             mov    0x58(%rax),%rdi
      5c:       e8 00 00 00 00          callq  61 <cleanup_match+0x41>
                        5d: R_X86_64_PC32       module_put-0x4
      61:       c9                      leaveq
      62:       c3                      retq
      63:       0f 1f 00                nopl   (%rax)
      66:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
      6d:       00 00 00

cleanup_match() in orig ip_tables.o:

0000000000000000 <cleanup_match>:
   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
                        1: R_X86_64_NONE        __fentry__-0x4
   5:   55                      push   %rbp
   6:   48 83 c7 20             add    $0x20,%rdi
   a:   48 89 e5                mov    %rsp,%rbp
   d:   48 83 ec 20             sub    $0x20,%rsp
  11:   48 8b 47 e8             mov    -0x18(%rdi),%rax
  15:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
  19:   48 89 7d f0             mov    %rdi,-0x10(%rbp)
  1d:   c6 45 f8 02             movb   $0x2,-0x8(%rbp)
  21:   48 8b 50 40             mov    0x40(%rax),%rdx
  25:   48 89 45 e8             mov    %rax,-0x18(%rbp)
  29:   48 85 d2                test   %rdx,%rdx
  2c:   74 0d                   je     3b <cleanup_match+0x3b>
  2e:   48 8d 7d e0             lea    -0x20(%rbp),%rdi
  32:   e8 00 00 00 00          callq  37 <cleanup_match+0x37>
                        33: R_X86_64_PC32       __x86_indirect_thunk_rdx-0x4
  37:   48 8b 45 e8             mov    -0x18(%rbp),%rax
  3b:   48 8b 78 58             mov    0x58(%rax),%rdi
  3f:   e8 00 00 00 00          callq  44 <cleanup_match+0x44>
                        40: R_X86_64_PC32       module_put-0x4
  44:   c9                      leaveq
  45:   c3                      retq

Notice that the production .o has an indirect call (without a retpoline):

      52:       ff d2                   callq  *%rdx

While the orig .o has a retpoline call instead:

  32:   e8 00 00 00 00          callq  37 <cleanup_match+0x37>
                        33: R_X86_64_PC32       __x86_indirect_thunk_rdx-0x4

So the production kernel was either built without CONFIG_RETPOLINE or without a version of GCC which supports retpolines, whereas the kernel built by kpatch-build has both.

KernelKyupCom commented 6 years ago

Hi, Josh. That's a good catch, and I was hoping it would be over, but both have been built with the same gcc, and the .config has CONFIG_RETPOLINE set. Unfortunately, it seems something worse than the config or build environment is wrong:

root@kernighan:/home/*****/releases/linux-4.15.9-test_test1 # uname -a
Linux kernighan 4.15.9-test_test1 #36 SMP Wed Mar 14 09:37:18 EET 2018 x86_64 x86_64 x86_64 GNU/Linux
root@kernighan:/home/*****/releases/linux-4.15.9-test_test1 # zcat /proc/config.gz | grep RETPOL
CONFIG_RETPOLINE=y
root@kernighan:/home/*****/releases/linux-4.15.9-test_test1 # cat .config | grep RETPOL
CONFIG_RETPOLINE=y
root@kernighan:/home/*****/releases # readelf -p .comment /home/*****/releases/vmlinux-4.15.9-test_test1 | grep -o 'GCC:.*'
GCC: (GNU) 7.3.0
root@kernighan:~/ssd/linux-4.15.y # readelf -p .comment net/ipv4/netfilter/ip_tables.o | grep -o 'GCC:.*'
GCC: (GNU) 7.3.0
root@kernighan:~/.kpatch/tmp # find . -name ip_tables.o
./patched/net/ipv4/netfilter/ip_tables.o
./orig/net/ipv4/netfilter/ip_tables.o
root@kernighan:~/.kpatch/tmp # readelf -p .comment ./patched/net/ipv4/netfilter/ip_tables.o | grep -o 'GCC:.*'
GCC: (GNU) 7.3.0
root@kernighan:~/.kpatch/tmp # readelf -p .comment ./orig/net/ipv4/netfilter/ip_tables.o | grep -o 'GCC:.*'
GCC: (GNU) 7.3.0
root@kernighan:~/.kpatch/tmp #

Today I am going to debug the compiler options used when building the production kernel and the livepatch. Any ideas are welcome.

BR, Angel

jpoimboe commented 6 years ago

Just an idea, but your GCC version is unusually vague. I guess it was built directly from source? You might want to double check that there isn't a mismatch there.

Most distro versions of GCC are much more specific, like gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC).

KernelKyupCom commented 6 years ago

Yup, I built it on my own a few days after it was released, since it wasn't available in the package repository yet. I did it because of the retpoline support. I have no other 7.3, and I built it only once, so there's no chance of mismatched toolchain versions.

jpoimboe commented 6 years ago

Ah, ok. I'm out of ideas. It still seems like something was weird with your production build, since the function doesn't use a retpoline. You might want to double check, but it looks like your production kernels are still vulnerable to Spectre v2.

joe-lawrence commented 1 year ago

@KernelKyupCom: there has been no activity on this for a long time, so I'm closing this issue. If you still have problems with recent releases, feel free to reopen.