foss-for-synopsys-dwc-arc-processors / linux

Helpful resources for users & developers of Linux kernel for ARC
22 stars 13 forks source link

HS58 Linux: sshd process crashes on "Invalid Read" when issuing a command using Python's Fabric package #168

Closed srusecki closed 5 months ago

srusecki commented 7 months ago

Sometimes when sending commands to Linux over SSH from Python scripts, sshd process on Linux crashes with "Invalid Read" message.

Minimal Python example:

from fabric import Connection

connection = Connection(
    host="10.10.10.10",
    user="root",
    port=22,
    connect_kwargs={"password": "mypassword"})

connection.open()

index = 0
while True:
    for _ in range(10):
        connection.run("uname", hide=True)
        index += 1
    print("Commands count:", index)

After a while of sending uname commands in a loop the sshd crashes. Linux shows the following message:

# potentially unexpected fatal signal 11.
Path: /usr/sbin/sshd
CPU: 2 PID: 22479 Comm: sshd Tainted: G           O      5.15.127 #2
Invalid Read @ 0x0000003c by insn @ 0x20228cee
  @off 0x164cee in [/usr/lib/libcrypto.so.3]  VMA: 0x200c4000 to 0x202f0000
ECR: 0x00050100 EFA: 0x0000003c ERET: 0x20228cee
STAT: 0x80080282 [IE U     ]   BTA: 0x2035fbd0
 SP: 0x5f85f520  FP: 0xbcb494e4 BLK: 0x20229542
r00: 0x400e9540 r01: 0x400e95a8 r02: 0xec61ff54
r03: 0x00000080 r04: 0x27f79973 r05: 0xd830f2ea
r06: 0x9db4cdba r07: 0x6d8e9534 r08: 0x0384c021
r09: 0x4a54ebd5 r10: 0x00000010 r11: 0xa48f14f4
r12: 0x33d6d26f r13: 0x00000000 r14: 0x5f85f588
r15: 0x202e6030 r16: 0x5f85f58c r17: 0x5d1b18c8
r18: 0x49bedbf8 r19: 0x5f85f584 r20: 0x0000001c
r21: 0x25869130 r22: 0x800001f6 r23: 0x81937f17
r24: 0x00000012 r25: 0x00000011

Knowledge so far:

This issue looks similar to https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/163 but is much easier to trigger.

jzbydniewski commented 7 months ago

Setting affinity for the sshd seems to influence this issue. With default affinity set to 0-2 issue reproduces. With affinity set to just single core (0, 1, or 2), issue is not observed. That's how I set affinity

taskset -apc <affinity> <pid>
srusecki commented 7 months ago

So far I've seen the following values in the invalid read address. They are all divisible by 4 and < 0x40: 0x04, 0x08, 0x10, 0x14, 0x1c, 0x24, 0x28, 0x2c, 0x30, 0x34, 0x38, 0x3c.

Full stack traces: crashes.txt

mroz1k commented 6 months ago

Using gdbserver I was able to grab the following information:

The invalid read exception is caused by the instruction ld.as r17,[r16,-17] with $r16 = 0x58, $r16+4*-17 = 0x14, the address is obviously invalid.

More context:

(gdb) bt
#0  sha256_block_data_order (ctx=0x400e9588, in=<optimized out>, num=<optimized out>)
    at crypto/sha/sha256.c:344
#1  0x2022962a in SHA256_Final (
    md=0x400e6500 "$\367\220\001D\237\221\321UH\274\326\177\375#\032\371\3321u\210Q\316h\310\035\376\001\315oF\215", c=0x400e9588) at include/crypto/md32_common.h:219
#2  0x2026f85a in sha256_internal_final (ctx=0x400e9588,
    out=0x400e6500 "$\367\220\001D\237\221\321UH\274\326\177\375#\032\371\3321u\210Q\316h\310\035\376\001\315oF\215", outl=0x5f8dd5a8, outsz=<optimized out>) at providers/implementations/digests/sha2_prov.c:72
#3  0x201c9c40 in BASIC_CONSTRAINTS_free (a=<optimized out>) at crypto/x509/v3_bcons.c:43
Backtrace stopped: frame did not save the PC

and the assembly:

  165292:   2044 2f10               and r16,r16,0x3c
  165296:   2415 2451               add2    r17,r20,r17
  16529a:   2055 2450               add2    r16,r16,0x11
  16529e:   11ef a616               ld.as   r22,[r17,-17]
  1652a2:   2e43 2491               ror r17,r22,0x12
  1652a6:   2e43 21d5               ror r21,r22,0x7
  1652aa:   2000 2350               add r16,r16,r13
  1652ae:   2e41 20d6               lsr r22,r22,0x3
  1652b2:   2507 2455               xor r21,r21,r17
  1652b6:   10ef a611               ld.as   r17,[r16,-17]

From the above we see that the r16 value is truncated by the "and" operation so the only way for it to become a valid pointer is the addition of r13, however gdb reports that $r13 = 0x0.

srusecki commented 6 months ago

I gathered some debug information which hopefully helps. Linux was stopped in the following place and then the sshd crash was reproduced using the example script:

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index e16649811870..137ac5f56ce7 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -179,6 +179,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
                si_code = BUS_ADRERR;
        }
        else {
+               __asm__("flag 1");
                sig = SIGSEGV;
        }

call_stack_core1.txt call_stack_core2.txt call_stack_core3.txt l2tlb_core1.txt l2tlb_core2.txt l2tlb_core3.txt smart_core1.txt smart_core2.txt smart_core3.txt

srusecki commented 6 months ago

I took a look at the ftrace log which was stopped inside do_page_fault function just before the bug occurred. There were calls to do_signal and sys_rt_sigreturn. These functions store/restore user registers, which is suspicious as we have r13 corrupted. I then looked at which regs are saved/restored and it's only r0 to r12. Inside user_regs_struct r13 is marked as callee register, which is inconsistent with the ARCv3 documentation which says that r13 should be caller-saved.

See: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/arch/arc/include/uapi/asm/ptrace.h#L34-L50

Can You comment on that?

xxkent commented 6 months ago

Hi @srusecki, thanks for your debug info and you are right. Pavel also found this issue with r13 reg save/restore and we are on the way of fixing it. We will prepare a patch soon.

srusecki commented 6 months ago

I moved the r13 to appropriate place and saved it correctly. The issue seems to be fixed. @mroz1k also created a test app that reproduces it without the use of ssh and it also works now.

Waiting for an official patch.

srusecki commented 6 months ago

Here's the diff with changes I've made to move r13 into scratch registers region: r13_scratch_diff.txt

jzbydniewski commented 6 months ago

Unfortunately this fix doesn't help for https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/163

mroz1k commented 6 months ago

And here is the simple program that reproduces the r13 corruption: r13-corruption-repro.zip, it must be compiled with -fzero-call-used-regs=all

The patch from @srusecki fixes the issue.

gmazurki commented 6 months ago

Is the patch from the other issue https://github.com/foss-for-synopsys-dwc-arc-processors/linux/files/15313249/regs.patch something official, so we can apply it to finally close the issue? Or you plan to deliver additional changes? We are planning SW release in 2 weeks, so would be good to have it applied sooner than later to have time to test it. Let us know what should we apply to close it.

xxkent commented 6 months ago

Please use this regs.patch in your flow right now. This is not official patch since it can impacts other code in other projects and we are discussing on how to prepare official patch, it won't take too much time. I will let you know about official patch.

xxkent commented 6 months ago

I checked that this patch works with gdbserver: arcv3_r13.patch

But you also need to change 'callee' to 'scratch' for r13 in case of using the patch and gdbserver. https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L188C3-L188C69 and https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L245

shahab-vahedi commented 6 months ago

I checked that this patch works with gdbserver: arcv3_r13.patch

But you also need to change 'callee' to 'scratch' for r13 in case of using the patch and gdbserver. https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L188C3-L188C69 and https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L245

That is correct. I can tuck those changes in our ARCv3 gdbserver as soon as our downstream linux has the patch applied onto its arc64 branch.

Also the user_regs_struct of the ptrace.h inside the linux-headers of arc-gnu-toolchain needs to be adjusted. Those headers are used for building glibc and providing user space headers.

xxkent commented 5 months ago

Hi @shahab-vahedi,

We have merged arcv3_r13.patch patch to arc64 branch. Could you fix also GDB? Thanks.

https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/8fbed423210e791f74fc18dbb010dbbd1234640c

shahab-vahedi commented 5 months ago

The changes are populated to

arc64 branch of binutils-gdb arc-2024.06 branch of bintuilts-gdb arc-2024.06-gdb branch of binutils-gdb master branch of arc-gnu-tool-chain

srusecki commented 5 months ago

Hello @shahab-vahedi,

I see you applied the patch on 3 branches of binutils-gdb repository. Previously we were using released binutils-gdb in our solution, tag arc-2023.09-release. Which one should we use now? Can you make a new release or we need to use a specific commit for now?

abrodkin commented 5 months ago

@srusecki we're currently finalizing work on arc-2024.06 release, which we plan to publish in late July. Given that fix will be included in arc-2024.06 I would suggest using exactly that upcoming release. Will it work for you?

srusecki commented 5 months ago

Ok, we can wait for the release.

However, can you tell which branch we can use (arc64, arc-2024.06 or arc-2024.06-gdb) to check if the issue is fixed on our side?

abrodkin commented 5 months ago

@srusecki we'll use contents of arc-2024.06 branches for the release. At the time of release build generation we'll put git tags on corresponding commits so that the state is known and fixed.

shahab-vahedi commented 5 months ago

Hello @shahab-vahedi,

I see you applied the patch on 3 branches of binutils-gdb repository. Previously we were using released binutils-gdb in our solution, tag arc-2023.09-release. Which one should we use now? Can you make a new release or we need to use a specific commit for now?

Hi @srusecki . The changes I've mentioned in binutils-gdb (and arc-gnu-toolchain) are about addressing the issue that gdb(server) reports 0 for the r13 register. See this comment from @mroz1k .

The fix regarding your original issue (SSHD crashes) are in arc64 branch of our linux repository.

If you know how to build the toolchain and a vmlinux image, then feel free to use all the arc64 branches of the tools (binutils, gcc, glibc, etc.).