Closed stergiotis closed 5 years ago
Files in Zip: issue364.zip
A bit more context:
In the original program where the bug first surfaced the code
pointer reproducible became inconsistent:
Running in valgrind:
1 ==14008== Invalid read of size 8
2 ==14008== at 0x13A495: S_call_help (schlib.c:218)
3 ==14008== by 0x7AF60BB: ???
4 ==14008== by 0x11BB435C: ???
5 ==14008== by 0x36CABF: ???
6 ==14008== by 0x10C88257: ???
7 ==14008== by 0x10C8828F: ???
8 ==14008== by 0xFFEFFFB7F: ???
9 ==14008== Address 0xfffffffffffffff0 is not stack'd, malloc'd or (recently) free'd
In schlib.c
the code
variable has a value of 0xffffffffffffffed
.
I tried to setup a guardian in order to see if the locked callable gets destroyed and checked the guardian after every invocation of the callable. The guardian always returned #f
.
According to my git bisect
investigation the problem was introduced by merging the pull request #213 into master.
The function S_call_help
in c/schlib.c
has indeed indeed been modified by this commit.
Maybe this of interest to @dybvig and @mflatt.
I think the problem may be on the very last expression of "x86_64.ss". The ,%rbx ,%rbp ,%r12 ,%r13 ,%r14 ,%r15
listing of live registers got lost compared to the old code. Maybe that last expression should be
(asm-c-return ,null-info ,%rbx ,%rbp ,%r12 ,%r13 ,%r14 ,%r15 ,result-regs ...)
If that's right, the last expression of "x86.ss" has the same kind of problem, but it looks like the register list was correctly preserved in other backends.
Edit: The register list above seems wrong for Windows...
Thank you very much for looking into this!
S_call_help
with the invalid address).I tested the full program with older versions of Chez. Even Chez from commit 1d438978dfedef9a47839b3ee302196aa5d8eda1 does reproducible crash the program. Depending on the revision it may or may-not be reproducible using the minimal example above. For 1d438978dfedef9a47839b3ee302196aa5d8eda1 the minimal example crashes too.
My findings today:
(#%$enable-check-heap #t)
);U(x)
,V(x)
,AC0(x)
,...,VIRTREG(x,i)
) didn't bring any insights: The regular and the crashing call do look very similar;CP(x)
(i.e. > 0xffffffffffff) does not necessarily lead to a crash;CP(x)
is independent of MAXPTR
in gc.c;(collect)
from the callable -- in every invocation -- does fully mitigate the problem;(lock-object! callable)
;(object-locked? callable)
is #t
after the crash;Hello Chez-Community!
I am quite stuck with Issue 364. The seamless and efficient FFI interface is one of the many strong points of Chez Scheme. Unfortunately, this bug or, as it may very well just be, my lack of understanding of the invariants of Chez' FFI system, does hinder my progress on a project which I am working on.
I offer a total of 1000 US Dollar bug bounty to anyone resolving Chez Scheme Issue 364 https://github.com/cisco/ChezScheme/issues/364
The issue must be solved for the TA6LE configuration of Chez.
I would be happy to answer any questions on this either publicly on Github or off-the-records by eMail.
scheme --program minimal.scm will crash while scheme --script minimal.scm won't
Have you managed to make a minimal testcase that crashes even with the #365 fix? That would help a lot.
Hi @kalvdans
Thank you very much for your input. The following program crashes equivalently to the real application and to the first minimal example w/o #365.
Folly: I used commit aedf6079e9c464ceb14f1939f8437df3eb2cf5c7
from https://github.com/facebook/folly
GCC: g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
#!/bin/sh
rm -f libloop.so
rm -f *.o
CFLAGS=`pkg-config --cflags libfolly`
LDFLAGS=`pkg-config --libs --static libfolly`
g++ -Wall -Wextra $CFLAGS -fPIC -c libloop.cpp
g++ -Wl,-soname,libloop.so -shared -fPIC -o libloop.so libloop.o $LDFLAGS
#valgrind scheme --program ./minimal.scm
scheme --program ./minimal.scm
#include <folly/dynamic.h>
#include <folly/json.h>
typedef int (func_t)(int i,void* dummy);
extern "C" {
void looper(func_t f) {
int i=0;
while(true) {
std::stringstream ss;
ss << "{\"hello world!\": " << i << "}";
auto r = folly::parseJson(ss.str());
f(i++,static_cast<void*>(&r));
}
}
}
(import (chezscheme))
(define i 0)
(define-ftype mystruct_t (struct (a unsigned-64) (b unsigned-64)))
(define handler-crash (foreign-callable (lambda (n dummy)
(pretty-print (list 'i: i 'n: n))
(set! i (fx1+ i))
#t) (int (* mystruct_t)) boolean))
(collect-notify #t)
(collect-trip-bytes (fx/ (collect-trip-bytes) 8))
(lock-object handler-crash)
(load-shared-object "./libloop.so")
(define aa #f)
(let ((looper (foreign-procedure "looper" (void*) void)))
(set! aa (foreign-callable-entry-point handler-crash))
(looper aa))
Output in Debian 9:
(i: 1786 n: 1786)
(i: 1787 n: 1787)
(i: 1788 n: 1788)
(i: 1789 n: 1789)
(i: 1790 n: 1790)
(i: 1791 n: 1791)
(i: 1792 n: 1792)
==11193== Invalid read of size 8
==11193== at 0x13A495: S_call_help (schlib.c:218)
==11193== by 0x8B08CC1: ???
==11193== by 0x7B0EDAC: ???
==11193== by 0x36CABF: ???
==11193== by 0x7D9DB1F: ???
==11193== by 0x7B7A3FA: ???
==11193== by 0xFFEFFFE9F: ???
==11193== by 0x7B0FC47: ???
==11193== by 0x700: ???
==11193== by 0xFFEFFFC8F: ???
==11193== by 0xFFEFFFE5F: ???
==11193== by 0xFFEFFFC1F: ???
==11193== Address 0x15ffffffd8 is not stack'd, malloc'd or (recently) free'd
Thanks for the new test case, unfortunately I wasn't able to reproduce the crash. I built folly with cmake -D CMAKE_POSITION_INDEPENDENT_CODE=ON
and using g++ (Debian 8.2.0-9) 8.2.0. Size of libloop.so was 6015760 bytes for me. Used non-threaded chez a6le.
@kalvdans
Thanks for looking into this. libloop.so
is only 4493544 Bytes on my host. I attached the output of cmake -D CMAKE_POSITION_INDEPENDENT_CODE=ON ..
on my host.
Note that I am using the threaded version of Chez. In the minimal example mark 2 this seems to be important (i.e. the crash is much later or not there if using a6le instead of ta6le Chez).
Does this diff with respect to 1d43897 help? (I can turn this into another patch for the current version, but this is what I have at the moment.)
diff --git a/s/cpnanopass.ss b/s/cpnanopass.ss
index e0e4354..ed44c3f 100644
--- a/s/cpnanopass.ss
+++ b/s/cpnanopass.ss
@@ -10553,8 +10553,8 @@
(%seq
,(c-init)
,(restore-scheme-state
- (in) ; save just the required registers, e.g., %sfp
- (out %ac0 %ac1 %cp %xp %yp %ts %td scheme-args extra-regs))
+ (in %cp) ; save just the required registers, e.g., %sfp
+ (out %ac0 %ac1 %xp %yp %ts %td scheme-args extra-regs))
; need overflow check since we're effectively retroactively turning
; what was a foreign call into a Scheme non-tail call
(fcallable-overflow-check)
@@ -10569,8 +10569,8 @@
; needs to be a quote, not an immediate
(set! ,(ref-reg %ac1) (literal ,(make-info-literal #f 'object 0 0)))
,(save-scheme-state
- (in %ac0 %ac1)
- (out %cp %xp %yp %ts %td scheme-args extra-regs))
+ (in %ac0 %ac1 %cp)
+ (out %xp %yp %ts %td scheme-args extra-regs))
,(c-scall fv*
(nanopass-case (Ltype Type) result-type
[(fp-scheme-object) (lookup-c-entry Scall->ptr)]
I don't think that's really the right fix, but maybe I at least understand the problem.
When a callable is called from foreign code, then TC(CP) is supposed to reference the Scheme function that previously called the foreign code, so S_call_help
can lock that outer function. The callable stub doesn't make any claims about %cp
, though. If %cp
does end up getting used, then it may also get saved into TC at some point, overwriting the TC(CP) that is supposed to be preserved. For example, if an argument to the callable has to be represented a bignum, then some allocation is involved, and it seems that can trigger a GC that triggers use of %cp
that gets stashed for a while in TC(CP).
I'll work on making the current callable stub explicitly preserve TC(CP), and we'll see if that fixes the problem that you see.
Meanwhile, here's a simpler test that seems to demonstrate the problem reliably on x86_64 platforms with and without threads.
Compile as "cb.so":
/* Be sure to compile with -O2, since the intent of the loop is to
convince the C compiler to store something in the same register
used for CP and for that value to not be a good pointer. */
void call_many_times(void (*f)(long))
{
long a = 1, b = 3, c = 5, d = 7;
long e = 1, g = 3, h = 5, i = 7;
long j = 1, k = 3, l = 5, m = 7;
for (int i = 0; i < 10000000; i++) {
f(0x7000000000000000|(a+e+j)); /* argument is a bignum */
a = b; b = c; c = d; d = e;
e = g; g = h; h = i; i = j;
j = k+2; k = l+2; l = m+2; m = m+2;
}
}
Run in the same directory:
(load-shared-object "./cb.so")
(define call_many_times (foreign-procedure "call_many_times" (void*) void))
(define work
(let ([v 0])
(lambda (n)
;; This loop needs to be mostly non-allocating, but
;; causes varying numbers of ticks to be used up.
(let loop ([n n])
(unless (zero? n)
(set! v (add1 v))
(loop (bitwise-arithmetic-shift-right n 1)))))))
(define handler (foreign-callable work (long) void))
(lock-object handler)
(call_many_times (foreign-callable-entry-point handler))
For now, I've become convinced that the patch above is the right idea, and I've updated #365.
Looks like you, @mflatt, are gonna to claim the bounty! I can confirm that #365 seems to cure the issue (will let the data processing application run during the night). I am very thankful!
Your explanation does indeed explain many of my observations. I recently had a look at the minimal example using rr. I used watch points to find out where the poisonous value was coming from. Surprisingly is was coming directly from the RAX register which means that the code
variable (declared volatile
) was implicitly modified by/during the jump).
Disassembly of the context in S_call_help (added some fprintfs). The line =>
corresponds to CP(tc) = code
in schlib.c
:
0x000055f31b5a7474 <+356>: callq 0x55f31b59a820 <S_error_abort>
0x000055f31b5a7479 <+361>: mov 0xc(%rsp),%eax
0x000055f31b5a747d <+365>: test %eax,%eax
0x000055f31b5a747f <+367>: je 0x55f31b5a74ad <S_call_help+413>
0x000055f31b5a7481 <+369>: mov 0x18(%rsp),%rax
0x000055f31b5a7486 <+374>: cmpq $0x1,0x30(%rax)
0x000055f31b5a748b <+379>: je 0x55f31b5a74ad <S_call_help+413>
0x000055f31b5a748d <+381>: mov 0x18(%rsp),%rax
0x000055f31b5a7492 <+386>: lea 0x1db9f(%rip),%rsi # 0x55f31b5c5038
0x000055f31b5a7499 <+393>: lea 0x1b8b5(%rip),%rdi # 0x55f31b5c2d55
0x000055f31b5a74a0 <+400>: mov 0x30(%rax),%rdx
0x000055f31b5a74a4 <+404>: shl $0x3,%rdx
0x000055f31b5a74a8 <+408>: callq 0x55f31b59ad70 <S_error1>
0x000055f31b5a74ad <+413>: mov 0x10(%rsp),%rdx
0x000055f31b5a74b2 <+418>: mov 0x2281e7(%rip),%rdi # 0x55f31b7cf6a0 <stderr@@GLIBC_2.2.5>
0x000055f31b5a74b9 <+425>: lea 0x1c558(%rip),%rsi # 0x55f31b5c3a18
0x000055f31b5a74c0 <+432>: xor %eax,%eax
0x000055f31b5a74c2 <+434>: callq 0x55f31b57cbf0 <fprintf@plt>
=> 0x000055f31b5a74c7 <+439>: mov 0x18(%rsp),%rax
0x000055f31b5a74cc <+444>: mov 0x10(%rsp),%rdx
0x000055f31b5a74d1 <+449>: mov %rdx,0x40(%rax)
0x000055f31b5a74d5 <+453>: add $0x20,%rsp
0x000055f31b5a74d9 <+457>: pop %rbx
0x000055f31b5a74da <+458>: pop %rbp
0x000055f31b5a74db <+459>: pop %r12
If the repair holds up to scrutiny, there's no need for bounty as far as I am concerned. But if you feel strongly about sending money somewhere, you can send it to the Racket project at the Software Freedom Conservancy: http://racket-lang.org/sfc.html
150 GB of processed data later it is safe to say that @mflatt did a marvelous job. Excellent work, the FFI works flawless now. A big thank to anybody who helped solving and/or investigated this issue.
Stergiotis Ltd. donated 800 USD (500 USD for the fix and 300 USD for the helpful explanations) to the Racket project.
I confirm that Software Freedom Conservancy received today a kind donation from @stergiotis earmarked for our Racket project.
I encountered an invalid memory reference when calling a foreign-callable multiple times from the foreign code. Strangely the bug is only present when a typed ftype-pointer argument is passed to the foreign-callable and not present when only the address (i.e. void*) is passed.
Below a reproducible minimal example:
minimal.scm
libloop.c
run.sh
Beware: Slight alternations of the code (e.g. replacing
alias
by adefine
) seem to mitigate or delay the heap corruption. I am running Chez from the master branch (commita5db479b68b74dda9f62665c44cfad2b1baf322e
) under 64 bit Linux.I would greatly appreciate any insight/help or idea what creates havoc when the GC runs...