stergiotis commented 5 years ago

I encountered an invalid memory reference when calling a foreign-callable multiple times from the foreign code. Strangely the bug is only present when a typed ftype-pointer argument is passed to the foreign-callable and not present when only the address (i.e. void*) is passed.

Below a reproducible minimal example:

minimal.scm

(import (chezscheme))

(define i 0)
(define myguardian (make-guardian))
(define-ftype mystruct_t (struct (a unsigned-64) (b unsigned-64)))
(define handler-crash (foreign-callable (lambda (dummy n)
                                    (set! i (fx1+ i))
                                    (pretty-print i)
                                    #t) ((* mystruct_t) unsigned-64) boolean))
(define handler-nocrash1 (foreign-callable (lambda (dummy n)
                                    (set! i (fx1+ i))
                                    (pretty-print dummy)
                                    (pretty-print i)
                                    #t) ((* mystruct_t) unsigned-64) boolean))
(define handler-nocrash2 (foreign-callable (lambda (dummy n)
                                    (set! i (fx1+ i))
                                    (pretty-print i)
                                    #t) (void* unsigned-64) boolean))

(alias handler handler-crash)
;(alias handler handler-nocrash1)
;(alias handler handler-nocrash2)
;(define handler handler-crash) ;; --> no crash

(collect-notify #t)
(collect-trip-bytes (fx/ (collect-trip-bytes) 8))

(lock-object handler)
(load-shared-object "./libloop.so")
(define aa #f)
(let ((looper (foreign-procedure "looper" (void*) void)))
  (set! aa (foreign-callable-entry-point handler))
  (looper aa))

libloop.c

typedef int (func_t)(int i,int dummy);
void looper(func_t f) {
  int i=0;
  while(f(i++,0)) {
    ;
  }
}

run.sh

#!/bin/sh
gcc -c libloop.c
gcc -Wl,-soname,libloop.so -shared -fPIC -o libloop.so libloop.o

scheme --program ./minimal.scm

Beware: Slight alternations of the code (e.g. replacing alias by a define) seem to mitigate or delay the heap corruption. I am running Chez from the master branch (commit a5db479b68b74dda9f62665c44cfad2b1baf322e) under 64 bit Linux.

I would greatly appreciate any insight/help or idea what creates havoc when the GC runs...

stergiotis commented 5 years ago

Files in Zip: issue364.zip

stergiotis commented 5 years ago

A bit more context: In the original program where the bug first surfaced the code pointer reproducible became inconsistent:

Running in valgrind:

  1 ==14008== Invalid read of size 8                                                                                                  
  2 ==14008==    at 0x13A495: S_call_help (schlib.c:218)                                                                              
  3 ==14008==    by 0x7AF60BB: ???                                                                                                    
  4 ==14008==    by 0x11BB435C: ???                                                                                                   
  5 ==14008==    by 0x36CABF: ???                                                                                                     
  6 ==14008==    by 0x10C88257: ???                                                                                                   
  7 ==14008==    by 0x10C8828F: ???                                                                                                   
  8 ==14008==    by 0xFFEFFFB7F: ???                                                                                                  
  9 ==14008==  Address 0xfffffffffffffff0 is not stack'd, malloc'd or (recently) free'd

In schlib.c the code variable has a value of 0xffffffffffffffed.

I tried to setup a guardian in order to see if the locked callable gets destroyed and checked the guardian after every invocation of the callable. The guardian always returned #f.

stergiotis commented 5 years ago

According to my git bisect investigation the problem was introduced by merging the pull request #213 into master. The function S_call_help in c/schlib.c has indeed indeed been modified by this commit. Maybe this of interest to @dybvig and @mflatt.

mflatt commented 5 years ago

I think the problem may be on the very last expression of "x86_64.ss". The ,%rbx ,%rbp ,%r12 ,%r13 ,%r14 ,%r15 listing of live registers got lost compared to the old code. Maybe that last expression should be

(asm-c-return ,null-info ,%rbx ,%rbp ,%r12 ,%r13 ,%r14 ,%r15 ,result-regs ...)

If that's right, the last expression of "x86.ss" has the same kind of problem, but it looks like the register list was correctly preserved in other backends.

Edit: The register list above seems wrong for Windows...

stergiotis commented 5 years ago

Thank you very much for looking into this!

365 seems to be a step in the right direction as the supplied minimal example does not seem to crash anymore. The real application, which passes a pointer to a struct to the callable, does however still crash (still `S_call_help` with the invalid address).

stergiotis commented 5 years ago

I tested the full program with older versions of Chez. Even Chez from commit 1d438978dfedef9a47839b3ee302196aa5d8eda1 does reproducible crash the program. Depending on the revision it may or may-not be reproducible using the minimal example above. For 1d438978dfedef9a47839b3ee302196aa5d8eda1 the minimal example crashes too.

stergiotis commented 5 years ago

My findings today:

The heap does not seem to be (globally) corrupted (according to (#%$enable-check-heap #t));
Outputting all thread-context/stack data (e.g. U(x),V(x),AC0(x),...,VIRTREG(x,i)) didn't bring any insights: The regular and the crashing call do look very similar;
Oddly an unusual high value in CP(x) (i.e. > 0xffffffffffff) does not necessarily lead to a crash;
The value of CP(x) is independent of MAXPTR in gc.c;
Calling (collect) from the callable -- in every invocation -- does fully mitigate the problem;
The problem occurs independent of how many times the callable has been locked using (lock-object! callable);
(object-locked? callable) is #t after the crash;
365 does not fix the issue.

stergiotis commented 5 years ago

Hello Chez-Community!

I am quite stuck with Issue 364. The seamless and efficient FFI interface is one of the many strong points of Chez Scheme. Unfortunately, this bug or, as it may very well just be, my lack of understanding of the invariants of Chez' FFI system, does hinder my progress on a project which I am working on.

I offer a total of 1000 US Dollar bug bounty to anyone resolving Chez Scheme Issue 364 https://github.com/cisco/ChezScheme/issues/364

500 USD for a pull-request that passes the continuous-integration/travis-ci/pr checks and solves the issue for the given minimal example and for my real-world program (which cannot be shared).
300 USD for explaining the cause and the resolution in english and making it publicly available (e.g. in github issue or in this user group).
200 USD for explaining the applied debugging method which lead to the resolution of the bug.

The issue must be solved for the TA6LE configuration of Chez.

I would be happy to answer any questions on this either publicly on Github or off-the-records by eMail.

ChaosEternal commented 5 years ago

scheme --program minimal.scm will crash while scheme --script minimal.scm won't

kalvdans commented 5 years ago

Have you managed to make a minimal testcase that crashes even with the #365 fix? That would help a lot.

stergiotis commented 5 years ago

Hi @kalvdans

Thank you very much for your input. The following program crashes equivalently to the real application and to the first minimal example w/o #365.

Folly: I used commit aedf6079e9c464ceb14f1939f8437df3eb2cf5c7 from https://github.com/facebook/folly GCC: g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

#!/bin/sh
rm -f libloop.so
rm -f *.o
CFLAGS=`pkg-config --cflags libfolly`
LDFLAGS=`pkg-config --libs --static libfolly`
g++ -Wall -Wextra $CFLAGS -fPIC -c libloop.cpp
g++ -Wl,-soname,libloop.so -shared -fPIC -o libloop.so libloop.o $LDFLAGS

#valgrind scheme --program ./minimal.scm
scheme --program ./minimal.scm

#include <folly/dynamic.h>
#include <folly/json.h>

typedef int (func_t)(int i,void* dummy);
extern "C" {
  void looper(func_t f) {
    int i=0;
    while(true) {
      std::stringstream ss;
      ss << "{\"hello world!\": " << i << "}";
      auto r = folly::parseJson(ss.str());
      f(i++,static_cast<void*>(&r));
    }
  }
}

(import (chezscheme))

(define i 0)
(define-ftype mystruct_t (struct (a unsigned-64) (b unsigned-64)))
(define handler-crash (foreign-callable (lambda (n dummy)
                                    (pretty-print (list 'i: i 'n: n))
                                    (set! i (fx1+ i))
                                    #t) (int (* mystruct_t)) boolean))

(collect-notify #t)
(collect-trip-bytes (fx/ (collect-trip-bytes) 8))

(lock-object handler-crash)
(load-shared-object "./libloop.so")
(define aa #f)
(let ((looper (foreign-procedure "looper" (void*) void)))
  (set! aa (foreign-callable-entry-point handler-crash))
  (looper aa))

Output in Debian 9:

(i: 1786 n: 1786)
(i: 1787 n: 1787)
(i: 1788 n: 1788)
(i: 1789 n: 1789)
(i: 1790 n: 1790)
(i: 1791 n: 1791)
(i: 1792 n: 1792)
==11193== Invalid read of size 8
==11193==    at 0x13A495: S_call_help (schlib.c:218)
==11193==    by 0x8B08CC1: ???
==11193==    by 0x7B0EDAC: ???
==11193==    by 0x36CABF: ???
==11193==    by 0x7D9DB1F: ???
==11193==    by 0x7B7A3FA: ???
==11193==    by 0xFFEFFFE9F: ???
==11193==    by 0x7B0FC47: ???
==11193==    by 0x700: ???
==11193==    by 0xFFEFFFC8F: ???
==11193==    by 0xFFEFFFE5F: ???
==11193==    by 0xFFEFFFC1F: ???
==11193==  Address 0x15ffffffd8 is not stack'd, malloc'd or (recently) free'd

kalvdans commented 5 years ago

Thanks for the new test case, unfortunately I wasn't able to reproduce the crash. I built folly with cmake -D CMAKE_POSITION_INDEPENDENT_CODE=ON and using g++ (Debian 8.2.0-9) 8.2.0. Size of libloop.so was 6015760 bytes for me. Used non-threaded chez a6le.

stergiotis commented 5 years ago

@kalvdans Thanks for looking into this. libloop.so is only 4493544 Bytes on my host. I attached the output of cmake -D CMAKE_POSITION_INDEPENDENT_CODE=ON .. on my host.

Note that I am using the threaded version of Chez. In the minimal example mark 2 this seems to be important (i.e. the crash is much later or not there if using a6le instead of ta6le Chez).

build_log.txt

mflatt commented 5 years ago

Does this diff with respect to 1d43897 help? (I can turn this into another patch for the current version, but this is what I have at the moment.)

diff --git a/s/cpnanopass.ss b/s/cpnanopass.ss
index e0e4354..ed44c3f 100644
--- a/s/cpnanopass.ss
+++ b/s/cpnanopass.ss
@@ -10553,8 +10553,8 @@
                           (%seq
                             ,(c-init)
                             ,(restore-scheme-state
-                               (in) ; save just the required registers, e.g., %sfp
-                               (out %ac0 %ac1 %cp %xp %yp %ts %td scheme-args extra-regs))
+                               (in %cp) ; save just the required registers, e.g., %sfp
+                               (out %ac0 %ac1 %xp %yp %ts %td scheme-args extra-regs))
                             ; need overflow check since we're effectively retroactively turning
                             ; what was a foreign call into a Scheme non-tail call
                             (fcallable-overflow-check)
@@ -10569,8 +10569,8 @@
                             ; needs to be a quote, not an immediate
                             (set! ,(ref-reg %ac1) (literal ,(make-info-literal #f 'object 0 0)))
                             ,(save-scheme-state
-                               (in %ac0 %ac1)
-                               (out %cp %xp %yp %ts %td scheme-args extra-regs))
+                               (in %ac0 %ac1 %cp)
+                               (out %xp %yp %ts %td scheme-args extra-regs))
                             ,(c-scall fv*
                                (nanopass-case (Ltype Type) result-type
                                  [(fp-scheme-object) (lookup-c-entry Scall->ptr)]

mflatt commented 5 years ago

I don't think that's really the right fix, but maybe I at least understand the problem.

When a callable is called from foreign code, then TC(CP) is supposed to reference the Scheme function that previously called the foreign code, so S_call_help can lock that outer function. The callable stub doesn't make any claims about %cp, though. If %cp does end up getting used, then it may also get saved into TC at some point, overwriting the TC(CP) that is supposed to be preserved. For example, if an argument to the callable has to be represented a bignum, then some allocation is involved, and it seems that can trigger a GC that triggers use of %cp that gets stashed for a while in TC(CP).

I'll work on making the current callable stub explicitly preserve TC(CP), and we'll see if that fixes the problem that you see.

Meanwhile, here's a simpler test that seems to demonstrate the problem reliably on x86_64 platforms with and without threads.

Compile as "cb.so":

/* Be sure to compile with -O2, since the intent of the loop is to
   convince the C compiler to store something in the same register
   used for CP and for that value to not be a good pointer. */

void call_many_times(void (*f)(long))
{
  long a = 1, b = 3, c = 5, d = 7;
  long e = 1, g = 3, h = 5, i = 7;  
  long j = 1, k = 3, l = 5, m = 7;

  for (int i = 0; i < 10000000; i++) {
    f(0x7000000000000000|(a+e+j)); /* argument is a bignum */
    a = b; b = c; c = d; d = e;
    e = g; g = h; h = i; i = j;
    j = k+2; k = l+2; l = m+2; m = m+2;
  }
}

Run in the same directory:

(load-shared-object "./cb.so")
(define call_many_times (foreign-procedure "call_many_times" (void*) void))

(define work
  (let ([v 0])
    (lambda (n)
      ;; This loop needs to be mostly non-allocating, but
      ;; causes varying numbers of ticks to be used up.
      (let loop ([n n])
        (unless (zero? n)
          (set! v (add1 v))
          (loop (bitwise-arithmetic-shift-right n 1)))))))

(define handler (foreign-callable work (long) void))
(lock-object handler)

(call_many_times (foreign-callable-entry-point handler))

mflatt commented 5 years ago

For now, I've become convinced that the patch above is the right idea, and I've updated #365.

stergiotis commented 5 years ago

Looks like you, @mflatt, are gonna to claim the bounty! I can confirm that #365 seems to cure the issue (will let the data processing application run during the night). I am very thankful!

Your explanation does indeed explain many of my observations. I recently had a look at the minimal example using rr. I used watch points to find out where the poisonous value was coming from. Surprisingly is was coming directly from the RAX register which means that the code variable (declared volatile) was implicitly modified by/during the jump).

Disassembly of the context in S_call_help (added some fprintfs). The line => corresponds to CP(tc) = code in schlib.c:

   0x000055f31b5a7474 <+356>:   callq  0x55f31b59a820 <S_error_abort>
   0x000055f31b5a7479 <+361>:   mov    0xc(%rsp),%eax
   0x000055f31b5a747d <+365>:   test   %eax,%eax
   0x000055f31b5a747f <+367>:   je     0x55f31b5a74ad <S_call_help+413>
   0x000055f31b5a7481 <+369>:   mov    0x18(%rsp),%rax
   0x000055f31b5a7486 <+374>:   cmpq   $0x1,0x30(%rax)
   0x000055f31b5a748b <+379>:   je     0x55f31b5a74ad <S_call_help+413>
   0x000055f31b5a748d <+381>:   mov    0x18(%rsp),%rax
   0x000055f31b5a7492 <+386>:   lea    0x1db9f(%rip),%rsi        # 0x55f31b5c5038
   0x000055f31b5a7499 <+393>:   lea    0x1b8b5(%rip),%rdi        # 0x55f31b5c2d55
   0x000055f31b5a74a0 <+400>:   mov    0x30(%rax),%rdx
   0x000055f31b5a74a4 <+404>:   shl    $0x3,%rdx
   0x000055f31b5a74a8 <+408>:   callq  0x55f31b59ad70 <S_error1>
   0x000055f31b5a74ad <+413>:   mov    0x10(%rsp),%rdx
   0x000055f31b5a74b2 <+418>:   mov    0x2281e7(%rip),%rdi        # 0x55f31b7cf6a0 <stderr@@GLIBC_2.2.5>
   0x000055f31b5a74b9 <+425>:   lea    0x1c558(%rip),%rsi        # 0x55f31b5c3a18
   0x000055f31b5a74c0 <+432>:   xor    %eax,%eax
   0x000055f31b5a74c2 <+434>:   callq  0x55f31b57cbf0 <fprintf@plt>
=> 0x000055f31b5a74c7 <+439>:   mov    0x18(%rsp),%rax
   0x000055f31b5a74cc <+444>:   mov    0x10(%rsp),%rdx
   0x000055f31b5a74d1 <+449>:   mov    %rdx,0x40(%rax)
   0x000055f31b5a74d5 <+453>:   add    $0x20,%rsp
   0x000055f31b5a74d9 <+457>:   pop    %rbx
   0x000055f31b5a74da <+458>:   pop    %rbp
   0x000055f31b5a74db <+459>:   pop    %r12

mflatt commented 5 years ago

If the repair holds up to scrutiny, there's no need for bounty as far as I am concerned. But if you feel strongly about sending money somewhere, you can send it to the Racket project at the Software Freedom Conservancy: http://racket-lang.org/sfc.html

stergiotis commented 5 years ago

150 GB of processed data later it is safe to say that @mflatt did a marvelous job. Excellent work, the FFI works flawless now. A big thank to anybody who helped solving and/or investigated this issue.

Stergiotis Ltd. donated 800 USD (500 USD for the fix and 300 USD for the helpful explanations) to the Racket project.

bkuhn commented 5 years ago

I confirm that Software Freedom Conservancy received today a kind donation from @stergiotis earmarked for our Racket project.

cisco / ChezScheme

$1000 USD Bug Bounty: Invalid memory reference: foreign-callable (gc bug?) #364

365 seems to be a step in the right direction as the supplied minimal example does not seem to crash anymore. The real application, which passes a pointer to a struct to the callable, does however still crash (still `S_call_help` with the invalid address).

365 does not fix the issue.

cisco / ChezScheme

$1000 USD Bug Bounty: Invalid memory reference: foreign-callable (gc bug?) #364

365 seems to be a step in the right direction as the supplied minimal example does not seem to crash anymore. The real application, which passes a pointer to a struct to the callable, does however still crash (still S_call_help with the invalid address).

365 does not fix the issue.

365 seems to be a step in the right direction as the supplied minimal example does not seem to crash anymore. The real application, which passes a pointer to a struct to the callable, does however still crash (still `S_call_help` with the invalid address).