Patch for 2.1? - Githubissues

jawj commented 8 years ago

Many thanks for this project – we've found it really useful.

In the README, it says: "There is a bug in version 2.1 that causes the app to hang. Working on a patch."

We tried to update one of our apps to 2.1, and we confirm the hang!

Any idea when you might have a patch ready?

Alternatively, is there any progress you could share that might help us fix this? We are going to need this fixed one way or the other in the next week or two, and any help would be appreciated.

ericwlange commented 8 years ago

Hi. I have been working on version 2.2 (or perhaps 3.0 -- it is a pretty major change). It will add some important capabilities, including JIT on ARM, which I now have working.

I haven't spent a ton of time debugging the hanging bug yet, but I can revert the changes that caused the problem. The issue I was trying to address is that you can't reliably call JS from the main thread because it has a very shallow stack. To address this problem, I added a separate thread to run JS in. Now every JS call queues its payload in this thread and re-syncs with the calling thread when complete. The advantage is twofold: (1) you can make requests from any thread without worry, including the main thread, and (2) you get thread safety for free. If you try to call the same JS context simultaneously from two threads, those calls will be serialized and do the right thing.

The only problem is, it doesn't work right! It works fine with my tests, but I am learning that in some more complex scenarios there is apparently a deadlock. The solution is probably simple, but nobody has given me a test case that I can use to debug! If you have a simple test JS that fails deterministically, I can probably fix the problem very quickly. I would appreciate if you could send something to me. PM me @ eric@flicket.tv if the code is sensitive. I won't divulge it. I just need a test case.

The other alternative is that I can walk back the code changes that "solve" the threading problems. That would mean that you would have to take care of making sure you run from outside the main thread and manage thread safety on your own.

Finally, you could try to fix it yourself if sharing is too cumbersome. The offending code is here: https://github.com/ericwlange/webkit/tree/82034b2c34d7fb11de50d7bc53de6c1aabbc574b/Source/JavaScriptCore/android. The failure happened between the 2.0 and 2.1 tags.

jawj commented 8 years ago

Thanks very much for the comprehensive reply! My colleague Ivo has made some fixes for our use-case (where threading isn't too important) and we'll see if we can contribute anything useful back. We're still having issues on API level 16 though, and I think he may chime in here at some point.

ericwlange commented 8 years ago

Yes, you will have problems with API 16, and this was what the broken code was intended to fix! The main thread's stack is only like 16k (!) deep. In API 16 and earlier it was 12k (!!). That difference is enough to cause JavaScriptCore to crash on older APIs. The workaround is to spawn a synchronous thread and make your JavaScript calls there, where the default stack is 1MB deep. That's what the threading code is supposed to do for you. You can do it explicitly.

The good news is that the next version supports back to API 9 (Gingerbread)! So there is a good deal of value coming in the next release. As for the hanging bug, I have identified two possible failure points in the threading code:

The first is that for some reason I chose to allow two worker threads. I am not sure why. It is possible, though rare, that both threads could be waiting for the other to complete. It takes some mental posturing to figure out the use cases where that could happen, so I am not convinced that is the culprit. To fix this one, you just need to change one digit in JSJSNI.h (https://github.com/ericwlange/webkit/blob/82034b2c34d7fb11de50d7bc53de6c1aabbc574b/Source/JavaScriptCore/android/JSJNI.h):

    JSContextWrapper() { dispatch_q = new DispatchQueue();
        worker_q = new DispatchQueue(2); }

Change the digit from a 2 to a 1.

The much more plausible failure point (and one of the other posts gave me the hint that it hangs in a finalize method somewhere), is when a JSContext is discarded but threads have not yet completed. This case can be fixed in DispatchQueue.cpp (https://github.com/ericwlange/webkit/blob/82034b2c34d7fb11de50d7bc53de6c1aabbc574b/Source/JavaScriptCore/android/DispatchQueue.cpp) by changing:

    while (thread_queue_length(&_queue)) {
        thread_queue_get(&_queue,NULL,&msg);
        if(msg.msgtype == DISPATCH_QUEUE_FUNCTION) {
            delete (funct*)msg.data;
        }
    }

to:

    while (thread_queue_length(&_queue)) {
        thread_queue_get(&_queue,NULL,&msg);
        if(msg.msgtype == DISPATCH_QUEUE_FUNCTION) {
            funct *f = (funct *)msg.data;
            if (f->semaphore) {
                thread_queue_add(f->semaphore, NULL, 0);
            }
            delete f;
        }
    }

in the DispatchThread::run() method.

This was based purely on studying the code, and not from any test case. If you are building AndroidJSCore directly, can you try these mods and let me know if they solve the hanging problem? I would build it myself and post the .aar, but I am knee-deep in the next version and my local repo is completely disconnected from the 2.1 codebase.

Otherwise, I am literally days away from putting out the 2.2-pre version which will include these fixes, as well as JIT, Gingerbread and pave the way to launch cooler features in the future, including browser functionality like DOM parsing and networking.

IvayloDankolov commented 8 years ago

Hello,

I'm George's colleague that's trying to get JSCore working. The last few days have been really bizarre. Let me walk you through some of the things I've discovered.

First, as a bit of background information, we use Javascript as a scripting language heavily in our app: both in controlling the UI and responding to events. It's more tightly focused than something like React Native, but still provides a lot of freedom for JS->Native interaction, sometimes going a few levels deep (configure Screen (Java) -> request user responses from DB (JS) -> run database query (Java) -> unwind the whole thing). The engine is written with the assumption that it's single-threaded, so we'd like to maintain that from the point of view of the main thread, even if it's not true in the low-level implementation.

I tried you cleanup code fix (which indeed seems like it would leave a few threads hanging if the queue is not empty), but it didn't really help, since we only use one context created upon launch. Reducing the thread count on the worker_q didn't help either.

I suspected that there might be some shenanigans going on with functionCallback, since it doesn't switch back and executes on the dispatch_q thread, which result in something weird if it tries to execute something from JSCore and gets bounced to a worker thread (though that was pure speculation at that point). I replaced worker_q with the dispatch_q stubbed out the DispatchQueue used for strings handling. The app still hung up. Sprinkling some logs to see what messages were sent and received wasn't terribly insightful.

At this point I got annoyed with webkit build times and I didn't have the Android Studio project set up to debug JNI so I decided to switch to Java and see what the minimum required changes would be to get it to stop crashing on API 16. Since we're only ever calling Javascript from the main thread I just whipped together a couple of SynchronousQueues and wrote a LockstepThread that bounces execution back and forth every time defer is called. Then I sprinkled some defers in operations like context creation:

ctx = defer(new Functor<Long>() {
  @Override
  public Long call() {
    return create();
  }
});

It quickly got rid of the crash log. The only trouble is, it hung. Logcat was completely unhelpful, showing a pretty standard:

I/JavaScriptCore: Lockstep [Main]: Defer
I/JavaScriptCore: Lockstep [Main]: Send EXECUTE_FUNC
I/JavaScriptCore: Lockstep [Background]: Receive EXECUTE_FUNC
I/JavaScriptCore: Lockstep [Background]: Defer
I/JavaScriptCore: Lockstep [Background]: Send EXECUTE_FUNC

Grabbing a thread dump shows Main still waiting for a message, not having received that second EXECUTE_FUNC, and Background waiting for a result. On one of my test phones, a Samsung Galaxy SII it actually manages to show the activity and hangs pretty reliably when I type in a text field, so I set up some conditional breakpoints to see what's going on.

The results showed nothing out of the ordinary up to the point of the hang when stepping through. Main calls callAsFunction, it bounces to the background, JSCore calls functionCallback, The EXECUTE_FUNC is created, it isn't null, and at that point I can still inspect the foregroundQueue. poll, size and even poll with timeout work properly when calling them from Android Studio. As soon as I step over the put, the queue is broken. Main is still hanging (even though that should not be possible according to how SynchronousQueue is supposed to work), and if I try to poll from the background thread the call hangs as well instead of returning null, in complete violation of the invariant.

During that there are no logs from JNI, no memory dumps or raised signals. Strangest of all, any sort of synchronisation primitive (ArrayBlockingQueue, LinkedBlockingQueue, etc.) hangs as well, even with ArrayBlockingQueue's take and put indices being something like 143 and 144, indicating that there is indeed an item. I even resorted to writing my own Exchanger just to see that I'm not going insane, and it hangs at that exact same spot too.

I could not replicate this at all in non-JNI tests, even bouncing execution back and forth a few hundred times, so I strongly suspect that native code is indeed somehow at fault here, but the logs tell me nothing and the thread dump claims only Java code is executing and only Main and the JSCore worker are ever touching the blocking queues.

I really wish I had a simple and concise use case, but I've been unable to replicate it. Calls into JS (and callbacks into Java and back to JS again) execute just fine, until they don't. And while our app hangs 100% reliably, the initialisation loads thousands of lines of JS and then does hundreds of back-and-forth calls to set notification services, database access, stub setTimeout/console.log, alerts and even a CanvasContext2D API, configure the main screen and more besides. Even if I shared all that code, there's no telling how much of it is a required setup to trigger the bug and which bits exactly are the cause. Just imitating the last code chain does not hang it.

At this point, I'm completely stumped. Any ideas on how we can debug this further?

ericwlange commented 8 years ago

Hi Ivalyo,

Thanks so much for the detailed explanation of what you've tried. This is very helpful.

The first thing I'd like to ask you to do is to start using what's on HEAD. I've updated the README.md with instructions on how to build. Long story short, I built a completely new project called hemroid (https://github.com/ericwlange/hemroid), which is a package manager for Android. It isolates the building of the JavaScriptCore library (as well as about 20-30 other libraries) so that AndroidJSCore can be focused on just the incremental value of the library, rather than my constant machinations in WebKit. My longer term goal is to wrap all of WebKit, not just JavaScriptCore, so that you get fun things like DOM parsing, networking, and HTML for free. This enables developers to start using browser tools like JQuery, for example.

Anyway, I would like to start debugging from that codebase.

Now that I have 2.2 basically working (as well as 2.1 anyway), I am prepared to focus on this particular bug. Let me dive in and see what I can make of this.

ericwlange commented 8 years ago

Hold on a minute. I think I may have found the problem. Please apply the following patch to JSObject.cpp and tell me if it has any impact on the issue. This is a pretty significant memory leak that has existed since 1.0. To identify this, I removed all of the DispatchQueue code and instead made every call synchronous by creating a pthread, executing the payload, and then re-joining the pthread. This effectively created a new thread for every call but waits for the thread to end before continuing. It is not efficient, but I thought it might help accelerate any threading problems. After about 30 or so calls, it crashed. By applying this patch, the crash is fixed. Please try it and tell me if it solves the hanging issue.

diff --git a/AndroidJSCore/AndroidJSCore/jni/JSObject.cpp b/AndroidJSCore/AndroidJSCore/jni/JSObject.cpp
index beebd86..b738394 100644
--- a/AndroidJSCore/AndroidJSCore/jni/JSObject.cpp
+++ b/AndroidJSCore/AndroidJSCore/jni/JSObject.cpp
@@ -82,6 +82,7 @@ Instance::~Instance() {
        jvm->AttachCurrentThread(&env, NULL);
        env->DeleteGlobalRef(thiz);
        objMap[objRef] = NULL;
+       jvm->DetachCurrentThread();
 }
 std::map<JSObjectRef,Instance *> Instance::objMap = std::map<JSObjectRef,Instance *>();
 void Instance::StaticFinalizeCallback(JSObjectRef object)
@@ -107,8 +108,8 @@ void Instance::FinalizeCallback(JSObjectRef object)
                        return;
                }
        } while (true);
-
        env->CallVoidMethod(thiz, mid, (jlong)object);
+       jvm->DetachCurrentThread();
 }
 NATIVE(JSObject,jlong,makeWithFinalizeCallback) (PARAMS, jlong ctx) {
        Instance *instance = new Instance(env, thiz, (JSContextWrapper *)ctx);
@@ -168,6 +169,7 @@ Function::~Function() {
        JNIEnv *env;
        jvm->AttachCurrentThread(&env, NULL);
        env->DeleteGlobalRef(thiz);
+       jvm->DetachCurrentThread();
 }
 void Function::release(__attribute__((unused))JSContextRef ctx, JSObjectRef function) {
        Function *f = objMap[function];
@@ -222,6 +224,7 @@ JSValueRef Function::FunctionCallback(JSContextRef ctx, JSObjectRef function, JS
                        argsArr, (jlong)exception);

        delete args;
+       jvm->DetachCurrentThread();
        return (JSObjectRef)objret;
 }
 JSObjectRef Function::ConstructorCallback(JSContextRef ctx, JSObjectRef constructor,
@@ -251,6 +254,7 @@ JSObjectRef Function::ConstructorCallback(JSContextRef ctx, JSObjectRef construc
                        argsArr, (jlong)exception);

        delete args;
+       jvm->DetachCurrentThread();
        return (JSObjectRef)objret;
 }

ericwlange commented 8 years ago

I've committed this change to HEAD, as is clearly a bug, even if it is not the bug.

ericwlange commented 8 years ago

I have created a 2.2 pre-release. Let me know if that has any impact on the problem.

IvayloDankolov commented 8 years ago

Hello. Nicely spotted, though it doesn't address the deadlock (neither in the native DispatchQueue nor my Java attempt). For our use cases, the JS worker threads basically have the lifetime of the application, so we don't run into the problem of leaking them.

I noticed the reference maps you keep on the native side (i.e. static std::map<JSObjectRef,Function *> objMap;) aren't exactly thread safe, which could cause a problem when a worker and the FinalizerDaemon access them at the same time. I'm investigating if that has any impact on my problem at the moment.

ericwlange commented 8 years ago

That's too bad. I have one more thing for you to try. This is the code I used that uncovered that bug.

It bypasses all of the DispatchQueue stuff and instead creates a new thread for every call, waits for the thread to finish, and then returns. sync and async do the same thing.

This will help isolate where the problem exists. Please let me know what behavior you observe from this. If the problem goes away, that means something is broken in the threading code. If it remains, then I have some ideas on how to isolate the issue further.

--- DispatchQueue.cpp   2016-04-30 16:41:34.000000000 +0530
+++ DispatchQueue.test  2016-05-01 16:21:17.000000000 +0530
@@ -33,24 +33,49 @@
 #include "DispatchQueue.h"

 DispatchQueue::DispatchQueue(unsigned pool) {
+/*
     pool = (pool<1) ? 1 : (pool>16) ? 16:pool;
     _dispatchThreads = new DispatchThread[pool];
     _pool = pool;
+*/
 }

 DispatchQueue::~DispatchQueue() {
-   delete [] _dispatchThreads;
+// delete [] _dispatchThreads;
 }

+struct testme {
+    std::function<void(void *)> func;
+    void *payload;
+};
+
+static void * doitnow(void *p) {
+    struct testme *x = (struct testme *)p;
+    x->func(x->payload);
+    return NULL;
+};
+
 int DispatchQueue::sync(std::function<void(void *)> func, void *payload) {
-   return pickThread()->sync(func,payload);
+   //return pickThread()->sync(func,payload);
+   pthread_t thread;
+    struct testme args = {
+            func,
+            payload
+    };
+
+   pthread_create(&thread, NULL, doitnow, &args);
+    pthread_join(thread,NULL);
+
+    return 0;
 }

 int DispatchQueue::async(std::function<void(void *)> func, void *payload) {
-   return pickThread()->async(func,payload);
+   //return pickThread()->async(func,payload);
+    return sync(func,payload);
 }

 DispatchThread* DispatchQueue::pickThread() {
+   /*
    // If we are already being called from a worker thread, just use that one
    // Otherwise, we could deadlock
    unsigned min = 0;
@@ -63,6 +88,8 @@
        }
    }
    return &_dispatchThreads[min];
+   */
+   return NULL;
 }

 /** class DispatchThread **/

ericwlange commented 8 years ago

Gentlemen, this bug has finally been isolated and swatted. Please try version 2.2-pre2. Another user was able to send me a simple test case that deadlocked, so I was able to track it down after much gnashing of teeth. JNI simply does not like to block on background threads, so I instead removed all of the threading code from C++ and instead implemented a Handler in java.

ericwlange commented 8 years ago

Marking as duplicate of #24

ericwlange / AndroidJSCore

Patch for 2.1? #18