Stalker python script unload == SEGFAULT

bannsec commented 5 years ago

Looks like there's an issue when calling unload after Stalker.unfollow. It produces a SEGFAULT in my Linux environment.

Reproduction steps (specifically in python):

Start some app (any) from Frida's Python bindings
Create and load a new script that stalks Process.getCurrentThreadId()
Create and load another script that calls Stalker.unfollow for that thread ID
Unload the original stalker script (script.unload())

It appears that, on unload, the code assumes the thread is still being stalked. That assumption is violated if Stalker.unfollow was called before unloading the script that hosts the stalker.

For now, my workaround is to not use unfollow. However, in testing, unfollowing before unloading the script used to make the unload faster. In the current .14 version, unloading a stalker script takes maybe 5-10 seconds.

bannsec commented 5 years ago

Actually... this also appears to happen when not unfollowing first... Maybe it's a core issue with unloading a stalker script.

bannsec commented 5 years ago

For anyone else running into this, I've discovered that reverting to 12.6.11 fixes this issue. Anything newer than that crashes on unloading a stalker script and, apparently, Stalker.unfollow doesn't really unfollow on anything >= 12.6.12.

Ironically, I'm guessing this new issue came from the fix for this other issue in .12:

Stalker now allows unfollow() from the transform callback, instead of crashing the process like it used to. Kudos to Giovanni Rocca for helping fix this.

Now normal unfollow/unload crashes instead of crashing during transform unfollow.

bannsec commented 5 years ago

To help with identifying the issue, I've created a PoC script that seems to trigger it reliably:

frida_stalker_crash.zip

oleavr commented 5 years ago

Thanks! I'm afraid this must have worked by pure luck previously, because we don't currently support:

Following one of Frida's internal threads, the JS thread in this case. Only valid use-case I can think of for that is to follow what happens during a NativeFunction call to unknown code. But I've never tested this, and we need test-coverage before we can consider this supported.
Following from one script, and unfollowing from another. The current implementation creates one Stalker instance per script. So this is not going to work well.

Not saying we shouldn't support these two unusual use-cases, but are you able to reproduce this issue with a simpler test-case where Stalker is used conventionally? With just one script, and only following application code, not Frida itself.

oleavr commented 5 years ago

Btw, it's a known issue that we currently don't unfollow automatically when unloading a script. We should, but this hasn't been implemented yet.

bannsec commented 5 years ago

So I'm not intentionally trying to follow Frida. My goal was basically the following:

Put the main app thread into a wait-like state by effectively letting JS loop until a memory address is set from a separate script.
Turn on stalking.
Release the thread by setting the value mentioned.
The script continues to the next wait-like state.
Remove the stalker and analyze the trace.
Possibly add a new stalker for the next section of execution.

Stalking Frida's internals is really just a byproduct of pausing execution, not the goal.

On Fri, Aug 16, 2019, 3:35 PM Ole André Vadla Ravnås < notifications@github.com> wrote:

Btw, it's a known issue that we currently don't unfollow automatically when unloading a script. We should, but this hasn't been implemented yet.


oleavr commented 5 years ago

Sure, but this test-case only traces Frida's JS thread. Stalker only traces the thread you ask it to. And different scripts cannot stalk the same threads, that's not supposed to work so if that previously appeared to work, it only seemed that way.

bannsec commented 5 years ago

Then I'm confused. Thread enumeration (the sync variant) only returns a single thread ID, which is the same ID that getting the current thread ID returns. Is that the main thread of execution or not? I have used this approach previously to stalk programs.

On Fri, Aug 16, 2019, 6:47 PM Ole André Vadla Ravnås < notifications@github.com> wrote:

Sure, but this test-case only traces Frida's JS thread. Stalker only traces the thread you ask it to. And different scripts cannot stalk the same threads, that's not supposed to work so if that previously appeared to work, it only seemed that way.


oleavr commented 5 years ago

If you're calling Process.getCurrentThreadId() from your script's outer code – which is run when you script.load() – or a callback from a Frida API like setTimeout() – it will be the ID of Frida's JavaScript thread. But if you do it from onEnter / onLeave it will be whichever thread the hooked function was called on. However, Process.enumerateThreads() will filter out Frida's internal threads, so the JS thread should never be in that list, only application threads.
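To make the distinction concrete, here is a minimal sketch of picking a thread ID from the enumerated application threads instead of from Process.getCurrentThreadId() at the top level of a script. It is only an illustration: pickApplicationThreadId is a hypothetical helper (not part of Frida), it assumes a Frida version where Process.enumerateThreads() returns an array, and it assumes the first entry is the thread of interest.

```javascript
// Hypothetical helper: returns the ID of the first application thread.
// `process` is expected to look like Frida's Process object; the assumption
// is that enumerateThreads() returns an array of objects with an `id` field.
function pickApplicationThreadId(process) {
  const threads = process.enumerateThreads();
  if (threads.length === 0) {
    throw new Error('no application threads found');
  }
  // Process.getCurrentThreadId() called from top-level script code would
  // instead yield Frida's own JS thread, which is filtered out of this list.
  return threads[0].id;
}

// Only wired up when running inside a Frida script, where these globals exist.
if (typeof Process !== 'undefined' && typeof Stalker !== 'undefined') {
  const tid = pickApplicationThreadId(Process);
  Stalker.follow(tid, {
    events: { call: true },
    onCallSummary: function (summary) { /* inspect the summary here */ }
  });
}
```

Passing the Process object in as a parameter is purely a design choice for this sketch, so the selection logic can be exercised outside a Frida session.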

bannsec commented 5 years ago

Fair point. I've modified the test so that it uses thread enumeration instead, and so that it uses a looper test that I wrote, so that we're not accidentally hitting something to do with the Interceptor.

The looper binary basically just sits in a while loop and generates data for Stalking tests. It doesn't ever exit. The test case (stalking -> unstalking -> stalking) should be the same, except that it explicitly uses thread enumeration to identify the thread. The test case still crashes.

stalker_test.zip

oleavr commented 5 years ago

And different scripts cannot stalk the same threads

This is still violating this limitation. Because each script has its own Stalker instance, you cannot follow thread A in script A, and expect to be able to unfollow it from script B.

bannsec commented 5 years ago

Oh, I didn't realize that applied to unfollow as well. And since unloading a script doesn't unfollow, am I correct in understanding that I would need to have the stalker script constantly monitoring some shared memory to be able to unstalk a thread?

Is there any way for the script to register an "on unload" function that would unfollow? At present, the only thing I can think of is using script.post to send a message to the stalker script to unfollow prior to unloading it.

oleavr commented 5 years ago

No, it doesn't have to use shared memory. You can expose a function that you call through RPC:

rpc.exports = {
  unfollow: function (tid) {
    Stalker.unfollow(tid);
  },
};

Then from the Python side:

script.exports.unfollow(tid)
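Since each script has its own Stalker instance, one way to stay within that limitation is to expose both operations from the same script. The following is a rough sketch under stated assumptions: makeStalkerExports and its bookkeeping Set are hypothetical names introduced for illustration, and the follow options are a minimal placeholder.

```javascript
// Hypothetical factory: builds RPC exports around a single Stalker-like object,
// so follow() and unfollow() always go through the same script's instance.
function makeStalkerExports(stalker) {
  const followed = new Set(); // thread IDs this script is currently following
  return {
    follow: function (tid) {
      if (followed.has(tid)) return false; // already followed by this script
      stalker.follow(tid, {
        events: { call: true },
        onCallSummary: function (summary) { /* inspect the summary here */ }
      });
      followed.add(tid);
      return true;
    },
    unfollow: function (tid) {
      if (!followed.has(tid)) return false; // never followed by this script
      stalker.unfollow(tid);
      followed.delete(tid);
      return true;
    }
  };
}

// Only wired up when running inside a Frida script, where these globals exist.
if (typeof rpc !== 'undefined' && typeof Stalker !== 'undefined') {
  rpc.exports = makeStalkerExports(Stalker);
}
```

From the Python side these would be called the same way as the unfollow export shown above, for example script.exports.follow(tid) followed later by script.exports.unfollow(tid).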

For cleanup, you should define an RPC export named dispose. This will get called automatically when a script is .unload()ed:

rpc.exports = {
  dispose: function () {
    /* Do the cleanup – you can even return a Promise if you need to wait for something else to happen first. */
  },
};

Also, if you need to be able to dynamically call JavaScript code not known ahead of time, like to implement a REPL, you can RPC-export a function that calls eval() on a string of JavaScript given to it.
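A minimal sketch of such an eval-based export might look like the following. The export name evaluate is an arbitrary choice for illustration, and the sketch assumes the evaluated result is something that can be serialized back over RPC. It runs whatever string it is given, so it is only suitable for input you trust.

```javascript
// Hypothetical export: evaluates a string of JavaScript and returns the result.
// The indirect call (0, eval) runs the code in the global scope rather than
// in this function's local scope.
function evaluate(source) {
  return (0, eval)(source);
}

// Only wired up when running inside a Frida script, where `rpc` exists.
if (typeof rpc !== 'undefined') {
  rpc.exports = { evaluate: evaluate };
}
```

From Python this would be driven as script.exports.evaluate("1 + 2"), in the same style as the unfollow call above.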

bannsec commented 5 years ago

Thanks! This is super helpful.

So I guess one or two more things. With this modification, it appears there's a race condition between running Stalker.unfollow and unloading the script. Stalker.unfollow returns immediately, so when I place it inside dispose (or call it through the unfollow export as you mentioned), it actually causes a segfault and crashes the application.

It appears this is due to Stalker.unfollow still working to unfollow when the Frida core yanks it out of memory. If I add a sleep in there (1 second works, timing may vary due to processor speed), then it successfully unloads without killing the process.

A second issue I've noticed is that unloading this stalker script takes maybe 5 seconds. While not terrible, it's definitely slower than unloading most things. I've found that calling the Stalker.unfollow export directly and simply leaving the script loaded lets me quickly run another Stalker on the thread without having to wait for that script to be removed. This is a faster approach, with the obvious downside of leaving scripts floating around in the process.

oleavr commented 5 years ago

Glad to hear!

There's unfortunately no good way to deal with that currently, though you might improve things a little by calling Stalker.flush() right after Stalker.unfollow(). This will call onReceive / onCallSummary with any pending data. But yeah, we should really just fix this so we tear down Stalker properly on unload.
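Putting the pieces from this thread together, a dispose export could unfollow, flush, and then delay the unload by returning a Promise. This is a sketch of that workaround, not a fix for the underlying race: makeCleanupExports and delayMs are hypothetical names, and the 1000 ms value is just the figure reported earlier in the thread as working on one machine.

```javascript
// Hypothetical factory: tracks followed threads and cleans them up in dispose().
// `delayMs` is how long to wait before letting the unload proceed; the value
// is a trial-and-error workaround, not a documented guarantee.
function makeCleanupExports(stalker, delayMs) {
  const followed = new Set(); // thread IDs this script is currently following
  return {
    follow: function (tid) {
      stalker.follow(tid, {
        events: { call: true },
        onCallSummary: function (summary) { /* inspect the summary here */ }
      });
      followed.add(tid);
    },
    dispose: function () {
      followed.forEach(function (tid) { stalker.unfollow(tid); });
      followed.clear();
      stalker.flush(); // deliver any pending onReceive / onCallSummary data
      // Returning a Promise lets the cleanup wait for something before finishing.
      return new Promise(function (resolve) { setTimeout(resolve, delayMs); });
    }
  };
}

// Only wired up when running inside a Frida script, where these globals exist.
if (typeof rpc !== 'undefined' && typeof Stalker !== 'undefined') {
  rpc.exports = makeCleanupExports(Stalker, 1000);
}
```

The fixed delay is the weakest part of this approach, since the time the stalked thread needs to leave instrumented code depends on what it is doing.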

As for the delay, that does sound pretty excessive. It would be good to use a profiler to make sure it's not doing something silly. It could also be that it's waiting for the stalked thread to execute some code, as it could theoretically be stuck in a system call that's blocking for 5 seconds. We basically keep calling Stalker.garbage_collect() every 10 milliseconds until it returns FALSE to indicate there's no more pending garbage. That logic is here for the Duktape runtime.
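The polling pattern described here can be sketched generically as below. This is not Frida's actual implementation: pollUntilDone is a hypothetical helper, and collect is a stand-in callback for the internal garbage-collection call, returning true while garbage is still pending.

```javascript
// Hypothetical helper: calls `collect` every `intervalMs` milliseconds until it
// returns false, then resolves with the number of calls that were made.
function pollUntilDone(collect, intervalMs) {
  return new Promise(function (resolve) {
    let attempts = 0;
    function tick() {
      attempts++;
      if (collect()) {
        setTimeout(tick, intervalMs); // garbage still pending, try again later
      } else {
        resolve(attempts);
      }
    }
    tick();
  });
}
```

With a loop like this, the total unload time is bounded by how long collect keeps reporting pending garbage, which would be consistent with a multi-second delay if the stalked thread is blocked in a system call.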

bannsec commented 5 years ago

Possibly it's related to the issue where I have to force-close the Python interpreter for Frida.

<clipped>
# cleanup[3] wiping math
#   clear[2] __name__
#   clear[2] __doc__
#   clear[2] __package__
#   clear[2] __loader__
#   clear[2] __spec__
#   clear[2] acos
#   clear[2] acosh
#   clear[2] asin
#   clear[2] asinh
#   clear[2] atan
#   clear[2] atan2
#   clear[2] atanh
#   clear[2] ceil
#   clear[2] copysign
#   clear[2] cos
#   clear[2] cosh
#   clear[2] degrees
#   clear[2] erf
#   clear[2] erfc
#   clear[2] exp
#   clear[2] expm1
#   clear[2] fabs
#   clear[2] factorial
#   clear[2] floor
#   clear[2] fmod
#   clear[2] frexp
#   clear[2] fsum
#   clear[2] gamma
#   clear[2] gcd
#   clear[2] hypot
#   clear[2] isclose
#   clear[2] isfinite
#   clear[2] isinf
#   clear[2] isnan
#   clear[2] ldexp
#   clear[2] lgamma
#   clear[2] log
#   clear[2] log1p
#   clear[2] log10
#   clear[2] log2
#   clear[2] modf
#   clear[2] pow
#   clear[2] radians
#   clear[2] sin
#   clear[2] sinh
#   clear[2] sqrt
#   clear[2] tan
#   clear[2] tanh
#   clear[2] trunc
#   clear[2] pi
#   clear[2] e
#   clear[2] tau
#   clear[2] inf
#   clear[2] nan
# cleanup[3] wiping frida
#   clear[1] _frida
#   clear[1] _device_manager

That's what I get when I use python -vvv, run some stalking, then exit. Python simply hangs at that point, and the only way to actually exit is Ctrl-C. Not sure how to even go about finding out what the issue is here, but it seems to be something to do with how Frida's _device_manager cleans itself up at exit...

bannsec commented 5 years ago

Trapping into Python at this point with gdb, I get the following:

#1  0x00007fd6faafec4b in g_cond_wait () at ../../../glib/glib/gthread-posix.c:1512
#2  0x00007fd6f79f4786 in frida_async_task_start_and_wait_for_completion (self=self@entry=0xedd290, error=error@entry=0x7ffc0438b438) at ../../../frida-core/src/frida.vala:2420
#3  0x00007fd6f79f4e37 in frida_device_manager_close_sync (self=<optimized out>) at ../../../frida-core/src/frida.vala:46
#4  0x00007fd6f79cae76 in PyDeviceManager_dealloc (self=0x7fd6fb17dda0) at ../../../frida-python/src/_frida.c:1742
#5  0x0000000000572ff0 in ?? ()
#6  0x00000000005542f2 in ?? ()
#7  0x000000000056e826 in PyDict_SetItem ()
#8  0x0000000000565116 in _PyModule_ClearDict ()
#9  0x0000000000459d49 in ?? ()
#10 0x0000000000637efe in Py_FinalizeEx ()
#11 0x0000000000638f5e in Py_Main ()
#12 0x00000000004a6f10 in main ()

Continuing and breaking a few times, it appears that Python/Frida is frozen right there. So perhaps there's some task it is waiting on indefinitely, based on an assumption that no longer holds.

One thought is that it assumes the process is still alive? The process is dead at this point, so maybe it's making that assumption?

frida / frida

Stalker python script unload == SEGFAULT #986