crash in android in disk_io_thread::try_flush_hashed

aldenml commented 7 years ago

Please provide the following information

libtorrent version (or branch): master

platform/architecture: android

compiler and compiler version: clang NDK 13

pid: 21826, tid: 21936, name: Thread-7015  >>> com.frostwire.android <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xdaa00000
    r0 00000000  r1 00000010  r2 00000010  r3 da9fffe4
    r4 00000010  r5 0000000e  r6 00000080  r7 00000010
    r8 040000d6  r9 00400100  sl 00000010  fp dd8fa7f8
    ip 00000010  sp dd8fa798  lr 00000001  pc dfdeefd8  cpsr 800f0010

backtrace:
    #00 pc 00221fd8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread16try_flush_hashedEPNS_18cached_piece_entryEiRNS_9tailqueueINS_11disk_io_jobEEERNSt6__ndk111unique_lockINS7_5mutexEEE+340)
    #01 pc 00228a18  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread15do_flush_hashedEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+252)
    #02 pc 00223d6c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread11perform_jobEPNS_11disk_io_jobERNS_9tailqueueIS1_EE+144)
    #03 pc 0022947c  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread10thread_funENS0_13thread_type_tEN5boost4asio10io_service4workE+436)
    #04 pc 0022a6a8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZN10libtorrent14disk_io_thread9job_queue10thread_funEN5boost4asio10io_service4workE+56)
    #05 pc 0022f0b8  /data/app/com.frostwire.android-2/lib/arm/libjlibtorrent.so (_ZNSt6__ndk114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN10libtorrent21pool_thread_interfaceEFvN5boost4asio10io_service4workEEPS8_SC_EEEEEPvSH_+156)
    #06 pc 0003fa3b  /system/lib/libc.so (_ZL15__pthread_startPv+30)
    #07 pc 0001a085  /system/lib/libc.so (__start_thread+6)

Any hint of where to look? Now I have debug versions but are useless, since this error manifest only in release while running in the wild.

arvidn commented 7 years ago

do you know what line that access is on, and which variable it is for? Or, I forget, do you get dumps?

aldenml commented 7 years ago

This is all I get from the google play developer console. It happens with a frequency that I can't ignore or blame the device.

aldenml commented 7 years ago

I see in the code four calls to alloca, do you think this could be a problem?

file::iovec_t* iov = TORRENT_ALLOCA(file::iovec_t, p->blocks_in_piece * cont_pieces);
int* flushing = TORRENT_ALLOCA(int, p->blocks_in_piece * cont_pieces);
// this is the offset into iov and flushing for each piece
int* iovec_offset = TORRENT_ALLOCA(int, cont_pieces + 1);
int iov_len = 0;
// this is the block index each piece starts at
int block_start = 0;
// keep track of the pieces that have had their refcount incremented
// so we know to decrement them later
int* refcount_pieces = TORRENT_ALLOCA(int, cont_pieces);

arvidn commented 7 years ago

you think you may overflow the stack? I would think you'd get a different error for that, but maybe not. does android use segmented stacks?

arvidn commented 7 years ago

gcc calls it split stack. anyway, a short term fix to mitigate this may be to run with a single disk thread. If you try that, and it goes away, that would suggest it's a race condition.

aldenml commented 7 years ago

you know, I can't find any information related to this (segmented stacks) and android. Not clear to me if this is something provided by the compiler alone, or it needs some help from the architecture (assembly instructions) or even the standard library.

Anyway, I added the flag -fsplit-stack to the compiler only to get

fatal error: error in backend: Segmented stacks do not support vararg functions.
clang38++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Android clang version 3.8.256229  (based on LLVM 3.8.256229)
Target: armv7a-none-linux-android
Thread model: posix

Do you have any idea of how to properly test if it's a stack problem? Do you know if it's possible to add a unit test to detect if there is a race condition problem?

arvidn commented 7 years ago

you could write a test program that triggers a stack overflow and run it on android to see how the error is reported. If it looks different than the report in this ticket we could rule it out.

as for triggering races in tests, the only way I can think of would be to write a stress test that puts heavy , sustained load on the disk subsystem. The actual disk I/O would probably have to be disabled too

aldenml commented 7 years ago

(in case you have 5 mins to tell me) how I can force the call to disk_io_thread::try_flush_hashed in a transfer? any settings?

arvidn commented 7 years ago

of the top of my head, try_flush_hashed is called for a piece when the cache is under pressure and it flushes blocks in the piece that have already been hashed. i.e. the SHA-1 hash cursor has passed over them and their data has been incorporated in the SHA-1 state (which is held by the hash pointer in the chached_piece_entry object).

If you're interested in a stress test, you can use connection_tester, which I believe is under tools or test. With it you can generate data, generate torrents, connect to a peer using a specific torrent and upload, download fast, or both. connection_tester can make multiple parallel connections, so you may want to turn off the one-peer-per-ip setting in libtorrent under test too.

aldenml commented 7 years ago

This crash is happening more and more :(

Hi @ssiloti, I notice this commit (https://github.com/aldenml/libtorrent/commit/9c5bb25d6570ddbc98308499086c5073d6482749) and I wonder if it's entirely right. Previously the call to pe->outstanding_flush = 0; was after the try_flush_hashed. Now it's right at the beginning, but if pe->num_dirty != 0 and pe->hashing_done, then I fear that multiple "flush" tasks are scheduled beacouse try_flush_hashed unlocks for a moment. Does this makes sense? Thanks

ssiloti commented 7 years ago

It's possible there's something subtle going on there. Since you can reproduce it easily try reverting that commit and seeing what happens. IIRC that commit was a fix for a deadlock when the write cache is disabled so as long as you have that enabled everything should still work.

ssiloti commented 7 years ago

Hmm, I just noticed the comment in do_flush_hashed saying that there can be multiple flush jobs active anyways so it seems we need to handle this case. What we really need is a line accurate backtrace so that we can pinpoint which line is causing the crash.

arvidn commented 7 years ago

@aldenml if you still have the binary+debug information for that build (or can rebuild it), you can load it up in gdb and issue: "list *address", where address is one of the addresses in the stack-trace (after "pc"). That will tell you exactly where it is (within the accuracy you get with optimizations enabled)

ssiloti commented 7 years ago

@aldenml is there any pattern to the fault address you see in the crash reports? The one in the report you posted is above the 3GB mark which is odd because that puts it in the kernel's address space. That suggests this isn't just a simple dangling pointer access or stack/buffer overflow.

frostwire commented 7 years ago

Here's the value of the PC register value on the last crashes received on try_flush_cached

pc e006cfd8 -> 3758542808 -> 3.75 GB
pc e01f3fd8 -> 3760144344 -> 3.76 GB
pc dfcfafd8 -> 3754930136 -> 3.75 GB
pc dedd7fd8 -> 3739058136 -> 3.73 GB
pc 9f319fd8 -> 2670829528 -> 2.67 GB*
pc dfeb9fd8 -> 3756761048 -> 3.75 GB
pc e05e5fd8 -> 3764281304 -> 3.76 GB
pc 9e4a7fd8 -> 2655682520 -> 2.65 GB
pc e0452fd8 -> 3762630616 -> 3.76 GB
pc e040ffd8 -> 3762356184 -> 3.76 GB
pc a1146fd8 -> 2702471128 -> 2.7 GB*
pc 9e4a9fd8 -> 2655690712 -> 2.65 GB*

They all seem to be around there. Except a few* that seem to be below 0xc0000000

frostwire commented 7 years ago

@ssiloti could this mean that the crash occurs during a system call from try_flush_cached? meaning the system call points the PC register at some routine referenced in kernel address space

ssiloti commented 7 years ago

I meant the fault address, not the program counter, in the report posted the fault was at 0xdaa00000.

frostwire commented 7 years ago

fault addr 0xd88c0000 -> 3633053696
fault addr 0xd3d00000 -> 3553624064
fault addr 0xd2c80000 -> 3536322560
fault addr 0xdd900000 -> 3717201920
fault addr 0x99d40000 -> 2580807680*
fault addr 0xd8b40000 -> 3635675136
fault addr 0xd4f80000 -> 3573022720
fault addr 0x98580000 -> 2555904000*
fault addr 0xd8840000 -> 3632529408
fault addr 0xd5d40000 -> 3587440640
fault addr 0xd8140000 -> 3625189376
fault addr 0x98c00000 -> 2562719744*
fault addr 0x91080000 -> 2433220608*

* address below 0xc0000000, in user space.

aldenml commented 7 years ago

The error is not easy to replicate, in fact, it never happened in our internal tests. I have no pattern in the error reporting, only that it seems to happens more in Android 6. (interesting the point about the address)

With your comments and suggestions, I will try harder to get the necessary information. Thanks again.

aldenml commented 7 years ago

@arvidn @ssiloti after doing some workout with gdb :) I finally narrowed the crash to this line:

        // count number of blocks that would be flushed
        int num_blocks = 0;
        for (int i = end-1; i >= 0; --i)
====>>>>    num_blocks += (p->blocks[i].dirty && !p->blocks[i].pending);

        // we did not satisfy the block_limit requirement
        // i.e. too few blocks would be flushed at this point, put it off
        if (block_limit > num_blocks) return 0;

Any ideas of how cached_piece_entry::blocks can become invalid/corrupt?

aldenml commented 7 years ago

Or is it the end calculation wrong?

ssiloti commented 7 years ago

Still not sure what's going on here. The fact the the upper 16-bits of the fault address appears to be garbage and the lower 16-bits all zeros suggests something, I'm just not sure what. The obvious suspect would be a torn read, but those shouldn't happen with 32-bit loads on ARM.

aldenml commented 7 years ago

I wish I could say something relevant (reading your PR)

aldenml commented 7 years ago

unfortunately, I can confirm that the PR https://github.com/arvidn/libtorrent/pull/1255 does not fix the error (I ported it to master and I got the crash)

ssiloti commented 7 years ago

Did you have asserts enabled? It would be interesting to know if the assert #1255 added was triggered.

aldenml commented 7 years ago

no, but now that the crash is happening in our tests, I would do it

arvidn commented 7 years ago

I have experienced (second hand) unaligned reads not to cause a SIGBUS, but instead cause a word-flipped result (i.e. an unaligned 32 bit read stitching together two 16 bit reads that are not contiguous, but from adjacent words). This was on an alpha though, I would expect ARM to be a bit less weird, but maybe there are weird arms out there too

aldenml commented 7 years ago

Good news, the crash is gone!

We were lucky to have a device in which the crash manifested repeatedly. The PR https://github.com/arvidn/libtorrent/pull/1255 was definitively an improvement but only after the last comment I was thinking about the possibility of bad things going on with -O3 and struct's bitfields and mutex-locks. I just compiled with -Os and the crash disappears.

It's a monumental task to review all the assembly involved to determine what exactly is happening, maybe the ARM clang code generation needs more working. Well, this is all and thanks for all the help.

arvidn commented 7 years ago

it might be worth turning the bitfields into normal members, to see if that fixes it too. I wonder if there's a tearing issue because of it. Perhaps some of those fields must live in their own byte to be thread safe

arvidn commented 7 years ago

I think it is important to understand this. My suspicion is that the code is wrong, it just doesn't manifest itself on x86

arvidn commented 7 years ago

actually, all of the fields in cached_block_entry should only be accessed when holding the mutex, so tearing shouldn't be an issue. as far as I can tell.

aldenml commented 7 years ago

I'm reopening the issue as a remainder that this needs further investigation.

joycepg commented 7 years ago

we've been getting this error too, but only on certain devices. the nexus 7 has consistently crashed, whereas some samsung 10" tablets don't. i managed to get our ndk project loaded inside android studio, and finally got a traceback. android-libtorrent-crash not really sure what to do next though! this is using libtorrent 1.1.1 release

joycepg commented 7 years ago

i noticed no line numbers in the stacktrace above for libtorrent code. so i re-ran configure with --enable-debug. when i ran it, it did not crash.

then i modified configure to output -O0 instead of -Os when not debugging.

this resulted in a crash in a different spot, but during the same operation (downloading 6 torrents simultaneously) android-libtorrent-crash-o0

joycepg commented 7 years ago

another. same settings, same time (downloading 6 torrents), but different spot. smells like a rogue pointer somewhere, stomping over the heap earlier. android-libtorrent-crash-o0-again

arvidn commented 7 years ago

I've found that some older ARM chips have buggy atomic operations. You may want to try making boost::shared_ptr use a mutex instead of atomic increment/decrement. There's a define for that.

joycepg commented 7 years ago

you may be onto something there.

the define i found was BOOST_SP_NO_ATOMIC_ACCESS

i tried configuring libtorrent: CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ... but still experienced crashes.

i also tried to use pthreads instead of atomics: CPPFLAGS="-DBOOST_AC_USE_PTHREADS=1 -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ... but the app didn't even run. it hung during initialization

i seem to be at a dead end.

arvidn commented 7 years ago

I'm not sure if boost::shared_ptr has any compiled components in boost thread, but you may want to make sure that boost itself is built with the same configuration. (if you build with boost build, this property would be propagated to dependencies)

Falcosc commented 7 years ago

@joycepg did you try to rebuild boost to exclude this failure cause?

joycepg commented 7 years ago

finally got back to trying this.

rebuilt boost with

<compileflags>-DBOOST_ASIO_DISABLE_STD_ATOMIC=1
<compileflags>-DBOOST_SP_NO_ATOMIC_ACCESS=1

added to ~/user-config.jam

rebuilt libtorrent with

CPPFLAGS="-DBOOST_ASIO_DISABLE_STD_ATOMIC=1 -DBOOST_SP_NO_ATOMIC_ACCESS=1" ./configure ...

then rebuilt my .so with these in Android.mk

LOCAL_CFLAGS := -DBOOST_ASIO_DISABLE_STD_ATOMIC=1 \
-DBOOST_SP_NO_ATOMIC_ACCESS=1 \
-O0 -g

no luck.

the message variable didn't include anything useful. just the libtorrent version number 1.1.1.0-1229491

note that when i link my .so, which uses libtorrent, i only need to link with boost_system, boost_chrono and boost_random. the boost_thread library does not even compile. The error is

gcc.compile.c++ bin.v2/libs/thread/build/gcc-android4.9/release/link-static/runtime-link-static/target-os-linux/threading-multi/pthread/thread.o
In file included from libs/thread/src/pthread/thread.cpp:19:0:
./boost/thread/future.hpp: In member function 'void boost::promise<void>::lazy_init()':
./boost/thread/future.hpp:2549:37: error: 'atomic_load' was not declared in this scope
             if(!atomic_load(&future_))
                                     ^

but i didn't care to solve that because i don't need to link with it anyway.

i'm happy to receive suggestions. it's pretty quick for me to recompile and reproduce.

Falcosc commented 7 years ago

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce. I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

I did try to find the correct combination of build flags by looking through the code. But I didn’t find much at boost thread https://github.com/boostorg/thread/blob/develop/include/boost/thread/future.hpp But they use macros to overwrite the atomic_load definition. I guess there is option somewhere which enables ARM compatible atomic_load macros. Would be very nice if you could ask the atomic module team how to build boost for ARM Chips with non-reliable atomic implementations. I hope they could tell us how to disable atomic for boost or how to enable something which does apply a workaround for non-reliable atomic implementations of some ARM chips. I guess you should create an issue here https://github.com/boostorg/atomic Would be very nice if you could link it to our ticket, because as a result of it I hope we could find an ARM build suggestion for libtorrent documentation.

joycepg commented 7 years ago

Thanks for your reply

Thanks a lot to bring this further and nice to hear that at least the compilation failure is easy to reproduce.

The compilation of boost.thread is not the issue. boost.thread compiled library is not required in the linking. The libtorrent configure output shows this:

Boost libraries:
  version:              
  CPPFlags:             -I/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//include
  LDFlags:              -L/Users/pjoyce/Developer/latlongo_dependencies/android/armeabi-v7a//lib
  boost.system:         -lboost_system
  boost.chrono:         -lboost_chrono
  boost.random:         -lboost_random

So now I do not even attempt to build boost.thread. I just use

./b2 --with-system --with-chrono --with-random toolset=gcc-android4.9 link=static runtime-link=static target-os=linux --stagedir=android

regarding the actual error i am facing

I hope it’s not too hard to check the atomic error if we found a correct way to compile boost.

The issue is that libtorrent is crashing at pretty random locations and times (as shown by screenshots).

If you have some other compile options you want me to try, it is quick to recompile boost.{system,chrono,random}, libtorrent and my .so, and reproduce the crash. It takes about 5 minutes.

I will try to raise an issue on the boost repo that you found. they say there that armv6 is a supported platform with gcc 4.x. our device is armv7 and we are using gcc 4.9, so it should be ok.

joycepg commented 7 years ago

well that didn't take long for the boost atomic people to bounce it :-/

timblechmann commented 7 years ago

@joycepg i didn't bounce it, but pointed you to the right direction: according to the stack trace, is related to the smart pointer, there is no boost.atomic code involved.

arvidn commented 7 years ago

@joycepg I'm not sure what BOOST_SP_NO_ATOMIC_ACCESS does. The issue I've seen in the past (granted, this was almost 10 years ago now) on some arm chips was that the atomic operations did not have correct semantics and would break boost.shared_ptr. The work-around, then, was to enable BOOST_AC_USE_PTHREADS, which uses pthread_mutex instead of atomic add.

I don't know exactly what boost.shared_ptr used to do, I don't see any specific arm code in there now, presumably it's always tried to use intrinsics on gcc, like __sync_add_and_fetch.

If that doesn't help, it's most likely not a bug in boost, but in libtorrent.

joycepg commented 7 years ago

i had a read of the boost code and doco. in the boost asio doco it says

Support for the standard atomic integer template is automatically enabled for g++ 4.5 and later, when the -std=c++0x or -std=gnu++0x compiler options are used.

i saw this in the boost code too. so i made sure i passed -std=gnu++0x to the boost and libtorrent builds. this was nice symmetry, since i use that flag to build our app anyway.

no change; still crashed.

i then added -DBOOST_AC_USE_PTHREADS=1 -DBOOST_SP_USE_PTHREADS=1 to the flags, and rebuilt everything. when i ran it several times, it seemed to hang randomly, in the libtorrent/boost code. possibly a deadlock occurring.

arvidn commented 7 years ago

It sounds like there might be a bug in libtorrent's disk cache then

joycepg commented 7 years ago

ok. i'm happy to work with you to try to isolate/fix

aldenml commented 7 years ago

Hi @arvidn, the code by now is so different and the error (if any) is not happening in this way anymore. Do you think it's reasonable to close this?

arvidn commented 7 years ago

yeah, probably.

arvidn / libtorrent

crash in android in disk_io_thread::try_flush_hashed #1225