aengelke / instrew

A high performance LLVM-based dynamic binary instrumentation framework
GNU Lesser General Public License v2.1

Multi-thread seems to be not supported #9

Closed: oldiob closed this issue 1 year ago

oldiob commented 1 year ago

Hi,

I'm trying to use instrew for benchmarking against another tool. The benchmark is multi-threaded, and it seems that instrew cannot handle system calls 56 (clone) and 435 (clone3).

Here is a minimal reproducer:

#include <stdio.h>

#include <pthread.h>

static void *worker(void *nil)
{
        printf("worker\n");

        return nil;
}

int main()
{
        pthread_t th;

        pthread_create(&th, NULL, worker, NULL);
        pthread_join(th, NULL);

        printf("main\n");

        return 0;
}

With instrew, the worker string is never printed.

Distro: Guix

younghojan commented 1 year ago

Hi there! According to this thesis, multi-threading is not supported yet.

aengelke commented 1 year ago

This is correct, clone is not implemented right now. I did start to work on multi-threading some time ago, but this is "non-trivial" to do right, to say the least.

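clone has fairly complex semantics, and some operations are unlikely to ever work:

CLONE_VM without CLONE_THREAD starts a new process that shares the virtual memory. If at all possible, this seems difficult to handle.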

The other problem is that threads/processes closely interact with signals, which are extremely difficult to implement efficiently. Pthread implementations like NPTL often require working signal delivery. Right now, I only handle signal handler registration, but abort when actually attempting to deliver a signal. Delivery would need to interrupt the thread at some point, but if the thread is in a tight loop, getting the whole ucontext is hard. Checking a "signal pending" flag at every loop entry is possible, but inefficient. Relying on LLVM to generate proper DWARF info for reconstructing the data might be possible, but seems fragile, and correctness is probably not guaranteed. Using LLVM while getting guaranteed information at every instruction boundary seems like a good research question...
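To make the "signal pending" polling idea concrete, here is a minimal sketch in plain C. It is not instrew code: `signal_pending`, `on_host_signal`, and `sum_until_signal` are hypothetical names, and the loop only stands in for a translated guest loop. The check on every iteration is exactly the overhead described above.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag that a host signal handler would set. */
static volatile sig_atomic_t signal_pending = 0;

/* Hypothetical host handler: only records that a signal arrived. */
static void on_host_signal(int signo)
{
        (void)signo;
        signal_pending = 1;
}

/*
 * Stand-in for a translated guest loop. The flag is polled at every
 * loop entry, so the loop can stop at a point where the guest state is
 * fully known. Returns the number of elements processed.
 */
static size_t sum_until_signal(const int *data, size_t n, long *sum)
{
        size_t i;

        *sum = 0;
        for (i = 0; i < n; i++) {
                if (signal_pending)
                        break; /* a real DBT would deliver the guest signal here */
                *sum += data[i];
        }
        return i;
}
```

In a real setup the handler would be registered with sigaction, and the exit path would have to build the guest ucontext before running the guest handler.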

Yet another problem with shared memory is memory consistency. Right now, I don't have to care. However, when translating x86 guest code, all operations do have atomicity constraints (acquire/release), but can also be unaligned. This is not representable in LLVM, and even if it were, the translated code would be much slower. Translating to non-TSO architectures requires fences everywhere (or acquire semantics, but this often requires alignment, which is not required on x86), unless you can reconfigure your CPU to implement TSO like Apple does. x86 also supports unaligned atomics (on a cache line split, they lock the whole bus), which again is not representable in LLVM or any other ISA. Many people have tried to efficiently translate x86 to non-TSO architectures, but so far, no good solution has been found and it might be impossible (maybe someone could actually prove that, so that no one has to waste more time on this matter).
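As a rough illustration of the "acquire semantics" alternative to fences, the sketch below maps a guest load and store to C11 acquire/release atomics. `guest_load32` and `guest_store32` are hypothetical names, not instrew functions. Two assumptions matter: the pointer must be suitably aligned for `_Atomic uint32_t`, which x86 guest code does not guarantee, and the C11 memory model alone does not promise full x86-TSO behavior, so this only shows the shape of the mapping.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical guest load: an x86 load modeled as an acquire load. */
static uint32_t guest_load32(_Atomic uint32_t *p)
{
        return atomic_load_explicit(p, memory_order_acquire);
}

/* Hypothetical guest store: an x86 store modeled as a release store. */
static void guest_store32(_Atomic uint32_t *p, uint32_t value)
{
        atomic_store_explicit(p, value, memory_order_release);
}
```

Every guest memory access would have to go through such a path, which is where the slowdown mentioned above comes from.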

That said, your code is not correct. You must check pthread_create for errors and you can only join if the thread was successfully created.
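A minimal sketch of the reproducer with the error checks described above. `run_worker` is a hypothetical helper name; calling it from main and returning its result would give a complete program. Note that pthread functions return the error number directly instead of setting errno.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg)
{
        printf("worker\n");
        return arg;
}

/*
 * Hypothetical helper: returns 0 on success, otherwise the error
 * number reported by pthread_create or pthread_join.
 */
static int run_worker(void)
{
        pthread_t th;
        int err;

        err = pthread_create(&th, NULL, worker, NULL);
        if (err != 0) {
                fprintf(stderr, "pthread_create: %s\n", strerror(err));
                return err; /* no thread exists, so do not join */
        }

        err = pthread_join(th, NULL);
        if (err != 0) {
                fprintf(stderr, "pthread_join: %s\n", strerror(err));
                return err;
        }

        printf("main\n");
        return 0;
}
```

With the check in place, a failing clone shows up as an error message from pthread_create instead of a silently missing "worker" line.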

oldiob commented 1 year ago

> CLONE_VM without CLONE_THREAD starts a new process that shares the virtual memory. If at all possible, this seems difficult to handle.

It is possible to do so. One usage of this is to be able to PTRACE_SEIZE the parent and unwind its stack without using PTRACE_PEEKDATA.

> This is not representable in LLVM

I am not familiar with LLVM, but I'm quite surprised to hear that it cannot represent instruction alignment. I wonder how Rosetta from Apple does it then.

> That said, your code is not correct. You must check pthread_create for errors and you can only join if the thread was successfully created.

Right. That was just a toy example. I do get an ENOSYS from pthread_create. This explains the problems I had with my benchmark.

Thanks for the answer!

I'm closing the issue since this is a documented limitation.

aengelke commented 1 year ago

> It is possible to do so. One usage of this is to be able to PTRACE_SEIZE the parent and unwind its stack without using PTRACE_PEEKDATA.

I know that such a clone is possible and has use cases; what I meant was: this is hard to implement in a DBT system. The new process can have different signal handlers, etc., and this state needs to be managed in a single address space for an unlimited number of processes. Not even QEMU supports this.

> I am not familiar with LLVM, but I'm quite surprised to hear that it cannot represent instruction alignment.

This is not about instruction alignment, but about unaligned memory addresses for atomic operations. LLVM-IR instructions like atomicrmw or cmpxchg require that the alignment of the memory address is greater than or equal to the value size, which is not required on x86.
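The alignment condition can be sketched as two small checks. These are hypothetical helpers, not part of instrew or LLVM: the first tests the "alignment greater than or equal to the value size" requirement, the second tests for the cache line split mentioned earlier in the thread. The 64-byte line size is an assumption, and `size` is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; real hardware may differ. */
#define CACHE_LINE_SIZE 64u

/*
 * True if an access of `size` bytes at `addr` is aligned to at least
 * its own size. `size` must be a power of two.
 */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
        return (addr & (size - 1)) == 0;
}

/* True if the access touches two different cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
        return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}
```

A translator could use such a check to take a fast path for aligned atomics, but the unaligned case would still need some slower fallback, which is the part that has no direct LLVM-IR equivalent.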

> I wonder how Rosetta from Apple does it then.

Their CPUs have hardware support for operations beyond standard AArch64, including a TSO mode (implementing x86 memory ordering/semantics in hardware – much faster/better/easier than in software), special instructions, and extra flags equivalent to x86 AF/PF. I'm not sure whether they support unaligned atomic operations in hardware, but I wouldn't be surprised if they did.

younghojan commented 9 months ago

I also encountered this problem while testing SPEC CPU 2017 657.xz_s. Please tell me how to solve it. This is very important to me, thank you!

root@9c2d8b47d826:~/SPEC2017/benchspec/CPU/657.xz_s/run/run_base_refspeed_i686-static-m64.0000# ~/instrew/build/server/instrew ../../exe/xz_s_base.i686-static-m64 cpu2006docs.tar.xz 6643 055ce243071129412e9dd0b3b69a21654033a9b723d874b2015c774fac1553d9713be561ca86f74e4f16f22e664fc17a79f30caa5ad2c04fbc447549c2810fae 1036078272 1111795472 4

SPEC CPU XZ driver: input=cpu2006docs.tar.xz insize=6643
Loading Input Data
Compressed size: 1287176; Uncompressed size: 9041920
SHA-512 of decompressed data compared successfully!
SHA-512 of input file: 5eec56e04269bcb81dd120f2f81299e973341e4a2579c146ccea7af4a74fbf7049966dd7fb91d6fbecfa2238d096a6ead91379e4a0c9bf11d5ec7d0472f369bf
Input data 6965690368 bytes in length
Compressing Input Data, level 4
work available for up to 6965690368 / 13631488 => 512 threads
context size per thread: 13312 KB
unhandled syscall 435 (ffff9056ea70 58 459710 8 fffbfcb0e640 ffff9056eb7f) = -ENOSYS -- please file a bug with these numbers and the architecture
unhandled syscall 56 (3d0f00 fffbfcb0e1f0 fffbfcb0e910 fffbfcb0e910 fffbfcb0e640 fffbfcb0e640) = -ENOSYS -- please file a bug with these numbers and the architecture

libgomp: Thread creation failed: Function not implemented

aengelke commented 9 months ago

In your SPEC CPU config, add EXTRA_OPTIMIZE = -fno-openmp -DSPEC_SUPPRESS_OPENMP to the intspeed,fpspeed section and set threads=1. Consult the SPEC CPU documentation for more details.

younghojan commented 9 months ago

> In your SPEC CPU config, add EXTRA_OPTIMIZE = -fno-openmp -DSPEC_SUPPRESS_OPENMP to the intspeed,fpspeed section and set threads=1. Consult the SPEC CPU documentation for more details.

That's very useful, thanks a lot!