HewlettPackard / quartz

Quartz: A DRAM-based performance emulator for NVM
https://github.com/HewlettPackard/quartz

no rule to make target pmc.o #7

Open SunnyBeike opened 7 years ago

SunnyBeike commented 7 years ago

Hi, I met the following error while compiling the Quartz code:

    [root@localhost build]# make
    [ 8%] Built target cpu
    [ 82%] Built target nvmemul
    [ 86%] Device]
    make[5]: *** No rule to make target `/home/sbl/Quartz/quartz-master/build/src/dev/pmc.o', needed by `/home/sbl/Quartz/quartz-master/build/src/dev/nvmemul.o'. Stop.
    make[4]: *** [_module_/home/sbl/Quartz/quartz-master/build/src/dev] Error 2
    gmake[3]: *** [all] Error 2
    make[2]: *** [src/dev/nvmemul.ko] Error 2
    make[1]: *** [src/dev/CMakeFiles/dev_build.dir/all] Error 2
    make: *** [all] Error 2

The environment I use is 2Socket Xeon5600/CentOS-7/Linux4.10/gcc-4.8.5. I have installed all the required packages in the README.md, and compile the code in the following steps:

    mkdir build
    cd build
    cmake ..
    make

and the aforementioned error occurs... Any suggestions?

Thank you very much.

hvolos commented 7 years ago

Could you please check whether pmc.c is found in Quartz/quartz-master/build/src/dev/? If not, try removing the build folder and rebuilding.

SunnyBeike commented 7 years ago

Thanks Hvolos. pmc.c is found in quartz-master/build/src/dev. I have tried removing the build folder and rebuilding several times, but it didn't work. It seems that pmc.c is not being compiled as part of the module. Since no other errors were reported, I don't know how to solve this problem.

hvolos commented 7 years ago

What happens if you try to change into the quartz-master/build/src/dev/ directory and invoke 'make' directly from in there? Do you still get the same problem? Can you send me the list of files in that directory after you run the make command? Also, could you please make sure you have the kernel source (kernel-devel package) installed.

SunnyBeike commented 7 years ago

I am using Linux 4.10 on CentOS 7. I have installed kernel-devel (kernel-devel-3.10.0-514.21.1.el7.x86_64) through 'yum install'.
Both /lib/modules/3.10.0-514.21.1.el7.x86_64 and /lib/modules/4.10.0 exist.

Files in the build/src/dev directory are: CMakeFiles cmake_install.cmake CMakeLists.txt ioctl_query.h Makefile pmc.c

I got the same problem after invoking 'make' in build/src/dev:

    [root@localhost dev]# make
    make -C /lib/modules/`uname -r`/build M=`pwd` modules
    make[1]: Entering directory `/home/sbl/Quartz/linux-4.10'
    make[2]: *** No rule to make target `/home/sbl/Quartz/quartz-master/build/src/dev/pmc.o', needed by `/home/sbl/Quartz/quartz-master/build/src/dev/nvmemul.o'. Stop.
    make[1]: *** [_module_/home/sbl/Quartz/quartz-master/build/src/dev] Error 2
    make[1]: Leaving directory `/home/sbl/Quartz/linux-4.10'
    make: *** [all] Error 2

thanks hvolos!

SunnyBeike commented 7 years ago

Hi, I solved the problem. It is caused by the CONFIG_STACK_VALIDATION option being enabled when the Linux kernel was compiled. If you use a recent kernel (Linux 4.10, for example, which I use in order to support NVML), you should disable this option in 'make menuconfig' when building the kernel.

However, I met another problem while running quartz.

    [root@localhost scripts]# ./runenv.sh ls
    Turbo Boost disabled for all CPUs
    ./../build/src/lib/libnvmemul.so
    ./../nvmemul.ini
    ./runenv.sh: line 52: 11838 Segmentation fault (core dumped) $@

I found that the program exits abnormally after failing to open /proc/cpuinfo in the 'cpuinfo()' function.

I am still trying... Any suggestions will be appreciated.

guimagalhaes commented 7 years ago

Please set the debug level to the maximum and paste the log output here. A stack trace would also be very useful: enable core dumps (ulimit -c unlimited), run the application again, open the generated core file with GDB, and run 'bt' to get the backtrace. Thanks.

SunnyBeike commented 7 years ago

Thanks. The program crashes at init()->cpu_model()->is_Intel()->cpu_model_name()->cpuinfo()->fopen(). Actually, as soon as 'fopen' is called inside Quartz, the program crashes. However, 'fopen' works normally in my own programs.

I set the debug level to 5. The error log is :

    [root@localhost scripts]# ./runenv.sh ls
    Turbo Boost disabled for all CPUs
    ./../build/src/lib/libnvmemul.so
    ./../nvmemul.ini
    [27285] [1496202133] DEBUG in dbg_init </home/sbl/Quartz/quartz-master/src/lib/debug.c,90>:
    [27285] [1496202133] INFO in register_self </home/sbl/Quartz/quartz-master/src/lib/thread.c,296>: Registering thread tid [27285]
    ./runenv.sh: line 52: 27285 Segmentation fault (core dumped) $@

The following information was obtained with 'ulimit -c unlimited' followed by 'gdb -c core.28051':

    Core was generated by `ls'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x00007f5c323637d7 in ?? ()
    (gdb) bt
    #0  0x00007f5c323637d7 in ?? ()
    #1  0x0000000000000000 in ?? ()
    (gdb)

And, the error log in 'dmesg' is [345041.923903] ls[28051]: segfault at 40 ip 00007f5c323637d7 sp 00007ffca131b5d0 error 4 in libnvmemul.so[7f5c32359000+15000]
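
The dmesg line carries enough information to locate the faulting instruction inside the library: `ip` is the absolute address, and the bracketed pair after `libnvmemul.so` is the start address and size of the mapping. A minimal sketch of that arithmetic follows. The helper names are made up for illustration, and it assumes the mapping starts at the library's load base, which is typical for toolchains of that era but not guaranteed. The resulting offset is what one would usually hand to a symbolizer, together with a libnvmemul.so built with '-g'.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: offset of a faulting address inside a mapping.
 * Returns 0 and stores the offset when ip lies in [base, base + size),
 * otherwise returns -1 and leaves *off untouched. */
int fault_offset(uint64_t ip, uint64_t base, uint64_t size, uint64_t *off)
{
    if (ip < base || ip - base >= size)
        return -1;
    *off = ip - base;
    return 0;
}

/* Applies the helper to the values copied from the dmesg line above. */
void print_fault_offset(void)
{
    uint64_t off;

    if (fault_offset(0x7f5c323637d7ULL, 0x7f5c32359000ULL, 0x15000ULL, &off) == 0)
        printf("offset in libnvmemul.so: 0x%llx\n", (unsigned long long) off);
    else
        printf("address is outside the mapping\n");
}
```

For the values in this report the offset works out to 0xa7d7, which lies inside the 0x15000-byte mapping.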

Thanks.

guimagalhaes commented 7 years ago

Could you repeat the experiment with an application other than /bin/ls? Let's use something simple: maybe you could write a small C program and run it with Quartz.

SunnyBeike commented 7 years ago

Hi, I tried several simple programs, but the same errors were reported.

Although init() calls unsetenv() to avoid recursive preloads, this did not work for me. fopen() still uses the pthread_mutex_*() functions provided by Quartz, and the errors occur.

I worked around this by setting a flag in init(). The flag tells the pthread_mutex_*() functions whether or not to use the preloaded versions. This works only for very simple programs.

But I don't think this is the right way to make Quartz work, because I hit another segfault while running the following test:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUFFER_SIZE 1024
    #define REPEAT_TIMES (1024 * 4)

    int main(void)
    {
        int i = 0;
        char *mem;

        /* Allocate, fill, and free a small buffer repeatedly. */
        while (i < REPEAT_TIMES) {
            mem = (char *) malloc(sizeof(char) * BUFFER_SIZE);
            memset(mem, 'a', BUFFER_SIZE);
            free(mem);
            printf("\n%d\n", i++);
        }
        return 0;
    }

The 'bt' reports the following errors:

    #0  0x00007fd9a7a4b2d8 in ?? ()
    #1  0x00007fd9a7a4b87c in ?? ()
    #2  0x0000000000002710 in ?? ()
    #3  0x00007fd9a7c528e0 in ?? ()
    #4  0x00007fd9a7e75000 in ?? ()
    #5  0x00007fd9a7a4a0c7 in ?? ()
    #6  0x0000000000fe7030 in ?? ()
    #7  0x00007fd9a7e75000 in ?? ()
    #8  0x0000000000fe7030 in ?? ()
    #9  0x00007fd9a7e75000 in ?? ()
    #10 0x00000000000003e8 in ?? ()
    #11 0x00007fd9a7a4569a in ?? ()
    #12 0x00007fd9a706de00 in ?? ()
    #13 0x0000000000fe37d0 in ?? ()
    #14 0x0000003000000010 in ?? ()
    #15 0x00007ffeb1c8f440 in ?? ()
    #16 0x00007ffeb1c8f380 in ?? ()
    #17 0x0000000000000000 in ?? ()

'dmesg' reports the following errors:

[30047.409912] traps: a.out[5361] general protection ip:7fd9a7a4b2d8 sp:7ffeb1c8ed68 error:0

SunnyBeike commented 7 years ago

By the way, I find that programs that don't call pthread_create() are not registered with the monitor thread. If I use a program that doesn't call pthread_create(), will it be monitored?

hvolos commented 7 years ago

The correct behavior is for fopen to use the pthread_mutex_*() functions provided by Quartz, as Quartz needs to interpose on these events. The Quartz version then calls the actual pthread_mutex_*() function. I am wondering whether Quartz actually calls its own version, resulting in a recursive call.

Concerning the bt report. Could you please compile with the symbols on so that we can know the function name causing the problem?

guimagalhaes commented 7 years ago

The threads will be monitored, but pthread is required. Any program is monitored since the main thread is registered as part of init(). At the time the init()....cpu()...fopen() is called, as part of the initialization, the interposing should be already made. However, it seems the function pointer is NULL. This problem seems related to library load order. Please, set LD_PRELOAD to the quartz library (see runenv.sh) and then run "ldd ". Let's see the order of the libraries and what we can do.

guimagalhaes commented 7 years ago

I mean, 'ldd [app]'...

SunnyBeike commented 7 years ago

Thanks, Hvolos. I have added "add_definitions("-Wall -g")" in the CMakeLists.txt. However, when the error occurs, I still cannot get the symbols.

SunnyBeike commented 7 years ago

If I use './runenv.sh ldd [app]', a segfault occurs. If I use 'export' to set the preload environment, the following error is reported: ERROR: ld.so: object '/src/lib/libnvmemul.so' from LD_PRELOAD cannot be preloaded: ignored.

I am quite sure that the problem is related to the preloaded Quartz library. It is perhaps caused by the pthread_mutex_*() functions, because many glibc APIs (e.g., fopen) use them. I am still checking.

hvolos commented 7 years ago

What linux kernel are you using? I just found that since Linux 4.0, there has been a behavior change when reading performance counters directly from user mode. Specifically, Quartz configures user mode access to counters by setting the right msr registers, but it looks like since Linux 4.0, the default behavior is to allow user mode access only to processes with active perf events. Otherwise, trying to read counters from user mode will cause a protection violation (seg fault). Please try the following and see whether this fixes your problem:

echo "2" | sudo tee /sys/bus/event_source/devices/cpu/rdpmc

guimagalhaes commented 7 years ago

Should we add this check to runenv.sh?

    v=$(uname -r | cut -d '.' -f1)
    if [ $v -ge 4 ]; then
        echo "2" | sudo tee /sys/bus/event_source/devices/cpu/rdpmc
    fi
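
If the same version check were done from C (for example during library initialization rather than in runenv.sh), it might look like the sketch below. This is only an illustration: `kernel_major` and `running_kernel_major` are made-up names, and nothing in this thread says Quartz performs the check this way.

```c
#include <stdlib.h>
#include <sys/utsname.h>

/* Hypothetical helper: parse the major version from a release string such
 * as "4.10.0". Returns -1 if the string does not start with a
 * non-negative number. */
int kernel_major(const char *release)
{
    char *end;
    long major = strtol(release, &end, 10);

    if (end == release || major < 0)
        return -1;
    return (int) major;
}

/* Major version of the running kernel, or -1 if uname() fails. */
int running_kernel_major(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return kernel_major(u.release);
}
```

A caller would compare the result against 4 before touching the rdpmc setting, mirroring the `-ge 4` test in the shell version.
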
SunnyBeike commented 7 years ago

Hi, I am using Linux4.10. It didn't work to set 'rdpmc'.

The segfault is caused by the preloaded library, which recurses endlessly and overflows the stack. I solved this problem (temporarily) by setting a static per-thread flag to avoid recursively calling the preloaded functions. The following code shows how I modified pthread_mutex_lock(). pthread_create() and the other pthread_mutex_*() functions should be modified in a similar way.

    static __thread int no_hook;

    int pthread_mutex_lock(pthread_mutex_t *mutex)
    {
        int err;

        if (__lib_pthread_mutex_lock == NULL)
            init_interposition();
        /* Already inside a hook on this thread: call the real function directly. */
        if (no_hook) {
            err = __lib_pthread_mutex_lock(mutex);
            return err;
        }
        no_hook = 1;
        if (latency_model.enabled) {
            if (reached_min_epoch_duration(thread_self())) {
                create_latency_epoch();
            }
        }
        if (__lib_pthread_mutex_lock == NULL)
            init_interposition();
        err = __lib_pthread_mutex_lock(mutex);
        no_hook = 0;
        return err;
    }

I have successfully run the benchmarks provided by Quartz, several simple programs written by myself, and VoltDB as well. But it still reports a segfault when running redis-server.
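
The guard described in this comment can be shown in isolation. The following is a minimal sketch, not Quartz code: `real_lock`, `hooked_lock`, and the two counters are hypothetical stand-ins, and the "real" function deliberately calls back into the hook to imitate a library routine that re-enters the interposed symbol.

```c
static __thread int in_hook;   /* per-thread re-entrancy flag */
int hook_work_count;           /* times the instrumented path ran */
int real_call_count;           /* times the underlying function ran */

int hooked_lock(int depth);

/* Stand-in for the real function. While depth > 0 it re-enters the hook. */
static int real_lock(int depth)
{
    real_call_count++;
    if (depth > 0)
        return hooked_lock(depth - 1);
    return 0;
}

/* Stand-in for an interposed function such as pthread_mutex_lock(). */
int hooked_lock(int depth)
{
    int err;

    if (in_hook)               /* re-entered: skip the instrumentation */
        return real_lock(depth);

    in_hook = 1;
    hook_work_count++;         /* instrumentation would go here */
    err = real_lock(depth);
    in_hook = 0;               /* cleared on the only exit path */
    return err;
}
```

Note what the guard does and does not do: it keeps the instrumentation from running again while the thread is already inside a hook, but it would not stop unbounded recursion if the "real" pointer itself resolved back to the hook.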

guimagalhaes commented 7 years ago

I don't understand this fix. Are you saying that after init_interposition(), __lib_pthread_mutex_lock points to this local pthread_mutex_lock, so that the call through __lib_pthread_mutex_lock recurses infinitely? We should look for a fix in init_interposition(), or terminate the application if __lib_pthread_mutex_lock is NULL. In that case there is probably a problem with the library load order. Can you provide the output of 'ldd [binary]' for the applications you are using? Is there a different problem with redis? Can you provide the backtrace? Please compile the libraries with '-g'. Thanks.

hvolos commented 7 years ago

Please also do the following, and let us know of the result:

  1. Update your local git repo (git pull)

  2. Remove the direct linking of test_mutex to nvmemul by editing test/CMakeLists.txt, changing

    target_link_libraries(test_mutex nvmemul pthread)

  to

    target_link_libraries(test_mutex pthread)

  3. Build library and tests

  4. Try running test/test_mutex with and without LD_PRELOAD.

  5. Now, try running the test inside gdb. Assuming you built inside quartz/build:

    $ gdb ./build/test/test_mutex
    (gdb) set environment LD_PRELOAD=./build/src/lib/libnvmemul.so
    (gdb) b init_interposition
    (gdb) r

  When you hit the breakpoint, step through init_interposition using 'n':

    (gdb) n
    (gdb) n

  Keep stepping until you reach the end of init_interposition, when all the __lib_pthread_* variables have been initialized. Then print:

    (gdb) p __lib_pthread_mutex_lock
    (gdb) p pthread_mutex_lock

  Please also list the loaded libraries:

    (gdb) info sharedlibrary

Please report the result of the above commands

SunnyBeike commented 7 years ago

Thanks Hvolos. If I run test_mutex with the pre_load environment, segfault is reported. However, it succeeds without pre_load library.

Here is the result of gdb:

[root@localhost test]# gdb ./test_mutex
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/sbl/Quartz/quartz-test/quartz-master/build/test/test_mutex...done.
(gdb) set environment LD_PRELOAD=../src/lib/libnvmemul.so
(gdb) b init_interposition
Function "init_interposition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (init_interposition) pending.
(gdb) r
Starting program: /home/sbl/Quartz/quartz-test/quartz-master/build/test/./test_mutex
ERROR: nvmemul: Configuration file nvmemul.ini not found.
ERROR: nvmemul: Initialization failed. Running without non-volatile memory emulation.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, init_interposition () at /home/sbl/Quartz/quartz-test/quartz-master/src/lib/interpose.c:47
47      {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.2.x86_64 libconfig-1.4.9-5.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libgomp-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64
(gdb) n
50          __lib_pthread_create = dlsym(RTLD_NEXT, "pthread_create");
(gdb)
47      {
(gdb)
50          __lib_pthread_create = dlsym(RTLD_NEXT, "pthread_create");
(gdb)
51          __lib_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock");
(gdb)
50          __lib_pthread_create = dlsym(RTLD_NEXT, "pthread_create");
(gdb)
51          __lib_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock");
(gdb)
52          __lib_pthread_mutex_trylock = dlsym(RTLD_NEXT, "pthread_mutex_trylock");
(gdb)
51          __lib_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock");
(gdb)
52          __lib_pthread_mutex_trylock = dlsym(RTLD_NEXT, "pthread_mutex_trylock");
(gdb)
53          __lib_pthread_mutex_unlock = dlsym(RTLD_NEXT, "pthread_mutex_unlock");
(gdb) n
52          __lib_pthread_mutex_trylock = dlsym(RTLD_NEXT, "pthread_mutex_trylock");
(gdb)
53          __lib_pthread_mutex_unlock = dlsym(RTLD_NEXT, "pthread_mutex_unlock");
(gdb)
54          __lib_pthread_detach = dlsym(RTLD_NEXT, "pthread_detach");
(gdb)
53          __lib_pthread_mutex_unlock = dlsym(RTLD_NEXT, "pthread_mutex_unlock");
(gdb)
54          __lib_pthread_detach = dlsym(RTLD_NEXT, "pthread_detach");
(gdb)
56          if (__lib_pthread_mutex_lock == NULL || __lib_pthread_mutex_unlock == NULL ||
(gdb) n
54          __lib_pthread_detach = dlsym(RTLD_NEXT, "pthread_detach");
(gdb)
56          if (__lib_pthread_mutex_lock == NULL || __lib_pthread_mutex_unlock == NULL ||
(gdb)
57                  __lib_pthread_create == NULL || __lib_pthread_mutex_trylock == NULL ||
(gdb)
65      }
(gdb)
pthread_mutex_trylock (mutex=0x7ffff7bbf2c0 <mtxInit>) at /home/sbl/Quartz/quartz-test/quartz-master/src/lib/interpose.c:163
163     }
(gdb) p __lib_pthread_mutex_lock
$1 = (int (*)(pthread_mutex_t *)) 0x7ffff77a7bd0 <pthread_mutex_lock>
(gdb) p pthread_mutex_lock
$2 = {int (pthread_mutex_t *)} 0x7ffff7bc6b50 <pthread_mutex_lock>
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7dddaf0  0x00007ffff7df7520  Yes (*)     /lib64/ld-linux-x86-64.so.2
0x00007ffff7bc5900  0x00007ffff7bcfa1c  Yes         ../src/lib/libnvmemul.so
0x00007ffff79bc0f0  0x00007ffff79be198  Yes         /lib64/libachk.so
0x00007ffff77a38a0  0x00007ffff77ae784  Yes (*)     /lib64/libpthread.so.0
0x00007ffff74f0510  0x00007ffff75575aa  Yes (*)     /lib64/libstdc++.so.6
0x00007ffff7198460  0x00007ffff7202988  Yes (*)     /lib64/libm.so.6
0x00007ffff6f7faf0  0x00007ffff6f8f298  Yes (*)     /lib64/libgcc_s.so.1
0x00007ffff6bdb3b0  0x00007ffff6d2014f  Yes (*)     /lib64/libc.so.6
0x00007ffff69b8e60  0x00007ffff69b9960  Yes (*)     /lib64/libdl.so.2
0x00007ffff67aef60  0x00007ffff67b3e60  Yes (*)     /lib64/libconfig.so.9
0x00007ffff65a3440  0x00007ffff65a7e04  Yes (*)     /lib64/libnuma.so.1
0x00007ffff639a250  0x00007ffff639d04c  Yes (*)     /lib64/librt.so.1
0x00007ffff6179090  0x00007ffff618fc5c  Yes (*)     /lib64/libgomp.so.1
(*): Shared library is missing debugging information.
(gdb)

By the way, several files in test/ include "gtest/gtest.h", which does not exist in the source tree. I simply commented out this line to avoid compile errors.

SunnyBeike commented 7 years ago

Here is how I modify the source code:

src/lib/init.c, init()

int g_start_flag = 0;  // Avoid using the pre_load library while initialization
void init() {
     ...
     unsetenv("LD_PRELOAD");
    g_start_flag = 1;
     ...
    if(ld_preload_path)
          setenv("LD_PRELOAD", ld_preload_path, 1);
     g_start_flag = 0;
     return;
}

src/lib/interpose.c

extern int g_start_flag;
/* These per-thread variables are used to avoid recursion */
pthread_once_t once_create = PTHREAD_ONCE_INIT;
pthread_once_t once_lock = PTHREAD_ONCE_INIT;
pthread_once_t once_trylock = PTHREAD_ONCE_INIT;
pthread_once_t once_unlock = PTHREAD_ONCE_INIT;

static __thread int no_hook_create = 1;
void once_no_hook_create() {
    no_hook_create = 1;
}
static __thread int no_hook_lock = 1;
void once_no_hook_lock() {
    no_hook_lock = 1;
}
static __thread int no_hook_trylock = 1;
void once_no_hook_trylock() {
    no_hook_trylock = 1;
}
static __thread int no_hook_unlock = 1;
void once_no_hook_unlock() {
    no_hook_unlock = 1;
}
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine) (void *), void *arg)
{
    int ret;

    pthread_once(&once_create, once_no_hook_create);
    pthread_once(&once_lock, once_no_hook_lock);
    pthread_once(&once_trylock, once_no_hook_trylock);
    pthread_once(&once_unlock, once_no_hook_unlock);
    //assert(__lib_pthread_create);
    if (__lib_pthread_create == NULL)
        init_interposition();

    if(g_start_flag == 1) {

            return __lib_pthread_create(thread, attr, start_routine, arg);
    }
    if(no_hook_create) {
            ret = __lib_pthread_create(thread, attr, start_routine, arg);
        no_hook_create = 0;
        return ret;
    }
    no_hook_create = 1;

    if (latency_model.enabled) {
        pthread_create_functor_t *functor = malloc(sizeof(pthread_create_functor_t));
        functor->arg = arg;
        functor->start_routine = start_routine;

        if ((ret = __lib_pthread_create(thread, attr, __interposed_start_routine, (void*) functor)) != 0) {
            DBG_LOG(ERROR, "call to __lib_pthread_create failed\n");
            return ret;
        }
    } else {
        ret = __lib_pthread_create(thread, attr, start_routine, arg);
    }
    no_hook_create = 0;

    return ret;    
}

int pthread_mutex_lock(pthread_mutex_t *mutex)
{
    int err;

    DBG_LOG(DEBUG, "\nEntering: %s\n", __func__)
    if (__lib_pthread_mutex_lock == NULL)
        init_interposition();

    if(g_start_flag == 1) {
        err =  __lib_pthread_mutex_lock(mutex);
        return err; 
    }
    if(no_hook_lock) {
        err =  __lib_pthread_mutex_lock(mutex);
        no_hook_lock = 0;
        return err;
    }
    no_hook_lock = 1;

    if (latency_model.enabled) {
        if(reached_min_epoch_duration(thread_self())) {
            // create new epoch here in order to propagate only the critical session delay to other threads
            // the thread monitor will keep trying to create new epoch, unless the min duration has not been reached
            create_latency_epoch();
        }
    }

    //DBG_LOG(DEBUG, "interposing pthread_mutex_lock\n");
    err =  __lib_pthread_mutex_lock(mutex);

    no_hook_lock = 0;

    return err;
}

int pthread_mutex_trylock(pthread_mutex_t *mutex)
{
    int err;

    DBG_LOG(DEBUG, "\nEntering: %s\n", __func__)
    if (__lib_pthread_mutex_trylock == NULL)
        init_interposition();

    if(g_start_flag == 1) {

        return __lib_pthread_mutex_trylock(mutex);
    }
    if(no_hook_trylock) {
        err =  __lib_pthread_mutex_trylock(mutex);
        no_hook_trylock = 0;
        return err;
    }
    no_hook_trylock = 1;

  if (latency_model.enabled) {
        if(reached_min_epoch_duration(thread_self())) {
            create_latency_epoch();
        }
    }

    //DBG_LOG(DEBUG, "interposing pthread_mutex_trylock\n");

    err =  __lib_pthread_mutex_trylock(mutex);

    no_hook_trylock = 0;

    return err;
}

int pthread_mutex_unlock(pthread_mutex_t *mutex)
{
    int err;

    DBG_LOG(DEBUG, "\nEntering: %s\n", __func__)
    if (__lib_pthread_mutex_unlock == NULL)
        init_interposition();

    if(g_start_flag == 1) {

        return  __lib_pthread_mutex_unlock(mutex);
    }

    if(no_hook_unlock) {
        err =  __lib_pthread_mutex_unlock(mutex);
        no_hook_unlock = 0;
        return err;
    }
    no_hook_unlock = 1;

    if (latency_model.enabled) {
        if (reached_min_epoch_duration(thread_self())) {
            create_latency_epoch();
        }
    }

    //DBG_LOG(DEBUG, "interposing pthread_mutex_unlock\n");
    err = __lib_pthread_mutex_unlock(mutex);

    no_hook_unlock = 0;

    return err;
}
SunnyBeike commented 7 years ago

I found the reason why redis fails to start with Quartz.

Quartz modifies the environment first. Redis detects the change and tries to reset the environment by freeing the related memory (see spt_init() in redis), which is then accessed later. This is why a segfault is reported when running redis with Quartz.

I think the preload method is unsafe for some applications, whose environment handling may conflict with Quartz (of course, this bug should also be fixed in redis).

SunnyBeike commented 7 years ago

By the way, at the start of register_thread(), the allocated memory will not be freed if thread_manager is NULL.

int register_thread(thread_manager_t* thread_manager, pthread_t pthread, pid_t tid)
{
    int ret = 0;
    int cpu_id;
    int virtual_node_id;
    thread_t* thread = malloc(sizeof(thread_t));

    if (thread_manager == NULL) {
        // this is possible if both BW and latency modeling are enabled and the BW model is not yet created.
        // the BW modeling will spawn threads which will attempt to register with the thread manager if the
        // latency modeling is enabled. However the thread manager is instantiated later.
        //goto error;
        return E_SUCCESS;
    }

I think the memory should be freed (free(thread)) before "return E_SUCCESS". Also, it would be better to delete the commented-out "goto error".
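
An alternative to freeing before the early return is to allocate only after the NULL check, so there is nothing to free. The sketch below illustrates the idea with simplified stand-ins: the `thread_manager_t` and `thread_t` definitions and the `E_ERROR` value are placeholders, not the Quartz definitions, and the rest of the function is omitted.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <pthread.h>

/* Placeholder definitions for illustration only. */
#define E_SUCCESS 0
#define E_ERROR   1
typedef struct { int thread_count; } thread_manager_t;
typedef struct { pthread_t pthread; pid_t tid; } thread_t;

int register_thread_sketch(thread_manager_t *thread_manager,
                           pthread_t pthread, pid_t tid)
{
    thread_t *thread;

    /* Return before allocating, so the early exit cannot leak. */
    if (thread_manager == NULL)
        return E_SUCCESS;

    thread = malloc(sizeof(*thread));
    if (thread == NULL)
        return E_ERROR;

    thread->pthread = pthread;
    thread->tid = tid;
    thread_manager->thread_count++;

    /* The real function would link the thread into the manager. This
     * sketch has nowhere to keep it, so it releases it again. */
    free(thread);
    return E_SUCCESS;
}
```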

hvolos commented 7 years ago

I understand the fix, but it's not obvious to me why it gets into endless recursion. Does this endless recursion happen when running inside gdb?

Based on the debugging information you provided, pthread_mutex_lock corresponds to the Quartz method (its address falls within libnvmemul.so) and __lib_pthread_mutex_lock corresponds to the POSIX threads method (its address falls within libpthread.so). pthread_mutex_lock calls __lib_pthread_mutex_lock, which points to a different method, so I don't really see why it gets into a recursion.

I am wondering whether there is a race, because the call to init_interposition is not protected by a lock.
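
If a race in lazy initialization were the concern, one option (not something proposed in this thread) is to publish each resolved pointer atomically, relying on the fact that symbol resolution is idempotent. The sketch below uses C11 atomics. `resolve_real_lock` is a placeholder standing in for a dlsym(RTLD_NEXT, ...) lookup so that the example is self-contained, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef int (*lock_fn_t)(void *);

atomic_int resolve_count;                 /* how often resolution ran */
static _Atomic(lock_fn_t) real_lock_fn;   /* NULL until resolved */

/* Stand-in for the underlying function. */
static int fake_real_lock(void *mutex)
{
    (void) mutex;
    return 0;
}

/* Placeholder for a dlsym(RTLD_NEXT, "pthread_mutex_lock") lookup. */
static lock_fn_t resolve_real_lock(void)
{
    atomic_fetch_add(&resolve_count, 1);
    return fake_real_lock;
}

/* Returns the resolved pointer, resolving it on first use. Threads that
 * race here all store the same value, so duplicate work is harmless. */
lock_fn_t get_real_lock(void)
{
    lock_fn_t fn = atomic_load_explicit(&real_lock_fn, memory_order_acquire);

    if (fn == NULL) {
        fn = resolve_real_lock();
        atomic_store_explicit(&real_lock_fn, fn, memory_order_release);
    }
    return fn;
}
```

This avoids taking a lock inside a function that itself interposes on locking, at the cost of possibly resolving the symbol more than once under contention.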

SunnyBeike commented 7 years ago

Since init_interposition() is called whenever a __lib_pthread_* pointer is NULL, it is quite complicated to check whether there is a race while initializing the interposition. I am afraid we have to unset the environment every time before init_interposition() is called, the same way init() does. However, I don't know why it doesn't work for me while it works on other people's machines.

SunnyBeike commented 7 years ago

Hi, I tried to avoid the race of init_interposition() in the following way. But it didn't solve the problem.

int pthread_create() {
 ...
   if (__lib_pthread_create == NULL) {
        //init_interposition();
        __lib_pthread_create = dlsym(RTLD_NEXT, "pthread_create");
    }
...
}

int pthread_mutex_lock() {
 ...
   if (__lib_pthread_mutex_lock == NULL) {
        //init_interposition();
        __lib_pthread_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock");
    }
...
}

int pthread_mutex_trylock() {
 ...
   if (__lib_pthread_mutex_trylock == NULL) {
        //init_interposition();
        __lib_pthread_mutex_trylock = dlsym(RTLD_NEXT, "pthread_mutex_trylock");
    }
...
}

int pthread_mutex_unlock() {
 ...
   if (__lib_pthread_mutex_unlock == NULL) {
        //init_interposition();
        __lib_pthread_mutex_unlock = dlsym(RTLD_NEXT, "pthread_mutex_unlock");
    }
...
}
hvolos commented 7 years ago

Although I actually expect init_interposition() to be called when the library is loaded, could you try modifying the test and run this with a single thread to completely rule out any races?

Since we established through GDB that symbols pthread_mutex_lock and __lib_pthread_mutex_lock point to the right methods (program counters), I am still confused about the origin of the recursive call. Perhaps you could add a breakpoint (or extend your fix to add a call to dbg_backtrace provided by debug.h) when you detect a recursive call to see where that call originates from.

SunnyBeike commented 7 years ago

Hi, I tested with a single-threaded application. It doesn't work unless I use g_start_flag to avoid using the preloaded functions during initialization.

hvolos commented 7 years ago

I installed CentOS 7 and tried to reproduce the error you are facing, but I was unsuccessful. I used the default 3.10 kernel and the most recent kernel 4.11.

SunnyBeike commented 7 years ago

That is quite strange. I tried Kernel4.10 in CentOS 7. And the libc version is libc-2.17.so. Since the OS has been used for a long time, there may exist some other unknown changes.