Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

Crash when opening same NC4 file from different threads BUT under global mutex #2496

Open rouault opened 2 years ago

rouault commented 2 years ago

This is an attempt at providing a minimum reproducer for https://github.com/OSGeo/gdal/issues/6253

The attached docker.zip contains a simple Dockerfile that builds libhdf5 and libnetcdf, plus a simple C++ program. The program loops, creating 2 threads per iteration; both threads open the same NC4 file, and one of them calls nc_inq_varname(). It is to be noted that all calls to the netCDF API are protected by a common mutex, so there is no concurrent access to the netCDF API.

I've tried different versions of hdf5 and netcdf, and compiling hdf5 with or without --enable-unsupported --enable-threadsafe, but the crash always occurs.

How to reproduce:

unzip docker.zip
cd docker
docker build -t netcdf_issue .
docker run --rm -it netcdf_issue:latest  /usr/bin/test /alldatatypes.nc
docker run --rm -it netcdf_issue:latest  valgrind /usr/bin/test /alldatatypes.nc || echo "this failed!"

results in

T1 begin
T2 begin
T1 open 65536
T2 open 131072
T1 close 65536
T1 end
T2 begin nc_inq_varname(131072, 15)
this failed!

and under valgrind:

$ docker run --rm -it netcdf_issue:latest  valgrind /usr/bin/test /alldatatypes.nc
[...]
==1== Thread 3:
==1== Invalid read of size 1
==1==    at 0x4F921F8: H5F_addr_decode (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519A1F6: H5VL__native_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x51922AC: H5VL_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x517F48F: H5T__vlen_disk_isnull (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x510387B: H5T__conv_vlen (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x50EDD6B: H5T_convert (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F57159: H5D_get_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519B3D4: H5VL__native_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x5186A77: H5VL_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F2772A: H5Dget_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x48EDCB5: nc4_get_var_meta (in /usr/lib/libnetcdf.so.19.1.0)
==1==    by 0x48EAC09: nc4_hdf5_find_grp_var_att (in /usr/lib/libnetcdf.so.19.1.0)
==1==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==1== 
==1== 
==1== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1==  Access not within mapped region at address 0x40
==1==    at 0x4F921F8: H5F_addr_decode (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519A1F6: H5VL__native_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x51922AC: H5VL_blob_specific (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x517F48F: H5T__vlen_disk_isnull (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x510387B: H5T__conv_vlen (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x50EDD6B: H5T_convert (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F57159: H5D_get_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x519B3D4: H5VL__native_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x5186A77: H5VL_dataset_get (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x4F2772A: H5Dget_create_plist (in /usr/lib/libhdf5.so.302.0.0)
==1==    by 0x48EDCB5: nc4_get_var_meta (in /usr/lib/libnetcdf.so.19.1.0)
==1==    by 0x48EAC09: nc4_hdf5_find_grp_var_att (in /usr/lib/libnetcdf.so.19.1.0)
==1==  If you believe this happened as a result of a stack

Dockerfile:

FROM ubuntu:20.04
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --fix-missing --no-install-recommends g++ make autoconf automake zlib1g-dev wget tar ca-certificates
RUN wget https://github.com/HDFGroup/hdf5/archive/refs/tags/hdf5-1_13_2.tar.gz && \
    tar xvzf hdf5-1_13_2.tar.gz && \
    cd hdf5-hdf5-1_13_2 && \
    #  --enable-unsupported --enable-threadsafe
    ./configure --prefix=/usr --disable-static --disable-tests && \
    make -j$(nproc) install
RUN wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz && \
    tar xvzf v4.9.0.tar.gz && \
    cd netcdf-c-4.9.0 && \
    ./configure --prefix=/usr && \
    make -j$(nproc) install
COPY test.cpp /
COPY alldatatypes.nc /
RUN g++ /test.cpp -lnetcdf -lpthread -o /usr/bin/test
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --fix-missing --no-install-recommends valgrind

test.cpp:

#include <thread>
#include <mutex>
#include <chrono>
#include <cstdio>

#include "netcdf.h"

std::mutex oMutex;
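// All netCDF API calls in the threads below are made while holding oMutex,
// so the library is never entered concurrently, only sequentially from
// different threads.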

int main(int argc, char* argv[])
{
    for (int i = 0; i < 1000; i++) {
        std::thread t1([argv] {
            int cdfid;
            printf("T1 begin\n");
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_open(argv[1], NC_NOWRITE, &cdfid);
                printf("T1 open %d\n", cdfid);
            }
            std::this_thread::sleep_for(std::chrono::microseconds(1));
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_close(cdfid);
                printf("T1 close %d\n", cdfid);
            }
            printf("T1 end\n");
        });
        std::thread t2([argv] {
            int cdfid2;
            printf("T2 begin\n");
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_open(argv[1], NC_NOWRITE, &cdfid2);
                printf("T2 open %d\n", cdfid2);
            }
            std::this_thread::sleep_for(std::chrono::microseconds(1));
            {
                int nVarId = 0;
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_inq_varid(cdfid2, "string_var", &nVarId);
                char szName[NC_MAX_NAME + 1] = {};
                printf("T2 begin nc_inq_varname(%d, %d)\n", cdfid2, nVarId);
                nc_inq_varname(cdfid2, nVarId, szName);
                printf("T2 end nc_inq_varname(%d, %d)\n", cdfid2, nVarId);
            }
            {
                std::lock_guard<std::mutex> oLock(oMutex);
                nc_close(cdfid2);
                printf("T2 close %d\n", cdfid2);
            }
            printf("T2 end\n");

        });
        t1.join();
        t2.join();
    }
    printf("success!\n");
    return 0;
}
lnicola commented 2 years ago

Possibly related, we're seeing a similar crash when opening different files: https://github.com/georust/gdal/issues/299. One difference is that, in our case, it seems to overflow the stack while trying to print a stack trace (but that's just a guess; I'm not familiar with the two libraries).

DennisHeimbigner commented 2 years ago

Do you know if you are using pthreads for your mutex?

rouault commented 2 years ago

Do you know if you are using pthreads for your mutex?

Using std::thread requires linking the code with -lpthread. Apparently the std::mutex implementation on Linux / gcc also uses pthread_mutex_lock() / pthread_mutex_unlock() underneath (see the disassembly below, and the native_handle() check sketched after it).

cf

$ cat test2.cpp 

#include <mutex>

std::mutex x;
int y = 0;
int foo()
{
    std::lock_guard<std::mutex> locker(x);
    return ++y;
}

$ g++ -O2 test2.cpp -c
$ LC_ALL=C objdump -sdwxC test2.o

0000000000000000 <foo()>:
   0:   f3 0f 1e fa             endbr64 
   4:   41 54                   push   %r12
   6:   53                      push   %rbx
   7:   48 83 ec 08             sub    $0x8,%rsp
   b:   48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx        # 12 <foo()+0x12>  e: R_X86_64_REX_GOTPCRELX   __pthread_key_create-0x4
  12:   48 85 db                test   %rbx,%rbx
  15:   74 10                   je     27 <foo()+0x27>
  17:   48 8d 3d 00 00 00 00    lea    0x0(%rip),%rdi        # 1e <foo()+0x1e>  1a: R_X86_64_PC32   x-0x4
  1e:   e8 00 00 00 00          callq  23 <foo()+0x23>  1f: R_X86_64_PLT32  pthread_mutex_lock-0x4
  23:   85 c0                   test   %eax,%eax
  25:   75 2d                   jne    54 <foo()+0x54>
  27:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 2d <foo()+0x2d>  29: R_X86_64_PC32   y-0x4
  2d:   44 8d 60 01             lea    0x1(%rax),%r12d
  31:   44 89 25 00 00 00 00    mov    %r12d,0x0(%rip)        # 38 <foo()+0x38> 34: R_X86_64_PC32   y-0x4
  38:   48 85 db                test   %rbx,%rbx
  3b:   74 0c                   je     49 <foo()+0x49>
  3d:   48 8d 3d 00 00 00 00    lea    0x0(%rip),%rdi        # 44 <foo()+0x44>  40: R_X86_64_PC32   x-0x4
  44:   e8 00 00 00 00          callq  49 <foo()+0x49>  45: R_X86_64_PLT32  pthread_mutex_unlock-0x4
  49:   48 83 c4 08             add    $0x8,%rsp
  4d:   44 89 e0                mov    %r12d,%eax
  50:   5b                      pop    %rbx
  51:   41 5c                   pop    %r12
  53:   c3                      retq   
  54:   89 c7                   mov    %eax,%edi
  56:   e8 00 00 00 00          callq  5b <x+0x3b>  57: R_X86_64_PLT32  std::__throw_system_error(int)-0x4

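For completeness, a quicker check than reading the disassembly (a minimal sketch, assuming Linux with libstdc++ as in the Docker image above): std::mutex::native_handle() exposes the underlying pthread_mutex_t directly, so the lock taken through the C++ API is the same pthread lock.

#include <mutex>
#include <pthread.h>
#include <type_traits>
#include <cstdio>

int main()
{
    // On libstdc++, native_handle_type is pthread_mutex_t*; this fails to
    // compile on platforms where std::mutex is not a pthread wrapper.
    static_assert(std::is_same<std::mutex::native_handle_type, pthread_mutex_t*>::value,
                  "std::mutex does not wrap pthread_mutex_t here");
    std::mutex m;
    pthread_mutex_t* h = m.native_handle();
    pthread_mutex_lock(h);   // same lock a std::lock_guard<std::mutex> would take
    pthread_mutex_unlock(h);
    std::printf("std::mutex wraps the pthread_mutex_t at %p\n", static_cast<void*>(h));
    return 0;
}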
Note that the original reproducer in GDAL uses a global lock that is a pthread_mutex_t. I seem to have found a workaround on the GDAL side in https://github.com/OSGeo/gdal/pull/6311, by re-using the same netCDF handle when sequences like nc_open("same_file.nc", ...); nc_open("same_file.nc", ...); are done (the idea is sketched below).
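
The idea of that workaround, as a hypothetical sketch only (shared_nc_open / shared_nc_close are made-up helpers, not the actual code of the GDAL PR): keep a mutex-protected, reference-counted cache of read-only netCDF handles keyed by file name, so that a second nc_open() of the same file reuses the ncid returned by the first one instead of creating a second NC4/HDF5 file object.

#include <map>
#include <mutex>
#include <string>
#include <utility>

#include "netcdf.h"

static std::mutex gCacheMutex;
// file name -> (ncid, reference count); assumes read-only (NC_NOWRITE) access
// and that every netCDF call in the process is already serialized, as in the
// reproducer above.
static std::map<std::string, std::pair<int, int>> gHandleCache;

static int shared_nc_open(const std::string& path, int* ncidp)
{
    std::lock_guard<std::mutex> lock(gCacheMutex);
    auto it = gHandleCache.find(path);
    if (it != gHandleCache.end()) {
        // Reuse the handle of the already-open file instead of opening it again.
        ++it->second.second;
        *ncidp = it->second.first;
        return NC_NOERR;
    }
    int ret = nc_open(path.c_str(), NC_NOWRITE, ncidp);
    if (ret == NC_NOERR)
        gHandleCache[path] = std::make_pair(*ncidp, 1);
    return ret;
}

static int shared_nc_close(const std::string& path)
{
    std::lock_guard<std::mutex> lock(gCacheMutex);
    auto it = gHandleCache.find(path);
    if (it == gHandleCache.end())
        return NC_EBADID;
    if (--it->second.second > 0)
        return NC_NOERR;         // other users of the file keep it open
    int ret = nc_close(it->second.first);
    gHandleCache.erase(it);
    return ret;
}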