OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.87k stars 2.54k forks source link

Segmentation fault in OSRGetProjTLSContext (race condition?) #2691

Closed jjimenezshaw closed 4 years ago

jjimenezshaw commented 4 years ago

Expected behavior and actual behavior.

Opening many TIFF files in parallel sometimes produces a segmentation fault in proj code. I created a small piece of code that tries to replicate that behavior.

Increase the number of threads if it does not triggers the segfault.

75b6faf from #2616 (currently in master, initializing curpid in OSRPJContextHolder) is probably fixing (or hiding) the problem because what I am doing is a new thread, not a fork. However I think it still may happen when a fork is done. This code is tested in 3.1.0

Setting the environment variable PROJ_LIB to find proj.db does not help. Anyhow, the program just want to read the TIFF, and do not care about coordinates. I understand that proj.db is not needed in GDAL if you do not use any CRS.

Steps to reproduce the problem.

Sample code (no deletion is done, I know). "sample.tif" must be there, a normal TIFF file. (I usually read different files, but this makes the sample simpler)

#include "gdal_priv.h"
#include "cpl_conv.h"

#include <iostream>
#include <thread>
#include <vector>

void threadfunc(int i)
{
    std::cout << "thread: " << i << std::endl;
    GDALDataset  *poDataset;
    poDataset = (GDALDataset *) GDALOpen( "sample.tif", GA_ReadOnly );
    if( poDataset == NULL )
    {
        std::cout << " failed opening sample.tif !!" << std::endl;
        return;
    }

    auto metadata = poDataset->GetMetadata("SUBDATASETS");
}

int main()
{
    GDALAllRegister();
    std::vector<std::thread> v;

    for(int i = 0; i < 8; i++) {
        v.emplace_back(std::thread(threadfunc, i));
    }
    for(auto& t : v) {
        t.join();
    }

    std::cout << "ok" << std::endl;
    return 0;
}

Backtrace:

#0  tcache_get (tc_idx=1) at malloc.c:2943
#1  __GI___libc_malloc (bytes=40) at malloc.c:3050
#2  0x00007f9962928234 in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#3  0x00007f99628f6224 in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#4  0x00007f99628fa07f in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#5  0x00007f99628fa365 in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#6  0x00007f9962929679 in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#7  0x00007f99629aeaf2 in ?? () from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
#8  0x00007f9962e19c64 in osgeo::proj::io::DatabaseContext::Private::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, projCtx_t*) ()
   from /home/jshaw/work/proj/build/install/lib/libproj.so.19
#9  0x00007f9962e25519 in osgeo::proj::io::DatabaseContext::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, projCtx_t*) ()
   from /home/jshaw/work/proj/build/install/lib/libproj.so.19
#10 0x00007f9962e4ad44 in getDBcontext () from /home/jshaw/work/proj/build/install/lib/libproj.so.19
#11 0x00007f9962e4aeb1 in proj_context_get_database_path () from /home/jshaw/work/proj/build/install/lib/libproj.so.19
#12 0x00007f9965417071 in GetProjTLSContextHolder () at ogr_proj_p.cpp:178
#13 0x00007f99654170a9 in OSRGetProjTLSContext () at ogr_proj_p.cpp:190
#14 0x00007f9965834fc1 in GTiffDatasetGTIFNew (hTIFF=0x7f994c002970) at geotiff.cpp:10968
#15 0x00007f99658411ef in GTiffDataset::LoadGeoreferencingAndPamIfNeeded (this=0x7f994c003be0) at geotiff.cpp:14612
#16 0x00007f996584e1d8 in GTiffDataset::GetMetadata (this=0x7f994c003be0, pszDomain=0x55b14827bf80 "SUBDATASETS") at geotiff.cpp:18777
#17 0x000055b14827a421 in threadfunc (i=5) at test.cpp:19

Now I compiled proj in debug

#0  tcache_get (tc_idx=2) at malloc.c:2943
#1  __GI___libc_malloc (bytes=bytes@entry=47) at malloc.c:3050
#2  0x00007f58c0dbc298 in operator new (sz=47) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:50
#3  0x00007f58c0e4dac7 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=this@entry=0x7f58b7dd75e0, 
    __str="/home/jshaw/work/proj/build/install/share/proj") at /build/gcc-8-quTRg6/gcc-8-8.4.0/build/x86_64-linux-gnu/libstdc++-v3/include/bits/char_traits.h:352
#4  0x00007f58c0e4de19 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__str=..., this=0x7f58b7dd75e0)
    at /build/gcc-8-quTRg6/gcc-8-8.4.0/build/x86_64-linux-gnu/libstdc++-v3/include/bits/basic_string.h:1364
#5  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (this=0x7f58b7dd75e0, __str=...)
    at /build/gcc-8-quTRg6/gcc-8-8.4.0/build/x86_64-linux-gnu/libstdc++-v3/include/bits/basic_string.h:695
#6  0x00007f58bf5fa1df in pj_open_lib_internal (ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>, name=0x7f58bf8a0a5b "proj.db", mode=0x7f58bf882d56 "rb", 
    open_file=0x7f58bf5fa884 <pj_open_file_with_manager(projCtx, char const*, char const*)>, out_full_filename=0x7f58a0009d60 "", out_full_filename_size=2047)
    at /home/jshaw/work/proj/src/filemanager.cpp:1496
#7  0x00007f58bf5fb824 in pj_find_file (ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>, short_filename=0x7f58bf8a0a5b "proj.db", out_full_filename=0x7f58a0009d60 "", 
    out_full_filename_size=2047) at /home/jshaw/work/proj/src/filemanager.cpp:1734
#8  0x00007f58bf7fdc8d in osgeo::proj::io::DatabaseContext::Private::open (this=0x7f58a0009800, databasePath="", ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>)
    at /home/jshaw/work/proj/src/iso19111/factory.cpp:507
#9  0x00007f58bf80027e in osgeo::proj::io::DatabaseContext::create (databasePath="", auxiliaryDatabasePaths=std::vector of length 0, capacity 0, ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>)
    at /home/jshaw/work/proj/src/iso19111/factory.cpp:854
#10 0x00007f58bf846b74 in projCppContext::getDatabaseContext (this=0x7f58a00096e0) at /home/jshaw/work/proj/src/iso19111/c_api.cpp:145
#11 0x00007f58bf846c5d in getDBcontext (ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>) at /home/jshaw/work/proj/src/iso19111/c_api.cpp:156
#12 0x00007f58bf847570 in proj_context_get_database_path (ctx=0x7f58bfbc9600 <pj_get_default_ctx::default_context>) at /home/jshaw/work/proj/src/iso19111/c_api.cpp:318
#13 0x00007f58c1edd071 in GetProjTLSContextHolder () at ogr_proj_p.cpp:178
#14 0x00007f58c1edd0a9 in OSRGetProjTLSContext () at ogr_proj_p.cpp:190
#15 0x00007f58c22fafc1 in GTiffDatasetGTIFNew (hTIFF=0x7f58a0002970) at geotiff.cpp:10968
#16 0x00007f58c23071ef in GTiffDataset::LoadGeoreferencingAndPamIfNeeded (this=0x7f58a0003be0) at geotiff.cpp:14612
#17 0x00007f58c23141d8 in GTiffDataset::GetMetadata (this=0x7f58a0003be0, pszDomain=0x55de85dfdf80 "SUBDATASETS") at geotiff.cpp:18777
#18 0x000055de85dfc421 in threadfunc (i=3) at test.cpp:19

In my complex application, I got a similar (but not the same) backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f6220b1e801 in __GI_abort () at abort.c:79
#2  0x00007f6220b67897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f6220c94b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f6220b6e90a in malloc_printerr (str=str@entry=0x7f6220c96890 "double free or corruption (!prev)") at malloc.c:5350
#4  0x00007f6220b75e84 in _int_free (have_lock=0, p=0x7f613801aaa0, av=0x7f6138000020) at malloc.c:4281
#5  __GI___libc_free (mem=0x7f613801aab0) at malloc.c:3124
#6  0x00007f621f65ed12 in std::_Sp_counted_ptr<osgeo::proj::io::DatabaseContext*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/jshaw/.conan/data/proj/7.0.1-2/pix4d/stable/package/80fc3deacdacad58d470d6b9fd5c4f054574d0a3/lib/libproj.so.19
#7  0x00007f621f665ff5 in proj_context_get_database_path () from /home/jshaw/.conan/data/proj/7.0.1-2/pix4d/stable/package/80fc3deacdacad58d470d6b9fd5c4f054574d0a3/lib/libproj.so.19
#8  0x00007f62245f6c58 in GetProjTLSContextHolder () at ogr_proj_p.cpp:178
#9  0x00007f62245f6e49 in OSRGetProjTLSContext () at ogr_proj_p.cpp:190
#10 0x00007f6223e13cec in GTiffDatasetGTIFNew (hTIFF=<optimized out>) at geotiff.cpp:10968
#11 0x00007f6223e2195c in GTiffDataset::LoadGeoreferencingAndPamIfNeeded (this=this@entry=0x7f613401f940) at geotiff.cpp:14612
#12 0x00007f6223e312d0 in GTiffDataset::GetMetadata (this=0x7f613401f940, pszDomain=0x563723115403 "SUBDATASETS") at geotiff.cpp:18777

or also this

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f25c132e801 in __GI_abort () at abort.c:79
#2  0x00007f25c1377897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f25c14a4b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f25c137e90a in malloc_printerr (str=str@entry=0x7f25c14a2c9d "corrupted size vs. prev_size") at malloc.c:5350
#4  0x00007f25c138615f in _int_free (have_lock=0, p=<optimized out>, av=0x7f2514000020) at malloc.c:4295
#5  __GI___libc_free (mem=<optimized out>) at malloc.c:3124
#6  0x00007f25bfe6ed12 in std::_Sp_counted_ptr<osgeo::proj::io::DatabaseContext*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/jshaw/.conan/data/proj/7.0.1-2/pix4d/stable/package/80fc3deacdacad58d470d6b9fd5c4f054574d0a3/lib/libproj.so.19
#7  0x0000562911b1d8f6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#8  0x00007f25bfe75d81 in getDBcontext () from /home/jshaw/.conan/data/proj/7.0.1-2/pix4d/stable/package/80fc3deacdacad58d470d6b9fd5c4f054574d0a3/lib/libproj.so.19
#9  0x00007f25bfe75eb1 in proj_context_get_database_path () from /home/jshaw/.conan/data/proj/7.0.1-2/pix4d/stable/package/80fc3deacdacad58d470d6b9fd5c4f054574d0a3/lib/libproj.so.19
#10 0x00007f25c4e06c58 in GetProjTLSContextHolder () at ogr_proj_p.cpp:178
#11 0x00007f25c4e06e49 in OSRGetProjTLSContext () at ogr_proj_p.cpp:190
#12 0x00007f25c4623cec in GTiffDatasetGTIFNew (hTIFF=<optimized out>) at geotiff.cpp:10968
#13 0x00007f25c463195c in GTiffDataset::LoadGeoreferencingAndPamIfNeeded (this=this@entry=0x7f252c01fb10) at geotiff.cpp:14612
#14 0x00007f25c46412d0 in GTiffDataset::GetMetadata (this=0x7f252c01fb10, pszDomain=0x562911f16403 "SUBDATASETS") at geotiff.cpp:18777

Operating system

Ubuntu 18.04 64 bit

GDAL version and provenance

GDAL 3.1.0 compiled from source code. PROJ 7.0.1 compiled from source code.

rouault commented 4 years ago

I believe https://github.com/OSGeo/gdal/commit/75b6fafc46cef6b4130d7e39cbc136c98cb23e7c (and its backport to 3.1 branch for 3.1.1) effectively fixed the issue for threads. I couldn't reproduce it, whereas if assigning 0 to curpid instead of getpid() as before the fix, I could reproduce it. And your use case actually revealed what was wrong: we were using the PROJ nullptr default context, which isn't thread-safe. In e873aa2, I've pushed a change to make the code hopefully a bit more robust.