OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.85k stars 2.53k forks source link

vsicurl: sigpipe during gdalwarp cleanup #9677

Closed rcoup closed 6 months ago

rcoup commented 6 months ago

What is the bug?

We've recently run into a new curl/libssl/SIGPIPE issue with gdalwarp:

gdalwarp --config GDAL_CACHEMAX 256 -wm 128 -multi -of GTiff -co TILED=YES -co COMPRESS=DEFLATE -setci -ovr AUTO-1 -r nearest -cvmd '' -te -11427641 3600489 -11271098 3757032 -tr 9 9 /path/to/in.vrt /path/to/out.tif

In this case in.vrt is a VRT around a single RGB GTiff on S3 via /vsis3/ and the VRT applies a CRS transform. It gets right to the end of the warp, outputs 100% progress, and then gets a SIGPIPE and quits (13).

...
.100 - done.
[Thread 0x7ffff0ac9640 (LWP 1423) exited]
[Thread 0x7fffebfff640 (LWP 1424) exited]

Thread 1 "gdalwarp" received signal SIGPIPE, Broken pipe.

This doesn't happen reliably, but often enough doing hundreds of warps. Seems to help when the node is under load or doing many such warps in parallel.

Backtrace from GDB

#0  __GI___libc_write (nbytes=31, buf=0x55555566ed53, fd=6) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=6, buf=buf@entry=0x55555566ed53, nbytes=31) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007ffff5d63683 in sock_write (b=0x555555af99a0, in=0x55555566ed53 "\025\003\003", inl=31) at ../crypto/bio/bss_sock.c:141
#3  0x00007ffff5d5331b in bwrite_conv (bio=<optimized out>, data=<optimized out>, datal=<optimized out>, written=0x7fffffff9850) at ../crypto/bio/bio_meth.c:77
#4  0x00007ffff5d56c26 in bio_write_intern (b=0x555555af99a0, data=0x55555566ed53, dlen=31, written=written@entry=0x7fffffff9890) at ../crypto/bio/bio_lib.c:362
#5  0x00007ffff5d56d57 in BIO_write (dlen=<optimized out>, data=<optimized out>, b=<optimized out>) at ../crypto/bio/bio_lib.c:384
#6  BIO_write (b=<optimized out>, data=<optimized out>, dlen=<optimized out>) at ../crypto/bio/bio_lib.c:376
#7  0x00007ffff47c5e5e in ssl3_write_pending (s=s@entry=0x555555b32960, type=type@entry=21, buf=buf@entry=0x555555b32b08 "\001", len=len@entry=2, written=written@entry=0x7fffffffaae8) at ../ssl/record/rec_layer_s3.c:1195
#8  0x00007ffff47c869b in do_ssl3_write (s=<optimized out>, type=21, buf=0x555555b32b08 "\001", pipelens=<optimized out>, numpipes=<optimized out>, create_empty_fragment=0, written=0x7fffffffaae8) at ../ssl/record/rec_layer_s3.c:1147
#9  0x00007ffff479c32f in ssl3_dispatch_alert (s=0x555555b32960) at ../ssl/s3_msg.c:84
#10 0x00007ffff47a6fb5 in ssl3_shutdown (s=0x555555b32960) at ../ssl/s3_lib.c:4394
#11 ssl3_shutdown (s=0x555555b32960) at ../ssl/s3_lib.c:4379
#12 0x00007ffff63222e8 in ossl_closeone.isra.0 (data=data@entry=0x5555559b5f00, conn=conn@entry=0x555555ac3e40, connssl=<optimized out>, connssl=<optimized out>) at vtls/openssl.c:1449
#13 0x00007ffff630fd16 in ossl_close (data=0x5555559b5f00, conn=0x555555ac3e40, sockindex=<optimized out>) at vtls/openssl.c:1467
#14 0x00007ffff6305e1a in Curl_ssl_close (sockindex=0, conn=0x555555ac3e40, data=0x5555559b5f00) at vtls/vtls.c:696
#15 conn_shutdown (conn=0x555555ac3e40, data=0x5555559b5f00) at /build/curl-zCVr3D/curl-7.81.0/debian/build/lib/url.c:757
#16 Curl_disconnect (data=0x5555559b5f00, conn=0x555555ac3e40, dead_connection=false) at /build/curl-zCVr3D/curl-7.81.0/debian/build/lib/url.c:876
#17 0x00007ffff62bcc96 in Curl_conncache_close_all_connections (connc=0x5555559b5e50) at /build/curl-zCVr3D/curl-7.81.0/debian/build/lib/conncache.c:554
#18 0x00007ffff62ea3f1 in curl_multi_cleanup (multi=0x5555559b5d50) at /build/curl-zCVr3D/curl-7.81.0/debian/build/lib/multi.c:2711
#19 curl_multi_cleanup (multi=0x5555559b5d50) at /build/curl-zCVr3D/curl-7.81.0/debian/build/lib/multi.c:2673
#20 0x00007ffff6cf582c in cpl::(anonymous namespace)::CachedConnection::clear (this=0x5555559b0ff8) at port/./port/cpl_vsil_curl.cpp:3580
#21 cpl::VSICurlFilesystemHandlerBase::ClearCache (this=0x555555601c50) at port/./port/cpl_vsil_curl.cpp:3840
#22 0x00007ffff6d4d8bd in cpl::VSIS3FSHandler::ClearCache (this=<optimized out>) at port/./port/cpl_vsil_s3.cpp:2173
#23 0x00007ffff6d4d8eb in cpl::VSIS3FSHandler::~VSIS3FSHandler (this=<optimized out>, this=<optimized out>) at port/./port/cpl_vsil_s3.cpp:2163
#24 0x00007ffff6d4d92d in cpl::VSIS3FSHandler::~VSIS3FSHandler (this=<optimized out>, this=<optimized out>) at port/./port/cpl_vsil_s3.cpp:2165
#25 0x00007ffff6cc75bf in VSIFileManager::~VSIFileManager (this=<optimized out>, this=<optimized out>) at port/./port/cpl_vsil.cpp:3128
#26 0x00007ffff6cd0499 in VSICleanupFileManager () at port/./port/cpl_vsil.cpp:3265
#27 0x00007ffff78b937e in GDALDriverManager::~GDALDriverManager (this=<optimized out>, this=<optimized out>) at gcore/./gcore/gdaldrivermanager.cpp:276
#28 0x00007ffff78b94ed in GDALDriverManager::~GDALDriverManager (this=<optimized out>, this=<optimized out>) at gcore/./gcore/gdaldrivermanager.cpp:163
#29 GDALDestroyDriverManager () at gcore/./gcore/gdaldrivermanager.cpp:1174
#30 0x00005555555568fd in main (argc=<optimized out>, argv=<optimized out>) at ./apps/gdalwarp_bin.cpp:405

Steps to reproduce the issue

The libcurl thread safety docs have:

When CURLOPT_NOSIGNAL is set to 1L, your application needs to deal with the risk of a SIGPIPE (that at least the OpenSSL backend can trigger). Note that setting CURLOPT_NOSIGNAL to 0L does not work in a threaded situation as there is a race condition where libcurl risks restoring the former signal handler while another thread should still ignore it.

Which is fine, GDAL sets CURLOPT_NOSIGNAL=1 and has its own CPLHTTPIgnoreSigPipe() and CPLHTTPRestoreSigPipeHandler(). The last sentence above was added after GDAL's handlers were written.

Poking about in cpl_vsil_curl.cpp...

  1. AFAICS there's nothing to prevent race conditions wrt setting/restoring the SIGPIPE handler across multiple threads? But that doesn't seem like the exact issue we're encountering.
  2. curl_multi_cleanup() is called after CPLHTTPRestoreSigPipeHandler() in VSICurlHandle::AdviseRead()
  3. AFAICS there's no SIGPIPE handling in CachedConnection::clear() or callers
  4. There's several curl_easy_cleanup() calls which don't appear to be wrapped either.

I don't know whether it's simpler just to save/restore one handler when curl is active, but I'm maybe missing some context where SIGPIPE is used elsewhere?

Versions and provenance

Additional context

No response

rouault commented 6 months ago

@rcoup hopefully https://github.com/OSGeo/gdal/pull/9681 should solve your issue?

AFAICS there's nothing to prevent race conditions wrt setting/restoring the SIGPIPE handler across multiple threads?

true, this all area is rather fragile. I don't think there's a real way of dealing with signals in a thread-safe way. The man page of signal() says "the effects of signal() in a multithreaded process are unspecified"... Ideally openssl shouldn't cause SIGPIPE to be triggered at all.

I don't know whether it's simpler just to save/restore one handler when curl is active, but I'm maybe missing some context where SIGPIPE is used elsewhere?

yes, installing signal() just once is perhaps safer (but only if signal() affects all threads). I don't think the rest of GDAL cares about SIGPIPE, but an application using GDAL might... We should probably have a GDAL_CURL_NOSIGNAL config option that an application could set so that GDAL doesn't try to mess with curl/openssl related signals...

rcoup commented 6 months ago

Sorry, been tied up with some other stuff. Thanks — I'll roll that change into an app build and let you know whether it eliminates it.