OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Docs clarification: /vsicurl?list_dir=no should actually be /vsicurl?empty_dir=yes #7163

Open scottyhq opened 1 year ago

scottyhq commented 1 year ago

Expected behavior and actual behavior.

https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access

It describes the option to not list directories: https://github.com/OSGeo/gdal/blob/dfc719107e07c8e157cbcbba00c0676668b685a3/doc/source/user/virtual_file_systems.rst?plain=1#L239

But looking at the log output, list_dir=no doesn't appear to do anything, and empty_dir=yes has the intended effect instead:

Steps to reproduce the problem.

CPL_DEBUG=ON gdalinfo '/vsicurl?pc_url_signing=yes&list_dir=no&url=https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2021/045/031/LC08_L2SP_045031_20210107_20210307_02_T1/LC08_L2SP_045031_20210107_20210307_02_T1_ST_B10.TIF'
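To compare the two options side by side, the path can be assembled programmatically. The following is a minimal sketch, not GDAL code: vsicurl_path is a hypothetical helper, the URL is a shortened placeholder, and it assumes (per the documented /vsicurl? syntax) that GDAL URL-decodes option values, so percent-encoding characters such as '&' is safe.

```python
from urllib.parse import quote


def vsicurl_path(url, **options):
    """Build a '/vsicurl?' path with the options first and url= last.

    Hypothetical helper for illustration; not part of GDAL.
    """
    # Keep ':' and '/' readable, as in the command above, and percent-encode
    # the rest (for example '&' or '?', which would clash with the option syntax).
    parts = [f"{key}={quote(str(value), safe=':/')}" for key, value in options.items()]
    parts.append(f"url={quote(url, safe=':/')}")
    return "/vsicurl?" + "&".join(parts)


# Shortened placeholder URL, not the real Landsat asset path.
url = "https://example.com/data/scene_B10.TIF"

no_listing = vsicurl_path(url, pc_url_signing="yes", list_dir="no")
empty_dir = vsicurl_path(url, pc_url_signing="yes", empty_dir="yes")
print(no_listing)
print(empty_dir)
```

Either string can then be passed to gdalinfo with CPL_DEBUG=ON to compare which requests are issued.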

Operating system

OSX

GDAL version and provenance

gdal                      3.6.2           py311h619941e_3    conda-forge
libgdal                   3.6.2                h623d8b8_3    conda-forge

rouault commented 1 year ago

It does have an effect, but it is mostly visible when using low-level I/O primitives, and much less so with gdalinfo, which will still try to probe for side-car files even if the initial directory listing is disabled.

Perhaps this could be rephrased as ?

Compare without list_dir=no, which attempts a GET on the directory containing the file:

$ CPL_CURL_VERBOSE=YES python -c "from osgeo import gdal; f = gdal.VSIFOpenL('/vsicurl?pc_url_signing=yes&url=https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2021/045/031/LC08_L2SP_045031_20210107_20210307_02_T1/LC08_L2SP_045031_20210107_20210307_02_T1_ST_B10.TIF', 'rb')"
* Couldn't find host landsateuwest.blob.core.windows.net in the .netrc file; using defaults
*   Trying 20.150.76.4:443...
* TCP_NODELAY set
* Connected to landsateuwest.blob.core.windows.net (20.150.76.4) port 443 (#0)
* found 376 certificates in /etc/ssl/certs
* ALPN, offering h2
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
*    server certificate verification OK
*    server certificate status verification SKIPPED
*    common name: *.blob.core.windows.net (matched)
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: CN=*.blob.core.windows.net
*    start date: Sun, 25 Dec 2022 02:12:54 GMT
*    expire date: Mon, 25 Dec 2023 02:12:54 GMT
*    issuer: C=US,O=Microsoft Corporation,CN=Microsoft RSA TLS CA 02
* ALPN, server did not agree to a protocol
> GET /landsat-c2/level-2/standard/oli-tirs/2021/045/031/LC08_L2SP_045031_20210107_20210307_02_T1/ HTTP/1.1
Host: landsateuwest.blob.core.windows.net
User-Agent: GDAL/3.7.0
Accept: */*

* Mark bundle as not supporting multiuse
< HTTP/1.1 404 The specified resource does not exist.
< Content-Length: 223
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: 1db1b053-101e-0022-178b-381e13000000
< x-ms-version: 2014-02-14
< Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Length,Date,Transfer-Encoding
< Access-Control-Allow-Origin: *
< Date: Sat, 04 Feb 2023 11:22:58 GMT
< 
[....]

With list_dir=no, the file is accessed directly (actually, the first request shown is the URL signing call):

$ CPL_CURL_VERBOSE=YES python -c "from osgeo import gdal; f = gdal.VSIFOpenL('/vsicurl?pc_url_signing=yes&list_dir=no&url=https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2021/045/031/LC08_L2SP_045031_20210107_20210307_02_T1/LC08_L2SP_045031_20210107_20210307_02_T1_ST_B10.TIF', 'rb')"
* Couldn't find host planetarycomputer.microsoft.com in the .netrc file; using defaults
*   Trying 2620:1ec:4f:1::42:443...
* TCP_NODELAY set
* Connected to planetarycomputer.microsoft.com (2620:1ec:4f:1::42) port 443 (#0)
* found 376 certificates in /etc/ssl/certs
* ALPN, offering h2
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*    server certificate verification OK
*    server certificate status verification SKIPPED
*    common name: planetarycomputer.microsoft.com (matched)
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: C=US,ST=Washington,L=Redmond,O=Microsoft Corporation,CN=planetarycomputer.microsoft.com
*    start date: Wed, 31 Aug 2022 00:00:00 GMT
*    expire date: Wed, 30 Aug 2023 23:59:59 GMT
*    issuer: C=US,O=DigiCert Inc,CN=DigiCert TLS RSA SHA256 2020 CA1
* ALPN, server accepted to use h2
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x271da90)
> GET /api/sas/v1/sign?href=https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2021/045/031/LC08_L2SP_045031_20210107_20210307_02_T1/LC08_L2SP_045031_20210107_20210307_02_T1_ST_B10.TIF HTTP/2
Host: planetarycomputer.microsoft.com
user-agent: GDAL/3.7.0
accept: */*
accept-encoding: gzip

* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 200 
< date: Sat, 04 Feb 2023 11:24:25 GMT
< content-type: application/json
< content-length: 531
< strict-transport-security: max-age=15724800; includeSubDomains
< request-context: appId=cid-v1:75161b1b-6883-4b66-9410-715040c44427
< x-azure-ref: 20230204T112425Z-zdmpvd196551r6z8qen6retaf400000001q0000000001t6v
< x-cache: CONFIG_NOCACHE
< accept-ranges: bytes
[...]

Seeing this, if pc_url_signing=yes is set, we should actually likely automatically disable directory listing as it can't work.
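Until something like that exists in GDAL, the same rule could be applied on the caller's side. This is only a sketch of the idea, not GDAL's behavior: with_implied_defaults is a hypothetical function operating on a plain dict of /vsicurl? options, and letting an explicit list_dir win is a choice made for this example.

```python
def with_implied_defaults(options):
    """Return a copy of the options where URL signing implies list_dir=no.

    Hypothetical client-side rule; GDAL itself is not involved here.
    """
    effective = dict(options)
    if effective.get("pc_url_signing", "no").lower() == "yes":
        # Directory listing cannot work against signed URLs, so turn it off
        # unless the caller set list_dir explicitly.
        effective.setdefault("list_dir", "no")
    return effective


print(with_implied_defaults({"pc_url_signing": "yes"}))
```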

scottyhq commented 1 year ago

setting list_dir=no does not prevent higher level logic in GDAL drivers to probe for individual side-car files

Thanks for the clarification @rouault!

if pc_url_signing=yes is set, we should actually likely automatically disable directory listing as it can't work.

Makes sense to me. For what it's worth, the Planetary Computer JupyterHub automatically sets GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR.
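Outside that hub, the same setting can be applied per process. A minimal sketch, assuming it runs before GDAL opens any dataset; the try/except is only there so the snippet also runs where the GDAL Python bindings are not installed.

```python
import os

# GDAL falls back to environment variables for its configuration options.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

try:
    from osgeo import gdal
except ImportError:
    gdal = None  # Bindings not installed; only the environment variable is set.

if gdal is not None:
    # Same setting through the GDAL configuration API, for this process only.
    gdal.SetConfigOption("GDAL_DISABLE_READDIR_ON_OPEN", "EMPTY_DIR")
```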

That said, is there a reason not to reuse the list_dir key and add the additional value option for empty_dir list_dir=yes|no|empty_dir? From the docs alone, it's not clear whether each of these URL modifiers has a corresponding environment variable and overrides it. Happy to submit a PR to clarify the wording if that is helpful.

rouault commented 1 year ago

That said, is there a reason not to reuse the list_dir key and add the additional value option for empty_dir list_dir=yes|no|empty_dir?

Well, the GDAL_DISABLE_READDIR_ON_OPEN=YES/NO/EMPTY_DIR naming is quite hard to comprehend (a double negation, plus a non-boolean value EMPTY_DIR in a setting whose DISABLE prefix suggests a boolean), so the list_dir=yes/no & empty_dir=yes/no split was an (apparently bad) attempt at making things easier to comprehend.

Happy to submit a PR to clarify the wording if that is helpful.

welcome