OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Support asyncio patterns in python #3475

Open dazza-codes opened 3 years ago

dazza-codes commented 3 years ago

GDAL should support asynchronous HTTP (and HTTP/2) with asyncio patterns. For example, all /vsis3/ reads are synchronous, with no way to await an S3 read. If libcurl does not support asynchronous patterns, GDAL could use a different dependency, or a wrapper, that does.

e.g. https://gist.github.com/owickstrom/3218376
e.g. https://stackoverflow.com/questions/11980311/libcurl-writecallback-async-c/11980430

(Apologies if gdal already supports asyncio patterns, happy to be corrected and pointed in the right direction. I don't use py-gdal directly, only rasterio wrappers on gdal.)

vincentsarago commented 3 years ago

@dazza-codes As far as I know, some work was done around 2010, described in https://gdal.org/development/rfc/rfc24_progressive_data_support.html, for the JPEG2000/ECW formats.

https://github.com/OSGeo/gdal/blob/633cfd90334ab604ad46c1cbcf36f083b27e49ab/autotest/gdrivers/ecw.py#L997-L1029

This was done specifically for some drivers, not for VSI*. I'm not even sure libcurl natively supports asynchronous operation:

> libcurl has no asynchronous interface. You can do that yourself either by using threads or by using the non-blocking "multi interface" that libcurl offers. Read up on the multi interface here:

ref: https://curl.se/mail/lib-2002-05/0090.html

You may have to use something like https://github.com/jbaldwin/liblifthttp
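The "using threads" option from the quote above can be sketched from the Python side without any change to GDAL. This is only a minimal illustration: `blocking_read` is a hypothetical stand-in for a blocking GDAL call (it sleeps instead of touching the network), and `asyncio.to_thread` requires Python 3.9 or later. Whether real reads overlap in practice would depend on the binding releasing the GIL during the call, which is assumed here rather than verified.

```python
import asyncio
import time


def blocking_read(path: str) -> bytes:
    # Hypothetical stand-in for a blocking call such as a /vsis3/ read.
    # It only sleeps and echoes the path; no GDAL API is used here.
    time.sleep(0.1)
    return path.encode()


async def read_many(paths):
    # Each blocking call runs in a worker thread, so the event loop
    # stays free to schedule other coroutines while the reads are pending.
    tasks = [asyncio.to_thread(blocking_read, p) for p in paths]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # The paths below are made up for illustration.
    results = asyncio.run(read_many(["/vsis3/bucket/a.tif", "/vsis3/bucket/b.tif"]))
    print(len(results))
```

This keeps the underlying calls blocking; it only moves the waiting off the event loop thread.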

rouault commented 3 years ago

> gdal should support asynchronous HTTP (and HTTP2 protocols) with asyncio patterns

What would be the use case? Most use of /vsicurl/ and similar network filesystems currently goes through other GDAL APIs, which are 99% blocking. As @vincentsarago pointed out, there is an async raster API, but it is only marginally used.

geospatial-jeff commented 3 years ago

If the GDAL API is blocking, async doesn't really add any value. Also, I don't think the libcurl multi interface you reference is "truly async"; it's more like client-side multiplexing, which still requires something to wait for each request to finish. In fact, the idea of "truly async" programming languages is much newer than curl, so this isn't surprising to me.

A common way to implement "truly async" with curl is with callbacks, since they allow code to run as requests finish, but I'm not familiar enough with GDAL to know whether using callbacks would be problematic (it usually is, e.g. JavaScript callback hell).
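For what it's worth, a callback-style client can usually be bridged to `await` on the Python side, which avoids nesting callbacks in user code. The sketch below is an assumption-laden illustration: `fetch_with_callback` is a hypothetical callback-based client faked with a thread, not a curl or GDAL API. Only the bridging pattern (`loop.create_future` plus `loop.call_soon_threadsafe`) is the point.

```python
import asyncio
import threading


def fetch_with_callback(url, on_done):
    # Hypothetical callback-style client: does the "request" on a worker
    # thread and invokes on_done(result) when it finishes.
    def worker():
        on_done(f"body of {url}")

    threading.Thread(target=worker).start()


async def fetch(url):
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def on_done(result):
        # Runs on the worker thread, so the result is handed back to the
        # event loop thread rather than set on the future directly.
        loop.call_soon_threadsafe(future.set_result, result)

    fetch_with_callback(url, on_done)
    # The coroutine suspends here until the callback has fired.
    return await future


if __name__ == "__main__":
    print(asyncio.run(fetch("https://example.com/data.tif")))
```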

dazza-codes commented 3 years ago

Suggestion: gdal might try to find and use an async lib and/or use C++11 std::async wrappers to support /aio* (to supplement /vsi*) for services that require an async client for HTTP/S, e.g. some related commentary in:

Suggestion to manage the experimental feature development (ignore if this is nonsense): the functionality could aim to provide additional features with no changes whatsoever to the existing libcurl and /vsi* functionality. Although I don't fully understand the intent of the /vsi* "namespace", perhaps a new "namespace" like /aio* could add experimental functionality to support async patterns, with exposure in python-swig to support asyncio. (Unfortunately I don't have the time, and don't know enough, to begin a PR draft.)

With regard to use-cases, if it is not obvious already, e.g.:

kylebarron commented 3 years ago

I agree in an ideal world GDAL would have good support for async loading, but I'd guess it could be a considerable amount of work and might not happen without some dedicated funding.

Note that since you specifically reference Python and GeoTIFFs, you might want to follow the development of https://github.com/geospatial-jeff/aiocogeo

dazza-codes commented 3 years ago

Via other channels, I just bumped into https://github.com/geospatial-jeff/aiocogeo - it supports asyncio patterns