OvertureMaps / overturemaps-py

Configurable Timeouts when Using as a Library #44

Open kevinkreiser opened 1 month ago

kevinkreiser commented 1 month ago

Hi! Thanks for making this excellent tooling and for making sure the data is accessible to everyone who wants to use it. I'm currently installing this repo's pip package and calling core.geodataframe to pull a given layer from the data in my own scripting.
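For reference, the call looks roughly like this (the bbox and type values here are just illustrative):

    from overturemaps import core

    # bbox is (min_lon, min_lat, max_lon, max_lat); values are illustrative
    bbox = (-122.35, 47.60, -122.30, 47.65)
    gdf = core.geodataframe("segment", bbox=bbox)
    print(len(gdf), "segments fetched")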

I have a very poor internet connection (over the cellular network). I'm not exactly sure how the GeoParquet format decides which bits of the files it needs to pull, but it must do a fair amount of back and forth (HTTP range requests, probably) when fetching data for a given bbox. What I'm seeing locally is all kinds of timeout errors. They vary slightly, but the bulk of them look similar to the following:

    IOError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached. Detail: Python exception: Traceback (most recent call last):
      File "/home/kk/scratch/venv/lib/python3.10/site-packages/overturemaps/core.py", line 45, in <genexpr>
        non_empty_batches = (b for b in batches if b.num_rows > 0)
      File "pyarrow/_dataset.pyx", line 3769, in _iterator
      File "pyarrow/_dataset.pyx", line 3387, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached

Do you have any advice for this scenario? Can I configure the timeout to give my requests a bit longer to do the back and forth needed to get the data from a given Parquet file? I've not yet checked whether I can control curl, AWS, or pyarrow externally, but will research that shortly. Thanks in advance!

I should also mention I found a similar issue that discusses some other potential problems with AWS configuration: https://github.com/apache/arrow/issues/36007

EDIT:

I've modified the relevant bit in core.py:record_batch_reader to set timeouts:

    import pyarrow.dataset as ds
    from pyarrow import fs

    # give a slow connection more headroom: 60s to connect, 120s per request
    s3_options = {
        'anonymous': True, 'region': 'us-west-2',
        'connect_timeout': 60, 'request_timeout': 120,
    }
    dataset = ds.dataset(path, filesystem=fs.S3FileSystem(**s3_options))

Sadly, the result is a different error:

    AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 56, Failure when receiving data from the peer

Perhaps this is AWS itself hanging up on me because I'm pulling data too slowly?
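If the drops are transient, one knob worth trying is pyarrow's built-in S3 retry strategy. Something like the following (a sketch I haven't verified on my connection; path is the same dataset path used in record_batch_reader) layers retries on top of the longer timeouts:

    import pyarrow.dataset as ds
    from pyarrow import fs

    # retry transient S3/network failures up to 5 times using the AWS SDK's
    # standard exponential-backoff behavior, on top of the longer timeouts
    s3 = fs.S3FileSystem(
        anonymous=True, region='us-west-2',
        connect_timeout=60, request_timeout=120,
        retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=5),
    )
    dataset = ds.dataset(path, filesystem=s3)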

EDIT:

It seems that if I reduce the amount of parallelism to 1, that is to say no parallelism at all: 1 overture type with 1 bbox at a time, then it consistently returns results. Maybe this is because the AWS client underneath (pyarrow uses the AWS C++ SDK rather than boto) does its own parallel requests and I'm pushing the limits of my connection?
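For anyone else hitting this, the serialized approach looks roughly like this (a sketch; the set_io_thread_count call is an extra, untested knob that also caps pyarrow's own IO thread pool):

    import pyarrow
    from overturemaps import core

    # optional: cap pyarrow's internal IO thread pool as well
    pyarrow.set_io_thread_count(1)

    bbox = (-122.35, 47.60, -122.30, 47.65)
    # one overture type with one bbox at a time, no parallelism
    for overture_type in ("segment", "connector"):
        gdf = core.geodataframe(overture_type, bbox=bbox)
        print(overture_type, len(gdf))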