asfadmin / Discovery-asf_search

[Bug] Incorrect behavior when using `ThreadPoolExecutor` to download multiple files #282

Open forrestfwilliams opened 5 months ago

forrestfwilliams commented 5 months ago

Describe the bug

To download files under specific file names, you must use the download_url function. However, when download_url is used together with ThreadPoolExecutor from concurrent.futures, the destination paths get mixed up: depending on the order in which the products become ready, each product is written to an arbitrary one of the specified file names.

To Reproduce

from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
from pathlib import Path

import asf_search

urls = [
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240313T140832_20240313T140859_052964_06694F_90B4/IW1/VV/7.tiff',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240301T140832_20240301T140859_052789_06635B_791A/IW1/VV/7.tiff',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240313T140832_20240313T140859_052964_06694F_90B4/IW1/VV/7.xml',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240301T140832_20240301T140859_052789_06635B_791A/IW1/VV/7.xml',
]
paths = [
    Path('./burst_20240313.tif'),
    Path('./burst_20240301.tif'),
    Path('./burst_20240313.xml'),
    Path('./burst_20240301.xml'),
]

session = asf_search.ASFSession()
with ThreadPoolExecutor() as executor:
    executor.map(
        asf_search.download_url,
        urls,
        [x.parent for x in paths],
        [x.name for x in paths],
        repeat(session, len(urls)),
    )

Expected behavior

The above should produce the same result as:

from pathlib import Path

import asf_search

urls = [
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240313T140832_20240313T140859_052964_06694F_90B4/IW1/VV/7.tiff',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240301T140832_20240301T140859_052789_06635B_791A/IW1/VV/7.tiff',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240313T140832_20240313T140859_052964_06694F_90B4/IW1/VV/7.xml',
    'https://sentinel1-burst.asf.alaska.edu/S1A_IW_SLC__1SDV_20240301T140832_20240301T140859_052789_06635B_791A/IW1/VV/7.xml',
]
paths = [
    Path('./burst_20240313.tif'),
    Path('./burst_20240301.tif'),
    Path('./burst_20240313.xml'),
    Path('./burst_20240301.xml'),
]

session = asf_search.ASFSession()
for url, path in zip(urls, paths):
    asf_search.download_url(url, path.parent, path.name, session)

forrestfwilliams commented 5 months ago

Notably, using a separate session for every thread resolves the issue. However, this workaround is less than ideal because creating a session per thread adds unnecessary overhead.
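For reference, the per-thread workaround could be sketched roughly as follows. This is a minimal sketch, not a fix for the library: download_all, session_factory, and max_workers are hypothetical names introduced here for illustration, and the download callable is only assumed to take (url, path, filename, session) in the same order as the download_url calls above. It creates one session per worker thread (via threading.local) rather than one per file, which should keep the overhead bounded by the pool size.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Callable, List, Sequence


def download_all(
    download: Callable[..., Any],
    session_factory: Callable[[], Any],
    urls: Sequence[str],
    paths: Sequence[Path],
    max_workers: int = 4,
) -> List[Any]:
    """Call download(url, directory, filename, session) for each url/path pair,
    giving every worker thread its own session (hypothetical helper)."""
    local = threading.local()  # attributes set on this object are per-thread

    def task(url: str, path: Path) -> Any:
        # Create the session lazily, once per worker thread, then reuse it
        # for every file that thread handles.
        if not hasattr(local, "session"):
            local.session = session_factory()
        return download(url, path.parent, path.name, local.session)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the lazy iterator, so an exception raised inside
        # a worker is re-raised here instead of being silently dropped.
        return list(executor.map(task, urls, paths))
```

With the lists from the reproduction above, this would presumably be called as download_all(asf_search.download_url, asf_search.ASFSession, urls, paths). It does not address why a shared session mixes up the destination paths in the first place.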