gmaze closed this issue 1 month ago
One small test:
from argopy import ArgoIndex
from argopy.stores import httpstore
import fsspec
import numpy as np
import xarray as xr
idx = ArgoIndex(index_file='bgc-s').load().search_wmo_cyc(6903091, np.arange(1, 45))
urls = [idx.host + "/dac/" + str(f) for f in idx.search['file']]
Method 1:
%%time
fs = fsspec.filesystem("http")
out = fs.cat(urls) # fetches data concurrently
results = []
for url in out:
    results.append(xr.open_dataset(out[url]))
>>> CPU times: user 1.2 s, sys: 240 ms, total: 1.44 s
>>> Wall time: 6.95 s
Method 2:
%%time
results = httpstore().open_mfdataset(urls, concat=False)
>>> CPU times: user 1.52 s, sys: 255 ms, total: 1.78 s
>>> Wall time: 5.3 s
What's taking time is the creation of the xarray datasets, not the data download,
so maybe this is not where to look for performance improvements.
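One way to confirm this is to time the two phases separately. A minimal sketch, assuming the same urls list built above (not argopy code):

import time
import fsspec
import xarray as xr

fs = fsspec.filesystem("http")

t0 = time.perf_counter()
out = fs.cat(urls)  # concurrent HTTP downloads, returns a {url: bytes} dict
t1 = time.perf_counter()

# pure decoding into datasets, no network involved
results = [xr.open_dataset(data) for data in out.values()]
t2 = time.perf_counter()

print(f"download: {t1 - t0:.2f} s, dataset creation: {t2 - t1:.2f} s")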
This issue was marked as stale automatically because it has not seen any activity in 90 days.
This may be a design already implemented in the test_data CLI used to populate CI test data in mocked http servers. However, I wonder if we should do this when fetching a large number of files from one of the GDAC servers (https and s3)?
The fsspec http store is already asynchronous, but I don't quite understand how parallelisation is implemented for multi-file downloads:
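My understanding, which is an assumption based on fsspec's AsyncFileSystem design and should be checked, is that cat(urls) gathers one _cat_file coroutine per URL on a single asyncio event loop, so the concurrency comes from coroutines sharing an aiohttp session, not from threads. Roughly equivalent explicit async code would look like:

import asyncio
import fsspec

async def fetch_all(urls):
    # async filesystem instance: coroutine methods are prefixed with "_"
    fs = fsspec.filesystem("http", asynchronous=True)
    session = await fs.set_session()  # shared aiohttp ClientSession
    try:
        data = await asyncio.gather(*[fs._cat_file(u) for u in urls])
    finally:
        await session.close()
    return dict(zip(urls, data))

# out = asyncio.run(fetch_all(urls))  # roughly what fs.cat(urls) does behind the sync API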
Our current option is to possibly use multithreading with the parallel option of the data fetcher, that is, in httpstore.open_mfdataset. With this design, we apply pre/post-processing of Argo data on chunks in parallel, but that is different from downloading in parallel and then processing in parallel (possibly with another mechanism), e.g. https://stackoverflow.com/questions/57126286/fastest-parallel-requests-in-python. A sketch of that alternative is below.
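For reference, a hedged sketch of that two-stage alternative: download everything concurrently through fsspec's async HTTP layer, then decode and post-process with a separate thread pool. The names urls and preprocess are placeholders for illustration, not argopy API:

from concurrent.futures import ThreadPoolExecutor
import fsspec
import xarray as xr

def fetch_then_process(urls, preprocess=None, max_workers=8):
    fs = fsspec.filesystem("http")
    raw = fs.cat(urls)  # stage 1: concurrent downloads, {url: bytes}

    def decode(item):
        url, data = item
        ds = xr.open_dataset(data)  # stage 2: decoding into an xarray dataset
        return preprocess(ds) if preprocess is not None else ds

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode, raw.items()))

Whether the decoding stage actually benefits from threads depends on how much of it releases the GIL; a process pool could be swapped in for heavier post-processing.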