gmaze opened 2 months ago
One small test:
```python
import numpy as np
import fsspec
import xarray as xr

from argopy import ArgoIndex
from argopy.stores import httpstore

# Search the synthetic BGC index for one float and a range of cycles,
# then build the list of netCDF file URLs to download
idx = ArgoIndex(index_file='bgc-s').load().search_wmo_cyc(6903091, np.arange(1, 45))
urls = [idx.host + "/dac/" + str(f) for f in idx.search['file']]
```
Method 1:
```python
%%time
fs = fsspec.filesystem("http")
out = fs.cat(urls)  # fetches all files concurrently, returns a {url: bytes} dict
results = []
for url in out:
    results.append(xr.open_dataset(out[url]))
```
>>> CPU times: user 1.2 s, sys: 240 ms, total: 1.44 s
>>> Wall time: 6.95 s
Method 2:
```python
%%time
results = httpstore().open_mfdataset(urls, concat=False)
```
>>> CPU times: user 1.52 s, sys: 255 ms, total: 1.78 s
>>> Wall time: 5.3 s
What takes time is the creation of the xarray datasets, not the data download, so maybe this is not where to look for performance improvements.
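To confirm that split, one can time the two stages separately. A minimal sketch, reusing the `urls` list built above (the `io.BytesIO` wrapping assumes a backend engine, such as h5netcdf or scipy, that accepts file-like objects):

```python
import io
import time

import fsspec
import xarray as xr

fs = fsspec.filesystem("http")

t0 = time.perf_counter()
out = fs.cat(urls)  # stage 1: concurrent download of all files
t1 = time.perf_counter()

# stage 2: sequential creation of the xarray datasets from the raw bytes
results = [xr.open_dataset(io.BytesIO(data)) for data in out.values()]
t2 = time.perf_counter()

print(f"download: {t1 - t0:.2f}s, dataset creation: {t2 - t1:.2f}s")
```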
This may be a design already implemented in the `test_data` CLI used to populate CI test data in mocked HTTP servers. However, I wonder if we should do this when fetching a large number of files from one of the GDAC servers (https and s3)?
The fsspec http store is already asynchronous, but I don't quite understand how parallelisation is implemented for multi-file downloads:
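If I read fsspec correctly, `cat()` on an async filesystem schedules one coroutine per URL and awaits them together on a single event loop. Conceptually it is close to this aiohttp sketch (a simplified illustration, not fsspec's actual code):

```python
import asyncio
import aiohttp

async def fetch_all(urls):
    # one session shared by all requests, one coroutine per URL
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.read()
        # gather() runs all downloads concurrently on the same event loop
        return await asyncio.gather(*(fetch(u) for u in urls))

# contents = asyncio.run(fetch_all(urls))
```

So the download itself is concurrent I/O driven by a single thread, not multithreading.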
Our current option is to possibly use multithreading with the `parallel` option of the data fetcher, that is, in httpstore.open_mfdataset. With this design, we apply the pre/post-processing of Argo data on chunks in parallel, but that is different from downloading in parallel and then processing in parallel, possibly with another mechanism (see the sketch below), e.g.: https://stackoverflow.com/questions/57126286/fastest-parallel-requests-in-python
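For the record, here is a sketch of that two-stage design, under the assumption that the netCDF backend releases the GIL enough for threads to help (a ProcessPoolExecutor would be the alternative); the names are illustrative, not an argopy API:

```python
import io
from concurrent.futures import ThreadPoolExecutor

import fsspec
import xarray as xr

def open_one(data: bytes) -> xr.Dataset:
    # stage 2 workload: dataset creation, where most of the wall time goes
    return xr.open_dataset(io.BytesIO(data))

fs = fsspec.filesystem("http")
out = fs.cat(urls)  # stage 1: concurrent asynchronous download

# stage 2: build the xarray datasets in parallel with a separate worker pool
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(open_one, out.values()))
```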