Open aidanheerdegen opened 4 years ago
The slow bit is probably xarray's coordinate compatibility checking (you could test this by passing compat="override" to getvar()). If we're more sure that grids match for a given query, we could be less conservative and pass this option by default. In my opinion, getvar shouldn't ever be a long enough operation that it requires progress indication.
I gave this a crack. Doesn't seem to make much difference:
%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
start_time='1958-01-02 00:00:00', end_time='2019-01-01 00:00:00', frequency='1 daily')
CPU times: user 36.4 s, sys: 40.2 s, total: 1min 16s
Wall time: 38.1 s
%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
start_time='1958-01-02 00:00:00', end_time='2019-01-01 00:00:00',
frequency='1 daily', compat='override', coords='minimal')
CPU times: user 32.5 s, sys: 34.8 s, total: 1min 7s
Wall time: 33.6 s
cf.
%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
start_time='1958-01-02 00:00:00', end_time='1959-01-01 00:00:00', frequency='1 daily')
CPU times: user 639 ms, sys: 398 ms, total: 1.04 s
Wall time: 1.12 s
%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
start_time='1958-01-02 00:00:00', end_time='1969-01-01 00:00:00', frequency='1 daily')
CPU times: user 6.32 s, sys: 6.41 s, total: 12.7 s
Wall time: 6.49 s
Note that the default coords='different' throws an error. I also tried coords='all' but it didn't make any difference in this case.
Okay, this actually seems to be scheduler-related. Without a distributed scheduler (i.e. without creating a Client), it seems faster to use parallel=False on the open_mfdataset call in getvar() (can't pass this through kwargs at the moment; see the sketch below); in this case I got about 12 seconds for the first query. I don't seem to be able to use the threaded scheduler on a login node without just segfaulting, but using parallel=True with the multiprocessing scheduler:
with dask.config.set(scheduler="processes"):
    ...
It takes something like 90 seconds, so there's probably some kind of serialisation/lock contention/GIL/alien overhead here...
On the other hand, with a modest distributed scheduler:
from dask.distributed import Client
c = Client(n_workers=8, threads_per_worker=1)
the original query takes about 3-5s, so it can speed up a little bit over the purely serial approach.
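For reference, a rough sketch of the serial workaround (opening the files directly, since getvar() can't pass parallel through kwargs yet); the glob pattern and variable selection are hypothetical, getvar builds the real file list from the database:

import glob
import xarray as xr

# Hypothetical path pattern; getvar() resolves the real paths from the database index.
ncfiles = sorted(glob.glob("/path/to/01deg_jra55v140_iaf/output*/ocean/ocean_scalar*.nc"))

# Serial open avoids the serialisation overhead seen with the shared-memory schedulers.
ds = xr.open_mfdataset(
    ncfiles,
    combine="by_coords",
    coords="minimal",
    compat="override",
    parallel=False,
    preprocess=lambda d: d[["salt_global_ave"]],  # keep only the one variable
)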
Having said that, this is a scalar variable, so it's really just hitting the file opening and serialisation stuff. Opening the temp variable from the same dataset, it's about 5s as well with coords="minimal", compat="override" (perhaps as expected, this should basically be the same as the global average). It's only about a 2s increase to do the coordinate merging.
We could possibly put some kind of progress bar into the load through the preprocess mechanism that we use to load only a single variable out of each dataset, which I believe already saves us a lot of time. I'm not sure how this would interact with a parallel=True load, however, because the preprocessing then goes through a future...
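As a very rough sketch of that idea (tqdm and the counter here are just illustrative, not anything getvar currently does), a serial load could tick a progress bar from the preprocess callback:

import glob
import xarray as xr
from tqdm import tqdm  # illustrative choice of progress bar

# Hypothetical file list; getvar() builds the real list from the database.
ncfiles = sorted(glob.glob("/path/to/01deg_jra55v140_iaf/output*/ocean/ocean_scalar*.nc"))

pbar = tqdm(total=len(ncfiles), desc="opening files")

def preprocess(ds, variable="salt_global_ave"):
    # Keep only the requested variable (the trick getvar already uses),
    # and tick the progress bar as each file is opened.
    pbar.update(1)
    return ds[[variable]]

# Only meaningful with parallel=False: with parallel=True the preprocess calls
# run on dask workers, so a shared in-process counter wouldn't update like this.
ds = xr.open_mfdataset(ncfiles, combine="by_coords", parallel=False,
                       preprocess=preprocess)
pbar.close()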
We might be able to save some time on loading by passing parallel=True only when we have a distributed scheduler available. Shared-memory schedulers seem to benefit from parallel=False.
I think we can do this check as:
try:
    kwargs["parallel"] = dask.config.get("scheduler") == "dask.distributed"
except KeyError:
    pass  # no distributed scheduler available
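and then hand those kwargs straight through to the open call, something like (a sketch only, with a hypothetical preprocess and file list rather than the current getvar internals):

ds = xr.open_mfdataset(ncfiles, combine="by_coords",
                       preprocess=preprocess, **kwargs)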
Nice detective work @angus-g
It is such a shame that it is necessary. sigh
Another example that seems to just fail to load:
cc.querying.getvar('01deg_jra55v140_iaf', 'frazil', session,
start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
frequency='1 daily')
It is trying to load 794 files for a pretty modest time range:
len(cc.querying._ncfiles_for_variable('01deg_jra55v140_iaf', 'frazil', session,
                                      start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
                                      frequency='1 daily'))
794
because the daily ice data is still in individual files. @aekiss I thought the daily ice data was being auto collated these days?
If you tried to load the entire time series it is ... gulp ... 22280 files ...
I reckon this second example is one where we need to use coords="minimal", compat="override". We don't need to pay the cost of reading and comparing the coordinates across all the files, especially because we're pretty confident that we've matched files that correspond to a single diagnostic, on a single grid, for a single time frequency.
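i.e. something like (the same frazil query as above, just with the extra kwargs that getvar already forwards to open_mfdataset):

cc.querying.getvar('01deg_jra55v140_iaf', 'frazil', session,
                   start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
                   frequency='1 daily', coords='minimal', compat='override')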
For the 794 file selection, on an 8 process distributed scheduler, it took 13s wall time to load. I doubt the full timeseries would load in a reasonable amount of time, however. 22000+ files is maybe stretching it a little bit (gets to around 6 min if you just scale it out). I'm not sure it would be a very ergonomic series to work with anyway, since any operation would have to open a whole pile of files. It does sound like a good candidate to be collated...
The daily ice files have not been collated in 01deg_jra55v140_iaf - I agree this would be a good thing to do, I guess with this script? https://github.com/aidanheerdegen/cleancosima/blob/master/aggregateiceh.sh
Seems reasonable. I think that was just the commands we used once and I chucked them in a repo.
I've now collated the daily ice outputs on 01deg_jra55v140_iaf (they are iceh.????-??-daily.nc) - will they be automatically indexed overnight?
Also I've retained the original uncollated dailies (iceh.????-??-??.nc) but will delete them after I've let the users know. When I do this will they be automatically removed from the DB by the overnight indexing?
The new files will be indexed overnight, but leaving the unconcatenated copies in the same location will likely cause issues with queries as there is identical data but in different files.
When the files are removed they will be deleted from the database.
@aidanheerdegen in that case I think I'll rename iceh.????-??-??.nc to something like iceh.????-??-??.nc-DEPRECATED until I'm sure they can be deleted - that change to the file extension will remove them from the DB, right?
Yeah that will stop them being indexed, but if you're changing their name I don't see why you wouldn't just delete them (or move them to a temp directory somewhere to delete in a week or so once you're happy the collated data is ok).
If you rename them that'll stuff up anyone who is using the existing file path, in which case they can change to the collated files.
sync_data.sh in the latest configs now automatically concatenates daily ice data into monthly files before syncing to the destination (typically ik11 or cj50), where it will be indexed into the database.
For some datasets getvar can take quite a long time, some minutes. It would help to give users some indication of the progress so they know how long they can expect it to take.
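A partial stopgap in the meantime (it only covers the compute step, not the file-opening phase discussed above, and only works with the local, non-distributed schedulers) is to wrap the eventual load in dask's ProgressBar:

from dask.diagnostics import ProgressBar

darray = cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
                            start_time='1958-01-02 00:00:00', end_time='1969-01-01 00:00:00',
                            frequency='1 daily')

# Reports progress of the dask computation only, not of open_mfdataset itself.
with ProgressBar():
    data = darray.load()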