COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Show progress when loading data #196

Open aidanheerdegen opened 4 years ago

aidanheerdegen commented 4 years ago

For some datasets getvar can take quite a long time, up to several minutes. It would help to give users some indication of progress so they know how long to expect it to take.

angus-g commented 4 years ago

The slow bit is probably xarray's coordinate compatibility checking (you could test this by passing compat="override" to getvar()). If we're more sure that grids match for a given query, we could be less conservative and pass this option by default. In my opinion, getvar shouldn't ever be a long enough operation that it requires progress indication.

aidanheerdegen commented 4 years ago

I gave this a crack. Doesn't seem to make much difference:

%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session, 
                    start_time='1958-01-02 00:00:00', end_time='2019-01-01 00:00:00', frequency='1 daily')
CPU times: user 36.4 s, sys: 40.2 s, total: 1min 16s
Wall time: 38.1 s

%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session, 
                    start_time='1958-01-02 00:00:00', end_time='2019-01-01 00:00:00', 
                   frequency='1 daily', compat='override', coords='minimal')
CPU times: user 32.5 s, sys: 34.8 s, total: 1min 7s
Wall time: 33.6 s

cf.

%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session, 
                    start_time='1958-01-02 00:00:00', end_time='1959-01-01 00:00:00', frequency='1 daily')
CPU times: user 639 ms, sys: 398 ms, total: 1.04 s
Wall time: 1.12 s

%%time
cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session, 
                    start_time='1958-01-02 00:00:00', end_time='1969-01-01 00:00:00', frequency='1 daily')
CPU times: user 6.32 s, sys: 6.41 s, total: 12.7 s
Wall time: 6.49 s

Note that the default coords='different' throws an error. I also tried coords='all' but it didn't make any difference in this case.

angus-g commented 4 years ago

Okay, this actually seems to be scheduler-related. Without a distributed scheduler (i.e. without creating a Client), it seems faster to use parallel=False on the open_mfdataset call in getvar() (this can't be passed through kwargs at the moment); in this case I got about 12 seconds for the first query. I don't seem to be able to use the threaded scheduler on a login node without segfaulting, but using parallel=True with the multiprocessing scheduler:

with dask.config.set(scheduler="processes"):
    ...

It takes something like 90 seconds, so there's probably some kind of serialisation/lock contention/GIL/alien overhead here...
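
For reference, a sketch of what that test looks like end-to-end, assuming the wrapped call is just the original query (session as created elsewhere in this thread):

import dask
import cosima_cookbook as cc

with dask.config.set(scheduler="processes"):
    ds = cc.querying.getvar('01deg_jra55v140_iaf', 'salt_global_ave', session,
                            start_time='1958-01-02 00:00:00',
                            end_time='2019-01-01 00:00:00',
                            frequency='1 daily')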

On the other hand, with a modest distributed scheduler:

from dask.distributed import Client
c = Client(n_workers=8, threads_per_worker=1)

the original query takes about 3-5s, so it can speed up a little bit over the purely serial approach.

Having said that, this is a scalar variable, so it's really just hitting the file opening and serialisation stuff. Opening the temp variable from the same dataset, it's about 5s as well with coords="minimal", compat="override" (perhaps as expected, this should basically be the same as the global average). It's only about a 2s increase to do the coordinate merging.

Takeaways

We could possibly put some kind of progress bar into the load through the preprocess mechanism that we use to load only a single variable out of each dataset, which I believe already saves us a lot of time. I'm not sure how this would interact with a parallel=True load, however, because the preprocessing then goes through a future...
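
A rough sketch of how that might look, assuming the file list is available before the open_mfdataset call; ncfiles and variable are placeholder names, and the real preprocess does more than select a single variable:

import xarray as xr
from tqdm import tqdm

def open_with_progress(ncfiles, variable):
    bar = tqdm(total=len(ncfiles), desc=f"Opening {variable}")

    def preprocess(ds):
        bar.update(1)          # tick once per file opened
        return ds[[variable]]  # keep only the requested variable

    # With parallel=True the preprocess runs on dask workers, where a shared
    # tqdm counter wouldn't update, so this only works for a serial load.
    ds = xr.open_mfdataset(ncfiles, preprocess=preprocess,
                           combine="by_coords", parallel=False)
    bar.close()
    return ds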

We might be able to save some time on loading by passing parallel=True only when we have a distributed scheduler available. Shared memory schedulers seem to benefit from parallel=False.

I think we can do this check as:

try:
    kwargs["parallel"] = dask.config.get("scheduler") == "dask.distributed"
except KeyError:
    pass  # no "scheduler" key in the dask config, i.e. no distributed Client registered
aidanheerdegen commented 4 years ago

Nice detective work @angus-g

It is such a shame that it is necessary. sigh

aidanheerdegen commented 4 years ago

Another example, which seems to just fail to load:

cc.querying.getvar('01deg_jra55v140_iaf', 'frazil', session,
                    start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
                    frequency='1 daily')

It is trying to load 794 files for a pretty modest time range:

len(cc.querying._ncfiles_for_variable('01deg_jra55v140_iaf', 'frazil', session,
                    start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
                    frequency='1 daily'))
794

because the daily ice data is still in individual files. @aekiss I thought the daily ice data was being auto-collated these days?

If you tried to load the entire time series it is ... gulp ... 22280 files ...

angus-g commented 4 years ago

I reckon this second example is one where we need to use coords="minimal", compat="override". We don't need to pay the cost of reading and comparing the coordinates across all the files, especially because we're pretty confident that we've matched files that correspond to a single diagnostic, on a single grid, for a single time frequency.
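
Concretely, that's just the same query with those kwargs passed through getvar, as in the earlier timing test:

cc.querying.getvar('01deg_jra55v140_iaf', 'frazil', session,
                   start_time='1976-04-26 00:00:00', end_time='1978-06-28 00:00:00',
                   frequency='1 daily', coords='minimal', compat='override')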

For the 794 file selection, on an 8 process distributed scheduler, it took 13s wall time to load. I doubt the full timeseries would load in a reasonable amount of time, however. 22000+ files is maybe stretching it a little bit (gets to around 6 min if you just scale it out). I'm not sure it would be a very ergonomic series to work with anyway, since any operation would have to open a whole pile of files. It does sound like a good candidate to be collated...

aekiss commented 4 years ago

The daily ice files have not been collated in 01deg_jra55v140_iaf. I agree this would be a good thing to do, I guess with this script? https://github.com/aidanheerdegen/cleancosima/blob/master/aggregateiceh.sh
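
For what it's worth, a rough xarray equivalent of collating one month of dailies (just a sketch, not what that script does, and with hypothetical paths):

import xarray as xr

# hypothetical paths; the real archive layout differs
daily = xr.open_mfdataset('ice/OUTPUT/iceh.1976-04-??.nc', combine='by_coords')
daily.to_netcdf('ice/OUTPUT/iceh.1976-04-daily.nc')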

aidanheerdegen commented 4 years ago

Seems reasonable. I think that was just the commands we used once and I chucked them in a repo.

aekiss commented 4 years ago

I've now collated the daily ice outputs on 01deg_jra55v140_iaf (they are iceh.????-??-daily.nc) - will they be automatically indexed overnight?

Also I've retained the original uncollated dailies (iceh.????-??-??.nc) but will delete them after I've let the users know. When I do this will they be automatically removed from the DB by the overnight indexing?

aidanheerdegen commented 4 years ago

The new files will be indexed overnight, but leaving the unconcatenated copies in the same location will likely cause issues with queries, as the same data will then exist in two different sets of files.

When the files are removed they will be deleted from the database.

aekiss commented 4 years ago

@aidanheerdegen in that case I think I'll rename iceh.????-??-??.nc to something like iceh.????-??-??.nc-DEPRECATED until I'm sure they can be deleted - that change to the file extension will remove them from the DB, right?

aidanheerdegen commented 4 years ago

Yeah, that will stop them from being indexed, but if you're changing their name I don't see why you wouldn't just delete them (or move them to a temp directory somewhere, to delete in a week or so once you're happy the collated data is ok).

If you rename them, that'll stuff up anyone who is using the existing file paths, in which case they can change to the collated files.

aekiss commented 4 years ago

sync_data.sh in the latest configs now automatically concatenates daily ice data into monthly files before syncing to the destination (typically ik11 or cj50), where it will be indexed into the database.