cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

File housekeeping utility. #1159

Open hjoliver opened 9 years ago

hjoliver commented 9 years ago

Cylc really needs a built-in file housekeeping utility for archiving (by copy or move) and deleting date-time labeled files and directories older than some offset from the current cycle point.

The old cylc housekeeping command was removed at cylc-6 because it wasn't ISO 8601 compatible, and it had a serious deficiency that I had never got around to addressing: it was unable to match individual files below a date-time labeled directory. Aside from that it was quite nice in some respects: it was controlled by simple config files, and it performed its configured operations in parallel.

For cylc-6+ a general housekeeping utility can no longer assume a simple fixed format cycle time (see #1158). It would have to be aware of the suite's cycle point format (actually it's worse than this - a suite using cycle point format CCYY-MM-DDTHH could still choose to use filenames containing CCYYMMDDHH for compatibility with external systems, for example).
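To illustrate the format mismatch described above, here is a minimal sketch of re-expressing a cycle point in a different filename label format. The `FORMATS` mapping and the `relabel` function are hypothetical names made up for this example, not part of cylc, and only the two formats mentioned in this comment are covered.

```python
from datetime import datetime

# Hypothetical mapping from the format names used in this thread to
# strptime/strftime patterns. Only these two formats are handled.
FORMATS = {
    "CCYY-MM-DDTHH": "%Y-%m-%dT%H",
    "CCYYMMDDHH": "%Y%m%d%H",
}


def relabel(point: str, suite_format: str, file_format: str) -> str:
    """Re-express a cycle point string in the format used in filenames."""
    parsed = datetime.strptime(point, FORMATS[suite_format])
    return parsed.strftime(FORMATS[file_format])
```

A housekeeping config would then need to state both formats, for example `relabel("2015-03-01T06", "CCYY-MM-DDTHH", "CCYYMMDDHH")` to find the label that external files carry for that cycle point.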

At NIWA we currently use a (very non-general) in-house shell script for housekeeping. @matthewrmshin - how is this handled at the Met Office?

hjoliver commented 9 years ago

UPDATE: a-ha, rose_arch and rose_prune! I had thought Rose could do it but at last look I was expecting a command rather than a built-in app, so I missed it. No doubt this is cylc-6 compatible. I presume this comes under the category of functionality that should be moved into cylc? (not that I'm trying to steal all your stuff!).

It looks like these built-in apps do not handle files "older than" (as opposed to "at") the cycle point offset, but that doesn't really matter. In my old utility, I was trying to automatically handle the case of changing to a smaller offset mid-run. That's obviously more difficult than matching a single specific cycle point: it requires a regex that matches any cycle point, which now depends on the format in use, or else it has to match all possible formats.
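As a rough sketch of the "older than" case, assuming the label format is known in advance: pair each format with a regex that matches any cycle point in that format, then compare the parsed label against the cutoff. The `LABELS` table and `older_than` function are hypothetical and are not how rose_prune or the old cylc housekeeping command worked.

```python
import re
from datetime import datetime, timedelta

# Hypothetical: one (regex, strptime pattern) pair per supported label format.
LABELS = {
    "CCYY-MM-DDTHH": (r"\d{4}-\d{2}-\d{2}T\d{2}", "%Y-%m-%dT%H"),
    "CCYYMMDDHH": (r"\d{10}", "%Y%m%d%H"),
}


def older_than(names, label_format, current_point, offset):
    """Return the names whose label is earlier than current_point - offset."""
    pattern, strp = LABELS[label_format]
    regex = re.compile(pattern)
    cutoff = current_point - offset
    selected = []
    for name in names:
        match = regex.search(name)
        if match is None:
            continue  # no date-time label in this name
        try:
            labeled = datetime.strptime(match.group(0), strp)
        except ValueError:
            continue  # looks like a label but is not a valid date-time
        if labeled < cutoff:
            selected.append(name)
    return selected
```

Because the comparison is against a cutoff rather than a single point, reducing the offset mid-run simply selects more files on the next pass.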

matthewrmshin commented 9 years ago

No doubt this is cylc-6 compatible.

Yes, rose_prune in the latest Rose release is tested with cylc 6.

I presume this comes under the category of functionality that should be moved into cylc?

I think most of rose_prune can move to cylc.

It is less clear to me whether we should move rose_arch into cylc or not.

matthewrmshin commented 8 years ago

In the latest version of rose_prune, we have removed any Rose-specific functionality, so it is safe to say that we can migrate all its functionality across. (This is as long as we are able to provide a compatibility layer that is transparent to users. I'll follow up on this soon.)

matthewrmshin commented 8 years ago

A quick brain dump...

It should be relatively straightforward to move the job log housekeeping functionality from rose_prune to cylc. rose_prune does the following:

In cylc, we can also do:

arjclark commented 8 years ago

We should also be able to housekeep the contents of the databases, as they can become overly large over time.

hjoliver commented 8 years ago

@arjclark - yes, DB housekeeping would be good. Maybe just deleting entries beyond some configurable cutoff would do.
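Something like the following sketch, assuming a SQLite table with a `cycle` column holding fixed-width cycle point labels (so that string comparison matches chronological order). The table name `task_events` and the function `prune_before` are placeholders for illustration, not a statement about the actual suite DB schema.

```python
import sqlite3

# Hypothetical table name; it must have a "cycle" text column.
TABLE = "task_events"


def prune_before(conn: sqlite3.Connection, cutoff_cycle: str) -> int:
    """Delete rows with a cycle label before cutoff_cycle; return the count."""
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            f"DELETE FROM {TABLE} WHERE cycle < ?", (cutoff_cycle,)
        )
    return cursor.rowcount
```

The cutoff would be computed from the current cycle point and the configured offset before calling this.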

matthewrmshin commented 8 years ago

DB tables we can housekeep:

DB tables we cannot housekeep:

DB tables we may be able to housekeep:

See also #1827.

hjoliver commented 8 years ago

[meeting] we agreed:

(need to be careful of any clash between DB and file housekeeping offsets)

hjoliver commented 8 years ago

NIWA operations reports that (at older cylc versions) db locking issues were strongly correlated with the size of the suite db (presumably because read times became significantly longer, perhaps on a slow filesystem). They used to wipe a db and restart the suite from scratch occasionally, which would fix the problem. This isn't an issue now with our robust lock recovery mechanism, but if db ops do (or can) slow significantly with db size, then automatic housekeeping would be a good thing.

arjclark commented 8 years ago

@benfitzpatrick - the above comment looks related to your rose bush timings investigations

benfitzpatrick commented 8 years ago

We think we can find at least a factor of 2 speed-up for jobs and cycle views in Rose Bush, which I assume is always the dominant reader of the public database. Bigger databases are slower...

matthewrmshin commented 5 years ago

Had a quick discussion with @dpmatthews. A lot of disk usage comes from large job log files and the large number of job logs per task submission. It may be worthwhile to housekeep them more aggressively. E.g.:

oliver-sanders commented 3 years ago

Note: if it is hard to keep rose_prune functional after the platforms change, this may have to get fast-tracked to Cylc 8.