Implement ability to discard intermediate data

rhliang commented 9 years ago

If we're going to batch jobs, we better be able to clean up the system alongside it! We're starting to feel the space crunch on the cluster right now.

[x] Pipeline serializer should represent outputs_to_delete as either a list of dataset indices or dataset names (Josh, Richard, and James all prefer names)
[x] Pipeline serializer in the backend should accept dataset names as valid form data
[x] Enhance the step creation/editing dialogue in the pipeline assembly page to specify, for each output, whether the intermediate data should be kept
[x] ~~On the View Run page, represent discarded data somehow: perhaps the cable is shaded in a colour other than green or in a different shade of green~~ Moved to new issue: #434.
[x] On the View Results page, represent discarded data somehow: perhaps a piece of text that says "discarded" where the View/Download links would ordinarily be
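
For the first two items, a rough sketch of how the backend might accept dataset names and map them to output positions. The function name `resolve_outputs_to_delete` and the 1-based indexing are assumptions for illustration, not Kive's actual serializer code:

```python
def resolve_outputs_to_delete(output_names, outputs_to_delete):
    """Map dataset names to output indices, rejecting unknown names.

    output_names: the step's output names, in order (hypothetical input).
    outputs_to_delete: the names submitted as form data.
    Indices are 1-based here, which is an assumption for this sketch.
    """
    index_by_name = {name: i for i, name in enumerate(output_names, start=1)}
    unknown = [name for name in outputs_to_delete if name not in index_by_name]
    if unknown:
        raise ValueError('Unknown output names: ' + ', '.join(unknown))
    return [index_by_name[name] for name in outputs_to_delete]
```

Rejecting unknown names up front means a typo in the form data fails loudly instead of silently keeping the output.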

Updated Description

We decided to simplify the strategy for this issue.

[x] Figure out how to purge a single dataset. Just delete the dataset record?
[x] Display a removed dataset as "removed" instead of "redacted".
[x] If we decide that the step has to be rerun, create a new exec record.
[x] Configure upper and lower limits on storage. When upper limit is exceeded, delete output datasets until you go below the lower limit. Delete oldest datasets first.
[x] Fix RunOutputsSerializer.get_input_summary() when inputs have been purged.
[x] Check that a dataset is not being used as an input for an active run before purging it.
[x] Check that orphaned files are at least an hour old before deleting them.
[x] Walk more of the pipeline to check whether the purged dataset is required. We will need to walk at least until we hit an exec record that has all of its outputs.
[x] Create another issue for a more sophisticated strategy of dataset clean up, maybe configurable by the user, maybe automated based on data size, age, and access patterns. Also consider purging log files. Maybe combine limits for sandboxes, datasets, and log files. Break up purge over multiple polling cycles. Add a button on the run results page to rerun. (Created issue #434.)
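
The orphaned-file check in the list above could look something like this. It is only a sketch: `find_deletable_orphans`, `known_paths`, and the use of modification time as the file's age are all assumptions, not Kive's implementation. The one-hour minimum presumably protects files that were just written but whose dataset record has not been saved yet.

```python
import os
import time

MIN_ORPHAN_AGE = 60 * 60  # one hour, in seconds


def find_deletable_orphans(root, known_paths, now=None, min_age=MIN_ORPHAN_AGE):
    """Yield files under root that no record refers to and that are old enough.

    known_paths: paths that dataset records still point at (hypothetical input).
    A file's age is measured from its modification time, an assumption here.
    """
    now = time.time() if now is None else now
    known = {os.path.abspath(path) for path in known_paths}
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.abspath(os.path.join(folder, filename))
            if path in known:
                continue  # still referenced by a record, so not an orphan
            if now - os.path.getmtime(path) >= min_age:
                yield path
```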

jamesnakagawa commented 9 years ago

Did part of this today. For the first point, outputs_to_delete is currently being passed as dataset names. The backend rejects these, so functionality is temporarily broken and I created a new branch for the work.

ArtPoon commented 9 years ago

Since an interface for deleting data files could become extremely complicated (for example, selecting data files associated with specific pipelines or versions), the most feasible approach may be to define global criteria, such as the creation date of intermediate data files across all pipelines or the number of times a dataset has been accessed. Let's continue the discussion offline for now.

donkirkby commented 9 years ago

After some design discussion, this is the strategy:

Purge is triggered by the fleet manager. After polling for new tasks, check the total storage used and trigger a purge, if needed.
Define two thresholds, max storage and target storage. If max is exceeded, delete the oldest output datasets until total storage drops below target.
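
A minimal sketch of the two-threshold selection, assuming each dataset exposes a size and a creation time. `StoredDataset` and `choose_datasets_to_purge` are hypothetical names for illustration, not the fleet manager's real code:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StoredDataset:
    # Hypothetical stand-in for a dataset record, not Kive's actual model.
    name: str
    size: int          # bytes on disk
    created: datetime


def choose_datasets_to_purge(datasets, max_storage, target_storage):
    """Return the datasets to delete, oldest first.

    Nothing is selected unless total usage exceeds max_storage. Once it
    does, datasets are selected until usage drops below target_storage.
    """
    total = sum(dataset.size for dataset in datasets)
    if total <= max_storage:
        return []
    to_purge = []
    for dataset in sorted(datasets, key=lambda d: d.created):
        if total < target_storage:
            break
        to_purge.append(dataset)
        total -= dataset.size
    return to_purge
```

Keeping a gap between the two thresholds means a purge frees a chunk of space at once, so it should not be triggered again on every polling cycle.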

Deleting a dataset is causing some problems when we search for exec records. I'll try looking at how we handle PipelineStep.outputs_to_delete.

jamesnakagawa commented 9 years ago

Just refreshed myself on where I left the ImplementOutputsToDelete branch. I think I'm just waiting for the backend to accept the new form data. Is someone available to help bring that up to speed?

rhliang commented 9 years ago

Sure, I can help with that.

cfe-lab / Kive

Implement ability to discard intermediate data #413
