OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)
8 stars 12 forks source link

CLI: cache outputs & Fuzzy Starts #645

Closed josephjclark closed 2 months ago

josephjclark commented 3 months ago

Short Description

This PR:

Related issue

Fixes #375

Docs

PR open at https://github.com/OpenFn/docs/pull/467

Caching

If the --cache-steps flag is passed to the CLI, every step will write its output to a local folder as a JSON file.

Caching is useful for debugging, but the real benefit comes from re-running a workflow from a fixed point.

If the --start flag is passed, the CLI will automatically load the correct input state from the cache for that step. This is is clearly logged.

Note that --cache-steps and --start are mutually exclusive. If you run --start without --cache-steps then the cached output will NOT be updated.

To work out the right input state, the CLI must find upstream step of the start job. At the moment, this is easy because in a workflow, each step can only have one input. If that rule ever changes, this part of the feature becomes more complex (I've added a comment in workflow validation to this effect).

The cache is written to a folder called .cli-cache adjacent to the workflow or job file. A sub folder will be created with the workflow name (which defaults to the file name), and a json file will be created for each step (with the step id).

So for .tmp/workflow.json you'll get a cache path something like ./.cli-cache/workflow/step-1.json.

Caching is off by default.

Setting the OPENFN_ALWAYS_CACHE_STEPS env var will default step caching ON. Disable by passing --no-cache-steps.

Fuzzy Starts

The PR also enables "fuzzy" start points to be defined.

This behaviour exists outside of the caching stuff, but the caching stuff benefits from it because it's easier to specify a start node now.

The idea here is that a) you may want to use the step name or id as the start point, depending on what your workflow json looks like; and b) if you're using a project downloaded from Lightning, the ids are a nightmare to type

TODO

Checklist before requesting a review

josephjclark commented 3 months ago

There's something funny I think in state assembly in the data key. I will investigate

josephjclark commented 3 months ago

Possible feature: generate a .gitignore file as a silbing of cli-cache to stop it getting checked in.

We should log that we are doing this

What if gitignore exists? maybe we could warn to ignore?

Is there a first class way we can do something like "git ignore .cli-cache" and let git work out where to write the path to?

josephjclark commented 3 months ago

Maybe --cache and --start are different and mutually exclusive.

--cache means "overwrite the existing cache with the last outputs"

--start means "run this workflow from this step. Oh, and unless I say otherwise, use my cached input state please.

In this model, running with --start would not override the cached data. So you can re-run a broken workflow without corrupting your cached output.

Of course, if you do --start --cache then your cached output will be overridden.

josephjclark commented 3 months ago

I am not going to implement a clear cache command. There's too much low level complexity associated with it.

If you do openfn clear-cache, it's not clear which cache. The repo cache? The metadata cache? The job cache? It's no good.

We could take a path to the cache. But is that a path to .cli-cache? The parent folder? The target workflow? Do you want to clear for all workflows or just for one workflow? Basically the path feels so ambiguous and unintuitive, it's not clear to me what it'll do.

Ok, so we can confirm the path before we do anything, but it's still an annoying command. Isn't rm -rf just easier anyway? We all know what that does.

The other option is openfn workflow.json --clear-cache, which will clear the cache associated with that workflow. But it's weird because it won't run the workflow. It's even weirder if you do the long form of openfn execute workflow.json --clear-cache. It also doesn't help you clear the cache for all workflows.

Maybe we'll come back to it later - right now it feels super high complexity for incredibly low value.

josephjclark commented 2 months ago

Comments on using name vs id for start:

Name or ID

The cache doesn't really need to be human readable, it's designed to be automated. If you want to see the output from a step, you can see it in the job output and should be able to ctrl-click it open.

Maybe we should always use step ids to save or write to the cache, because we can guarantee an id is present, safe, and lightning-compatible. But when you specify "start", we should try to:

Any fuzzy match would throw if the input is ambiguous.

josephjclark commented 2 months ago

Just a heads up @mtuchi, I've renamed --cache to --cache-steps

josephjclark commented 2 months ago

I think we're done here!

It's not particularly well tested, but there are a few basic tests. I think it's a reasonable place to leave it.

josephjclark commented 2 months ago

Hi @mtuchi, when you've got a spare half hour can you take another look at this, maybe run a couple of tests? Thanks!

mtuchi commented 2 months ago

@josephjclark i have tested the --cache-steps and --start step-id. Everything work as expected I have also tested openfn workflow.json --clear-cache. It didn't work, I don't think we need it honestly. It's very easy to clear .cli-cache if i needed

josephjclark commented 2 months ago

Hi @mtuchi - last one I hope!

This PR now supports --end and --only

I plan to merge and release this in the morning. Let me know if you get a chance to test it out

josephjclark commented 2 months ago

Investigate: if no start is passed, don't try and load cached state

Update: Yes, this a problem. I need to re-think the way I treat the start/end/only flags and how they feed into the options object.

josephjclark commented 2 months ago

TODO (commenting before I forget) - update the readme here. Update the docs PR too