CLI: cache outputs & Fuzzy Starts

josephjclark commented 3 months ago

Short Description

This PR:

Writes the output of each step/job to disk if --cache-steps is passed
Uses cache data as input when running with --start if input is not explicitly passed in
Supports "fuzzy" matching on start node names (ie, pass a partial node id or name and it'll match)
Supports --end and --only as CLI arguments

Related issue

Fixes #375

Docs

PR open at https://github.com/OpenFn/docs/pull/467

Caching

If the --cache-steps flag is passed to the CLI, every step will write its output to a local folder as a JSON file.

Caching is useful for debugging, but the real benefit comes from re-running a workflow from a fixed point.

If the --start flag is passed, the CLI will automatically load the correct input state from the cache for that step. This is is clearly logged.

Note that --cache-steps and --start are mutually exclusive. If you run --start without --cache-steps then the cached output will NOT be updated.

To work out the right input state, the CLI must find upstream step of the start job. At the moment, this is easy because in a workflow, each step can only have one input. If that rule ever changes, this part of the feature becomes more complex (I've added a comment in workflow validation to this effect).

The cache is written to a folder called .cli-cache adjacent to the workflow or job file. A sub folder will be created with the workflow name (which defaults to the file name), and a json file will be created for each step (with the step id).

So for .tmp/workflow.json you'll get a cache path something like ./.cli-cache/workflow/step-1.json.

Caching is off by default.

Setting the OPENFN_ALWAYS_CACHE_STEPS env var will default step caching ON. Disable by passing --no-cache-steps.

Fuzzy Starts

The PR also enables "fuzzy" start points to be defined.

If the start node exactly matches a step id, that step will be used as the start
If the start node partially matches a step id OR NAME, that step will be used
If the start fuzzily matches multiple steps, an error will be thrown.

This behaviour exists outside of the caching stuff, but the caching stuff benefits from it because it's easier to specify a start node now.

The idea here is that a) you may want to use the step name or id as the start point, depending on what your workflow json looks like; and b) if you're using a project downloaded from Lightning, the ids are a nightmare to type

TODO

[x] Tidy typings
[x] Add a bunch of unit and integration tests
[x] Use the logger properly in cli/src/util/cache.ts
[x] Add corresponding docs (readme and docs repo)
[x] Maybe add an env var to default cache enabled (for power users)
[x] ~~Add an API to clear the cache~~ I am not going to do this, its actually easier for everyone to rm -rf ./cli-cache
[x] Consider what happens in the event of an error
[x] Consider what happens if state is falsy (does it even matter?)
[x] See "name or id?" - I think basically we need to allow --start to fuzzy match against step names and ids, for good UX
[x] Ensure the worker will always run with the cache off (this is trivial if cache is off by default, or else may need a commit on the worker)
[x] We should probably clear the cache on every run (except the start step), so that renaming steps doesn't mess things up, or older steps which didn't run don't interfere with the output.
[x] generate a .gitignore?
[x] Make --cache and --start mutually exclusive (probably)

Checklist before requesting a review

[x] I have performed a self-review of my code
[x] I have added unit tests
[ ] Changesets have been added (if there are production code changes)

josephjclark commented 3 months ago

There's something funny I think in state assembly in the data key. I will investigate

josephjclark commented 3 months ago

Possible feature: generate a .gitignore file as a silbing of cli-cache to stop it getting checked in.

We should log that we are doing this

What if gitignore exists? maybe we could warn to ignore?

Is there a first class way we can do something like "git ignore .cli-cache" and let git work out where to write the path to?

josephjclark commented 3 months ago

Maybe --cache and --start are different and mutually exclusive.

--cache means "overwrite the existing cache with the last outputs"

--start means "run this workflow from this step. Oh, and unless I say otherwise, use my cached input state please.

In this model, running with --start would not override the cached data. So you can re-run a broken workflow without corrupting your cached output.

Of course, if you do --start --cache then your cached output will be overridden.

josephjclark commented 3 months ago

I am not going to implement a clear cache command. There's too much low level complexity associated with it.

If you do openfn clear-cache, it's not clear which cache. The repo cache? The metadata cache? The job cache? It's no good.

We could take a path to the cache. But is that a path to .cli-cache? The parent folder? The target workflow? Do you want to clear for all workflows or just for one workflow? Basically the path feels so ambiguous and unintuitive, it's not clear to me what it'll do.

Ok, so we can confirm the path before we do anything, but it's still an annoying command. Isn't rm -rf just easier anyway? We all know what that does.

The other option is openfn workflow.json --clear-cache, which will clear the cache associated with that workflow. But it's weird because it won't run the workflow. It's even weirder if you do the long form of openfn execute workflow.json --clear-cache. It also doesn't help you clear the cache for all workflows.

Maybe we'll come back to it later - right now it feels super high complexity for incredibly low value.

josephjclark commented 2 months ago

Comments on using name vs id for start:

Name or ID

The cache doesn't really need to be human readable, it's designed to be automated. If you want to see the output from a step, you can see it in the job output and should be able to ctrl-click it open.

Maybe we should always use step ids to save or write to the cache, because we can guarantee an id is present, safe, and lightning-compatible. But when you specify "start", we should try to:

parse a number and use that step in the steps array ( so --start 2 would use the third item of the steps array, I guess, unless we wanted to make it 1-based)
fuzzy match the step name (ie, match the exact string within the step name) and then map that to a step id
fuzzy match the step id, so that you can easily reference lightning ids

Any fuzzy match would throw if the input is ambiguous.

josephjclark commented 2 months ago

Just a heads up @mtuchi, I've renamed --cache to --cache-steps

josephjclark commented 2 months ago

I think we're done here!

It's not particularly well tested, but there are a few basic tests. I think it's a reasonable place to leave it.

josephjclark commented 2 months ago

Hi @mtuchi, when you've got a spare half hour can you take another look at this, maybe run a couple of tests? Thanks!

mtuchi commented 2 months ago

@josephjclark i have tested the --cache-steps and --start step-id. Everything work as expected I have also tested openfn workflow.json --clear-cache. It didn't work, I don't think we need it honestly. It's very easy to clear .cli-cache if i needed

josephjclark commented 2 months ago

Hi @mtuchi - last one I hope!

This PR now supports --end and --only

I plan to merge and release this in the morning. Let me know if you get a chance to test it out

josephjclark commented 2 months ago

Investigate: if no start is passed, don't try and load cached state

Update: Yes, this a problem. I need to re-think the way I treat the start/end/only flags and how they feed into the options object.

josephjclark commented 2 months ago

TODO (commenting before I forget) - update the readme here. Update the docs PR too

OpenFn / kit