Closed josephjclark closed 2 months ago
There's something funny I think in state assembly in the data key. I will investigate
Possible feature: generate a .gitignore file as a silbing of cli-cache to stop it getting checked in.
We should log that we are doing this
What if gitignore exists? maybe we could warn to ignore?
Is there a first class way we can do something like "git ignore .cli-cache" and let git work out where to write the path to?
Maybe --cache
and --start
are different and mutually exclusive.
--cache
means "overwrite the existing cache with the last outputs"
--start
means "run this workflow from this step. Oh, and unless I say otherwise, use my cached input state please.
In this model, running with --start
would not override the cached data. So you can re-run a broken workflow without corrupting your cached output.
Of course, if you do --start --cache
then your cached output will be overridden.
I am not going to implement a clear cache command. There's too much low level complexity associated with it.
If you do openfn clear-cache
, it's not clear which cache. The repo cache? The metadata cache? The job cache? It's no good.
We could take a path to the cache. But is that a path to .cli-cache? The parent folder? The target workflow? Do you want to clear for all workflows or just for one workflow? Basically the path feels so ambiguous and unintuitive, it's not clear to me what it'll do.
Ok, so we can confirm the path before we do anything, but it's still an annoying command. Isn't rm -rf
just easier
anyway? We all know what that does.
The other option is openfn workflow.json --clear-cache
, which will clear the cache associated with that workflow. But it's weird because it won't run the workflow. It's even weirder if you do the long form of openfn execute workflow.json --clear-cache
. It also doesn't help you clear the cache for all workflows.
Maybe we'll come back to it later - right now it feels super high complexity for incredibly low value.
Comments on using name vs id for start:
The cache doesn't really need to be human readable, it's designed to be automated. If you want to see the output from a step, you can see it in the job output and should be able to ctrl-click it open.
Maybe we should always use step ids to save or write to the cache, because we can guarantee an id is present, safe, and lightning-compatible. But when you specify "start", we should try to:
--start 2
would use the third item of the steps array, I guess, unless we wanted to make it 1-based)Any fuzzy match would throw if the input is ambiguous.
Just a heads up @mtuchi, I've renamed --cache
to --cache-steps
I think we're done here!
It's not particularly well tested, but there are a few basic tests. I think it's a reasonable place to leave it.
Hi @mtuchi, when you've got a spare half hour can you take another look at this, maybe run a couple of tests? Thanks!
@josephjclark i have tested the --cache-steps
and --start step-id
. Everything work as expected
I have also tested openfn workflow.json --clear-cache
. It didn't work, I don't think we need it honestly. It's very easy to clear .cli-cache
if i needed
Hi @mtuchi - last one I hope!
This PR now supports --end
and --only
I plan to merge and release this in the morning. Let me know if you get a chance to test it out
Investigate: if no start is passed, don't try and load cached state
Update: Yes, this a problem. I need to re-think the way I treat the start/end/only flags and how they feed into the options object.
TODO (commenting before I forget) - update the readme here. Update the docs PR too
Short Description
This PR:
--cache-steps
is passed--start
if input is not explicitly passed in--end
and--only
as CLI argumentsRelated issue
Fixes #375
Docs
PR open at https://github.com/OpenFn/docs/pull/467
Caching
If the
--cache-steps
flag is passed to the CLI, every step will write its output to a local folder as a JSON file.Caching is useful for debugging, but the real benefit comes from re-running a workflow from a fixed point.
If the
--start
flag is passed, the CLI will automatically load the correct input state from the cache for that step. This is is clearly logged.Note that
--cache-steps
and--start
are mutually exclusive. If you run--start
without--cache-steps
then the cached output will NOT be updated.To work out the right input state, the CLI must find upstream step of the start job. At the moment, this is easy because in a workflow, each step can only have one input. If that rule ever changes, this part of the feature becomes more complex (I've added a comment in workflow validation to this effect).
The cache is written to a folder called
.cli-cache
adjacent to the workflow or job file. A sub folder will be created with the workflow name (which defaults to the file name), and a json file will be created for each step (with the step id).So for
.tmp/workflow.json
you'll get a cache path something like./.cli-cache/workflow/step-1.json
.Caching is off by default.
Setting the
OPENFN_ALWAYS_CACHE_STEPS
env var will default step caching ON. Disable by passing--no-cache-steps
.Fuzzy Starts
The PR also enables "fuzzy" start points to be defined.
This behaviour exists outside of the caching stuff, but the caching stuff benefits from it because it's easier to specify a start node now.
The idea here is that a) you may want to use the step name or id as the start point, depending on what your workflow json looks like; and b) if you're using a project downloaded from Lightning, the ids are a nightmare to type
TODO
cli/src/util/cache.ts
Add an API to clear the cacheI am not going to do this, its actually easier for everyone torm -rf ./cli-cache
--start
to fuzzy match against step names and ids, for good UX--cache
and--start
mutually exclusive (probably)Checklist before requesting a review