Closed — nbren12 closed this issue 3 years ago
Overall lazy submission is a good idea. Two main thoughts on this:
This is a change that would break any workflow currently calling `run_native` or `run_docker`. The ways around that are either to update all of those calls, or to implement this in a backwards-compatible way. For example, we could add `submit_native` and `submit_docker` routines with this behavior, rename `run_kubernetes` to `submit_kubernetes`, and have `run_kubernetes` call `submit_kubernetes` with a deprecation warning. We'd remove `run_kubernetes` at a later time. It may be worth keeping the other `run` commands around for convenience.
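A minimal sketch of that backwards-compatible shim, assuming a hypothetical `submit_native(config, outdir)` that launches the run and returns immediately (the `echo` command below is a stand-in for the real model invocation):

```python
import subprocess
import warnings

def submit_native(config, outdir):
    # Hypothetical lazy routine: launch the run and hand the live
    # Popen object back to the caller without blocking.
    return subprocess.Popen(["echo", "run", outdir], stdout=subprocess.DEVNULL)

def run_native(config, outdir):
    # Backwards-compatible wrapper: preserves the old blocking
    # behavior, but warns callers to migrate to submit_native.
    warnings.warn(
        "run_native is deprecated; use submit_native instead",
        DeprecationWarning,
        stacklevel=2,
    )
    proc = submit_native(config, outdir)
    proc.wait()
    return proc
```

Existing callers keep working unchanged during the transition, and the warning points them at the new name before the old one is removed.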
I don't know how you'll implement this for kubernetes. I looked into it when I was writing run_kubernetes, and at the time decided job monitoring was better left to the Google Cloud Console, because the python tools for it seemed either hard to use or nonexistent. It could be worth another look.
It has no bearing on whether to do these python interface changes, but I wouldn't change the `fv3run` command line behavior.
Even if we plan to fully remove the `run_` routines and move over to this new behavior, I'd still suggest moving over to new names and having a transition period where calling the old routines gives a deprecation warning.
I like the idea of changing the names.
As far as implementation goes, Stephen Daspit implemented a loop-and-check-for-completion approach in his original prototyping of the one_step pipeline: https://github.com/VulcanClimateModeling/fv3net/blob/50548c69ece531d917c9420aa8cec4852cea0574/workflows/rerun-fv3/scale.sh#L18
I guess the approach then could be to submit the job as before, and then return a subprocess object that has invoked a bash script which waits on this kubectl command before returning? That could work. It might need some modification to raise an error if the job fails. The issue I ran into was that there were no in-python tools for this; I hadn't thought of calling kubectl.
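A sketch of that kubectl-based approach, assuming `kubectl` is on the PATH. Note that `kubectl wait --for=condition=complete` only detects success, so a failed job would surface as a timeout rather than an error — the modification mentioned above:

```python
import subprocess

def kubectl_wait_cmd(job_name, namespace="default", timeout="3600s"):
    # Build the kubectl invocation separately so it can be inspected.
    return [
        "kubectl", "wait", f"--timeout={timeout}",
        "--for=condition=complete", f"job/{job_name}",
        "--namespace", namespace,
    ]

def wait_for_job(job_name, **kwargs):
    # Blocks until the job reaches the complete condition; kubectl
    # exits nonzero on timeout, which check=True raises as an error.
    return subprocess.run(kubectl_wait_cmd(job_name, **kwargs), check=True)
```

The returned `subprocess.CompletedProcess` (or a `Popen` around the same command) is what the proposed `submit_kubernetes` could hand back to callers who want to block on completion.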
Huh, I am surprised there is no kubernetes API object for this. Since the python k8s api simply resolves everything to HTTP requests to the k8s server, I think we could use `requests` in python to query for completion. As usual, the main problem is authentication, but hopefully we can leverage the k8s python library for that.
I was also surprised I didn't find something in the kubernetes package for this. There might be something I missed. I think it's somewhat telling though that e.g. this kubernetes monitoring app written in flask/python uses kubectl calls instead of the kubernetes package.
Okay, I figured out how to do this with the k8s python api. The `batch_v1` api has a `read_namespaced_job` method that can get the run status. I wrote an example here: https://gist.github.com/nbren12/7196d56f782947d454cbdd676f1ea8de
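A rough sketch of that polling approach (not the gist verbatim): `job_state` interprets the `V1JobStatus` object that `batch_v1.read_namespaced_job(name, namespace).status` returns, and the loop takes a callable so it can be exercised without a cluster — the lambda in the comment is the assumed wiring.

```python
import time

def job_state(status):
    # Interpret a V1JobStatus-like object, i.e. the .status field of
    # batch_v1.read_namespaced_job(name, namespace). succeeded/failed
    # hold pod counts and are None until the job reaches that state.
    if getattr(status, "succeeded", None):
        return "complete"
    if getattr(status, "failed", None):
        return "failed"
    return "running"

def wait_for_job(read_status, poll_seconds=10, sleep=time.sleep):
    # read_status is a zero-argument callable, e.g.
    #   lambda: batch_v1.read_namespaced_job(name, ns).status
    # Injecting it (and sleep) keeps the loop testable offline.
    while True:
        state = job_state(read_status())
        if state == "failed":
            raise RuntimeError("job failed")
        if state == "complete":
            return
        sleep(poll_seconds)
```

Unlike the `kubectl wait` approach, this distinguishes failure from a job that is merely still running.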
Nice!
We discussed removing this functionality altogether, so I will close the issue.
The 3 fv3run functions each have different behaviors when it comes to waiting for the job to complete.
Behavior #1 is not the most convenient for interactive use, e.g. performing multiple `run_docker` runs within a single jupyter notebook. In this context, it would be more flexible to return a `subprocess.Popen` object and allow the user to wait for it to complete or not. Behavior #2 is great for one-off jobs, but it would be nice to be able to wait for completion, e.g. with the one_step jobs.
To resolve this difference in behavior, I think all the `run_` functions should immediately return some sort of "promise" object that can be polled for completion, or be used to interact with the job in some simple way. There are many implementation strategies I can think of; for instance, the `run_` functions could be generators:

```python
generator = run_native(config)  # returns immediately
next(generator)  # blocks until check_call completes
```
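One hedged sketch of that generator idea, with `echo` standing in for the real model command: creating the generator is free, the first `next()` submits the job and yields the live process, and the second blocks for completion.

```python
import subprocess

def run_native(config):
    # Hypothetical generator form of run_native. The body does not
    # execute until the first next(), which launches the job and
    # yields the Popen handle so the caller can monitor or kill it.
    proc = subprocess.Popen(["echo", "fv3run"], stdout=subprocess.DEVNULL)
    yield proc
    # The second next() blocks until the run finishes and yields
    # the return code.
    yield proc.wait()
```

Example usage:

```python
gen = run_native(config)  # returns immediately, nothing submitted yet
proc = next(gen)          # job is now running; proc is a Popen
retcode = next(gen)       # blocks until completion
```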