ai2cm / fv3config

Manipulate FV3GFS run directories
Apache License 2.0

Make the fv3run functions lazy #60

Closed · nbren12 closed this issue 3 years ago

nbren12 commented 4 years ago

The 3 fv3run functions each have different behaviors when it comes to waiting for the job to complete.

  1. run_native, run_docker will block until execution is complete
  2. run_k8s will not block

Behavior #1 is not the most convenient for interactive use, e.g. performing multiple run_docker runs within a single jupyter notebook. In this context, it would be more flexible to return a subprocess.Popen object and let the user decide whether to wait for it to complete.

Behavior #2 is great for one-off jobs, but it would be nice to be able to wait for completion e.g. with the one_step jobs.

To resolve this difference in behavior, I think all the run_ functions should immediately return some sort of "promise" object that can be polled for completion, or be used to interact with the job in some simple way. There are several implementation strategies I can think of:

- return a generator, e.g.:

  ```python
  generator = run_native(config)  # returns immediately
  next(generator)  # blocks until check_call completes
  ```

- use some kind of "promise" object, e.g. `dask.delayed` or a python [Future](https://docs.python.org/3/library/asyncio-future.html). For `run_native` and `run_docker` we could simply return the `subprocess.Popen` object. Then for `run_kubernetes` we could implement an object with the same interface as `subprocess.Popen`. This interface would have methods for waiting and polling for completion, and possibly for interacting with the logging streams. A rough usage sketch follows below.
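To make the proposal concrete, here is a usage sketch of the promise-style interface; the argument names and the `poll`/`wait` methods are placeholders borrowed from `subprocess.Popen`, not the actual fv3config signatures:

```python
# Hypothetical usage sketch; argument names are placeholders, not the real
# fv3config signatures.  Each run_* call would return immediately with a
# Popen-like handle instead of blocking until the run finishes.
handles = [run_docker(config, outdir) for config, outdir in experiments]

# ... do other work in the notebook while the runs execute ...

for handle in handles:
    handle.wait()         # block until this run finishes
    print(handle.poll())  # 0 on success, nonzero on failure, like Popen
```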

@brianhenn and @frodre might be interested in this for the orchestration.
mcgibbon commented 4 years ago

Overall lazy submission is a good idea. Two main thoughts on this:

This is a change that would break any workflow currently calling run_native or run_docker. The ways around that are either to update all of those calls, or to implement this in a backwards-compatible way. For example, we could add submit_native and submit_docker routines with this behavior, rename run_kubernetes to submit_kubernetes, and have run_kubernetes call submit_kubernetes with a deprecation warning. We'd remove run_kubernetes at a later time. It may be worth keeping the other run commands around for convenience.
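A minimal sketch of that backwards-compatible shim, assuming `submit_kubernetes` takes the same arguments as the existing `run_kubernetes`:

```python
# Sketch only: the old name keeps working but warns, and delegates to the
# new lazy submit_kubernetes (assumed to share run_kubernetes's signature).
import warnings


def run_kubernetes(*args, **kwargs):
    warnings.warn(
        "run_kubernetes is deprecated; use submit_kubernetes instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return submit_kubernetes(*args, **kwargs)
```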

I don't know how you'll implement this for kubernetes. I looked at it when I was writing run_kubernetes, and at the time decided job monitoring was better left to the Google Cloud Console because the python tools for it seemed either hard to use or not there. It could be worth another look.

It has no bearing on whether to do these python interface changes, but I wouldn't change the fv3run command line behavior.

mcgibbon commented 4 years ago

Even if we plan to fully remove the run_ routines and move over to this new behavior, I'd still suggest moving over to new names and having a transition period where calling the old routines gives a deprecation warning.

nbren12 commented 4 years ago

I like the idea of changing the names.

As far as implementation goes, Stephen Daspit implemented a loop-and-check-for-completion approach in his original prototyping of the one_step pipeline: https://github.com/VulcanClimateModeling/fv3net/blob/50548c69ece531d917c9420aa8cec4852cea0574/workflows/rerun-fv3/scale.sh#L18

mcgibbon commented 4 years ago

I guess the approach then could be to submit the job as before, and then return a subprocess object that has called a bash script waiting to return based on this kubectl command? That could work. It might need some modification to be able to give an error if the job fails. The issue I ran into was that there were no inside-python tools for this; I hadn't thought of calling kubectl.
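For illustration, a sketch of that idea in python, assuming `kubectl` is on the PATH and the name of the submitted Job is known. Note that `kubectl wait --for=condition=complete` only exits nonzero when the timeout expires, so reporting an outright job failure promptly would still need the extra handling mentioned above:

```python
# Sketch only: wrap `kubectl wait` in a Popen so the caller gets the usual
# subprocess interface back from a kubernetes submission.
import subprocess


def kubectl_wait_handle(job_name, namespace="default", timeout="24h"):
    # Returns a Popen that exits 0 once the Job's Complete condition is met,
    # or nonzero if the timeout expires first.
    return subprocess.Popen(
        [
            "kubectl", "wait",
            "--for=condition=complete",
            f"job/{job_name}",
            "--namespace", namespace,
            f"--timeout={timeout}",
        ]
    )
```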

nbren12 commented 4 years ago

Huh, I am surprised there is no kubernetes API object for this. Since the python k8s api simply resolves everything to HTTP requests to the k8s server, I think we could use the `requests` library in python to query for completion. As usual, the main problem is authentication, but hopefully we can leverage the k8s python library for that.

mcgibbon commented 4 years ago

I was also surprised I didn't find something in the kubernetes package for this. There might be something I missed. I think it's somewhat telling though that e.g. this kubernetes monitoring app written in flask/python uses kubectl calls instead of the kubernetes package.

nbren12 commented 4 years ago

Okay. I figured out how to do this with the k8s python api. The batch_v1 api has a read_namespaced_job method that can get the run status. I wrote an example here: https://gist.github.com/nbren12/7196d56f782947d454cbdd676f1ea8de
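A condensed sketch along those lines (not a copy of the gist): poll `BatchV1Api.read_namespaced_job` until the job's status reports success or failure.

```python
# Sketch of polling the batch_v1 API for job completion with the kubernetes
# python client; assumes local kubeconfig credentials.
import time
from kubernetes import client, config


def wait_for_complete(job_name, namespace="default", poll_interval=10):
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    while True:
        status = batch_v1.read_namespaced_job(job_name, namespace).status
        if status.succeeded:
            return
        if status.failed:
            raise RuntimeError(f"job {job_name} failed")
        time.sleep(poll_interval)
```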

mcgibbon commented 4 years ago

Nice!

nbren12 commented 3 years ago

We discussed removing this functionality altogether, so I will close the issue.