allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0
404 stars 133 forks source link

Add a YAML file for provisioning a GPU Kubernetes machine. #379

Closed schmmd closed 7 years ago

schmmd commented 7 years ago

I have been using this YAML file to create a development environment for deep_qa on our Kubernetes cluster.

schmmd commented 7 years ago

@DeNeutoy did you come up with a different solution here? I know you and Chloe were working on this yesterday.

matt-gardner commented 7 years ago

What do you mean by "development environment"?

schmmd commented 7 years ago

@matt-gardner I have used this to provision a machine on the Kubernetes cluster that uses the Dockerfile provided at the root to run the application on a machine with GPU support properly configured. I think you and Mark N use AMIs to get such an environment--are you able to point me to any scripts that define how those AMIs are built?

The Kubernetes YAML file could be tweaked slightly to also run jobs on the kubernetes cluster.

I'd like to modify the README to add some instructions on how to run an example application, but I have not been entirely successful doing that yet--I'm eager for https://github.com/allenai/deep_qa/pull/377 to merge.

DeNeutoy commented 7 years ago

This looks good, there were a few gotchas with running stuff on Kube that I wasn't aware of, which I think @ca16 is addressing in the infrastructure readmes.

I think it would be good for this to be added to the run_on_aws.sh script or an alternative run_on_kube.sh script, with the yaml file being created on the fly with the user only providing the required fields, eg name=markn, num_gpus=25, job_name=my-great-job, with the docker image being generated the way it currently is in run_on_aws.sh. Is there a way to supply the yaml to kube as series of command line arguments? Reading in this default yaml file, adding some arguments defined by the user and then running would probably be a good workflow for running experiments.

schmmd commented 7 years ago

I think it would be good for this to be added to the run_on_aws.sh script or an alternative run_on_kube.sh script

I agree--and I think Aristo is planning on providing run_on_kube.sh and deprecating run_on_aws.sh. I just wanted to share what I had so far so we could build on it to make such a script.

Is there a way to supply the yaml to kube as series of command line arguments?

There is (example: kubectl run -i --tty busybox --image=busybox --restart=Never -- sh)--but it seems to have some limitations and I wasn't able to figure out how to pass some configuration. @jkinkead would know more--I'm certainly no expert. I think his plan ultimately is to create python scripts that generate kubernetes yamls.

jkinkead commented 7 years ago

There's a lot of configuration that can only be passed through via files. You can use JSON instead of YAML, but since that doesn't support comments, it's not really recommended.

DeNeutoy commented 7 years ago

I guess this could be done fairly easily in the meantime using sed or something, given we only want to replace a few variables. We could copy a default yaml file with obvious identifiers like NUM_GPUS, replace these with user provided values and then run it.

jkinkead commented 7 years ago

Or python:

TEMPLATE="""
yaml:
  - user: {user}
  - numGpus: {numGpus}
"""

print(TEMPLATE.format(user='markn', numGpus=2))
matt-gardner commented 7 years ago

Ok, yeah, this looks fine to me. I was confused when you said "development environment", because I use that to mean a place I write code, and I was confused about writing code on the kubernetes cluster. I agree with Mark's comments.

schmmd commented 7 years ago

I'm not sure what your merge etiquette is, but since this cannot have a negative impact I'll go ahead and merge it based on your "looks fine to me".

I might have been a bit confused when I used my term "developer environment" because it took me a while to get anything in this project to run, but I'm having a lot of luck with Mark N's recent PR.

matt-gardner commented 7 years ago

The way I've been doing merge etiquette - if a PR is approved and passes CI, it's good to merge. I'll often approve a PR with a few easy things to fix, if I trust the person to actually fix those things before merging. I didn't actually approve the PR here because I thought you were going to add a script like Mark and Jesse were talking about, but it's just fine to do that in another PR later (especially as I notice your note that aristo-eng was probably going to do this later anyway), so merging is fine here.

schmmd commented 7 years ago

@matt-gardner thanks, I'll wait for the explicit reviewer approval next time.