AmadeusITGroup / workflow-controller

Kubernetes workflow controller
Apache License 2.0

Cannot get a multi-step example workflow to work #4

Closed davidc-donorschoose closed 6 years ago

davidc-donorschoose commented 7 years ago

I've cloned your project and built the master branch with the golang:1.8 image.

I can install the workflow-controller and it creates the ThirdPartyResource just fine.

Then I can create the one-step examples/hello_workflow just fine, and it runs and completes. That seems perfect. However, after I run "kubectl delete workflow/hello-workflow", a Job and Pod are left behind, so I need to find the job name and run "kubectl delete job/hello-workflow-rwdjz" before I can use "kubectl create" to run another instance of the one-step examples/hello_workflow.
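For reference, a sketch of that manual cleanup, assuming the leftover jobs keep the workflow name as a prefix (as with hello-workflow-rwdjz here):

```shell
# Delete every leftover job whose name contains the workflow name;
# kubectl's cascading delete should also remove the jobs' pods.
for job in $(kubectl get jobs -o name | grep 'hello-workflow'); do
  kubectl delete "$job"
done
```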

Unfortunately, I cannot get the examples/two_steps to run at all. It does create the Workflow, but the jobs are never created. When I use "kubectl describe ...", there is no Status section shown.

sdminonne commented 7 years ago

@davidc-donorschoose first of all, many thanks for your time. The version based on ThirdPartyResource should be considered obsolete, especially if you work with Kube 1.7 or a later version. I've just submitted PR #3, which is under review (my colleagues are having a look and we plan to merge it in the coming days). The new workflow-controller won't work on ThirdPartyResources (now deprecated in Kube); it will work on CustomResourceDefinitions (CRDs) instead.

Concerning your problems (I hope I understand correctly):

  1. The Pods and Jobs left behind should be in a completed state, with generated names; you should see them by listing all the pods and jobs.
  2. If you delete a workflow, all the associated jobs should be removed as well, and the same goes for the pods linked to those jobs (the job-controller should get rid of them).
  3. The fact that you cannot run two_steps sounds like a bug. I'm OK with trying to investigate this in the ThirdPartyResource version, but it may help if you could give the CRD version a try. Thanks for your time.
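A quick way to check points 1 and 2 after deleting a workflow is to list the remaining objects (a sketch; on the kubectl versions of that era, --show-all also displays completed pods):

```shell
# After "kubectl delete workflow/hello-workflow", anything printed here
# with a generated hello-workflow-* name was left behind.
kubectl get jobs,pods --show-all
```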

davidc-donorschoose commented 7 years ago

Thanks. I will try the CustomResourceDefinition version. I noticed you had created a pull request for that already. Should I use the workflow_CRD branch?

Also, I found that all the problems may be related to the stuff left behind and my attempt to manually clean up. I started over earlier this morning, deleting the workflow-controller and recreating it (but still on the old version of workflow-controller). Then I was able to run both examples (hello_workflow and two_steps) just fine. When I deleted them, their jobs were left behind.

(I tried to change the issue title to reflect the new problem, but couldn't.)

sdminonne commented 7 years ago

You should use the CRD version from the workflow_CRD branch. Don't hesitate to create new issues to let us know in case you run into trouble again. Thanks again.

davidc-donorschoose commented 7 years ago

With the workflow_CRD branch, deleting the workflows still leaves behind their jobs. I also notice that after I delete the workflow-controller, it leaves behind its CRD:

$ kubectl delete deployment/workflow-controller
deployment "workflow-controller" deleted
$ kubectl get crd
NAME                        KIND
workflows.dag.example.com   CustomResourceDefinition.v1beta1.apiextensions.k8s.io
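CRDs are cluster-scoped and are not owned by the deployment, so deleting the deployment leaves the CRD in place; a manual teardown sketch:

```shell
# Removing the CRD also removes any remaining Workflow objects it defines.
kubectl delete crd workflows.dag.example.com
```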

I increased logging to --v=5 when I recreated the workflow-controller deployment. When I created the one-step hello-workflow (this was the yaml on the old master branch), the workflow-controller logged these on stderr:

E0920 14:48:30.317772       1 controller.go:525] unable to update Workflow:Operation cannot be fulfilled on workflows.dag.example.com "hello-workflow": the object has been modified; please apply your changes to the latest version and try again
E0920 14:48:30.318299       1 controller.go:204] Error syncing workflow: unable to update Workflow:Operation cannot be fulfilled on workflows.dag.example.com "hello-workflow": the object has been modified; please apply your changes to the latest version and try again

Right after those log messages appeared, the job and its pod came up and ran to completion. When I deleted that workflow, it did leave its job, which I had to delete manually.

This could definitely be a problem in the job-controller (which you mentioned). This is my stack:

Does anything about my platform make you suspicious? Am I too close to the bleeding edge? I do plan to upgrade to the next stable Rancher release, which I expect very soon: rancher-v1.6.10 on kubernetes-v1.7.4. That may also include an upgrade of the bundled etcd from v2.3.7-13 to v3, and perhaps of other components.

sdminonne commented 7 years ago

Errors like "please apply your changes to the latest version and try again" are not real problems; default logging (v=2) doesn't show them. All kube controllers produce this kind of error, and workflow-controller does the same (see for example https://github.com/kubernetes/kubernetes/issues/28149). Your env looks good to me. Unluckily I know very little about Rancher, but I don't think you're too close to the bleeding edge. :) When you delete your workflow the jobs should be removed... but it's pretty green code and I'm going to have a look in the next hours. I'll use hello-workflow as an example and let you know. Thanks again for your time.

Amending: even v=2 shows them... but they are not real problems anyway :)

davidc-donorschoose commented 7 years ago

And thanks for all your work on the workflow-controller. I read all the threads in the kubernetes issues earlier this year and am excited to start using it. We are trying to migrate some batch jobs into kubernetes containers, and your workflow-controller is very promising.

Reply if you want me to do further testing of whatever versions or examples you need, or if there is any other information you want me to extract about my environment.

davidc-donorschoose commented 7 years ago

Hmm. I didn't create a ServiceAccount when I deployed my workflow-controller. Perhaps that is necessary to give it full access to something in Kubernetes? Let me know.

sdminonne commented 7 years ago

It depends. If you run workflow-controller in a pod, you need a service account (and you need to reference it in the workflow-controller pod spec). If you run it with $ ./workflow-controller --kubeconfig=$HOME/.kube/config you don't need one, since workflow-controller will use the access token from your .kube/config, i.e. the same one used by kubectl. I didn't have time to retest; I plan to do it in the next days. Sorry, I'm pretty busy with other tasks :(
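A sketch of the two modes (the ServiceAccount name here is hypothetical; in-pod deployments reference it through the standard serviceAccountName field of the pod spec):

```shell
# Out-of-cluster: reuses the credentials from your kubeconfig,
# so no ServiceAccount is needed (same access as kubectl).
./workflow-controller --kubeconfig=$HOME/.kube/config

# In-cluster (hypothetical name): create a ServiceAccount, set
# serviceAccountName on the controller's pod spec, then redeploy.
kubectl create serviceaccount workflow-controller
```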

sdminonne commented 6 years ago

@davidc-donorschoose re-reading this issue, there were plenty of topics here.

I'm going to close this; if for whatever reason you disagree, don't hesitate to open another one (possibly with an enhancement label).

Thanks for your work on this.