fission / fission-workflows

Workflows for Fission: Fast, reliable and lightweight function composition for serverless functions
Apache License 2.0

Crash loop backoff after cancelling a workflow and reinstalling fission-workflows. #143

Open BlakeMScurr opened 6 years ago

BlakeMScurr commented 6 years ago

Hi all :)

My workflow function (in the fission-function namespace) is stuck in a long-running CrashLoopBackOff after I cancelled a workflow and reinstalled fission-workflows (which I did because I thought the workflow was hanging due to a corrupted fission environment).

$ fission fn test --name myworkflow
^C
$ helm delete fission-workflows --purge
$ helm install --wait -n fission-workflows fission-charts/fission-workflows --version 0.2.0

I noticed that the workflow function was erroring:

$ kubectl get pods --all-namespaces
fission-function   workflow-adc9c064-53d6-11e8-a99d-080027940780-klfkuk08-68clp2xb   1/2       CrashLoopBackOff   1          17s

$ kubectl logs -n fission-function workflow-adc9c064-53d6-11e8-a99d-080027940780-klfkuk08-68clp2xb -c workflow
goroutine 36 [running]:
github.com/fission/fission-workflows/pkg/types/aggregates.(*WorkflowInvocation).ApplyEvent(0x1afec8d8, 0x1b1183c0, 0x1ac26480, 0x1ac1c048)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/types/aggregates/invocation.go:95 +0x4e6
github.com/fission/fission-workflows/pkg/fes.(*SimpleProjector).project(0x9c920a0, 0x9be43a0, 0x1afec8d8, 0x1b1183c0, 0x1ad6f3c0, 0x8e7f0ef)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:33 +0xcf
github.com/fission/fission-workflows/pkg/fes.(*SimpleProjector).Project(0x9c920a0, 0x9be43a0, 0x1afec8d8, 0x1b04df10, 0x1, 0x1, 0x913ca05, 0xa)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:15 +0x5d
github.com/fission/fission-workflows/pkg/fes.Project(0x9be43a0, 0x1afec8d8, 0x1b111f10, 0x1, 0x1, 0x0, 0x0)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:8 +0x4b
github.com/fission/fission-workflows/pkg/fes.(*SubscribedCache).HandleEvent(0x1afabc50, 0x1b1183c0, 0x1, 0x1)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:167 +0x31d
github.com/fission/fission-workflows/pkg/fes.NewSubscribedCache.func1(0x9bea7e0, 0x1ac6c000, 0x1ae286c0, 0x1afabc50)
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:127 +0x21c
created by github.com/fission/fission-workflows/pkg/fes.NewSubscribedCache
        /Users/erwin/go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:114 +0x10c

Reading through the stack trace, it seems that we're trying to apply a cancel event from the event store to a nil workflow invocation in the cache. So perhaps helm delete removes the cache but not the store; does that seem correct?
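
To make sure I understand the failure mode, here is a minimal sketch of what I think is going on. This is not the actual fission-workflows code; the types and event names are made up for illustration. Running it panics with a nil pointer dereference, analogous to the trace above:

package main

type Event struct {
	Type         string
	InvocationID string
}

type Invocation struct {
	ID     string
	Status string
}

// ApplyEvent mutates the invocation; with a nil receiver the assignment
// dereferences nil, analogous to invocation.go:95 in the trace above.
func (wi *Invocation) ApplyEvent(ev Event) {
	wi.Status = ev.Type
}

func main() {
	// Fresh cache after the reinstall: the old invocation is gone.
	cache := map[string]*Invocation{}

	// Cancel event replayed from the still-populated event store.
	ev := Event{Type: "INVOCATION_CANCELED", InvocationID: "adc9c064"}

	inv := cache[ev.InvocationID] // nil: the aggregate is not in the cache
	inv.ApplyEvent(ev)            // panics: nil pointer dereference -> CrashLoopBackOff
}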

How can I manually reset the event store? I have a snapshot of a VM with fission-workflows working, so this isn't a pressing issue for me, but I thought it would be worth noting.

Kubernetes version: 1.10.2
Fission version: 0.6.0
Fission Workflows version: 0.2.0

erwinvaneyk commented 6 years ago

Hey @BlakeMScurr, sorry, I overlooked your questions at the end.

So perhaps helm delete deletes the cache but not the store, does that seem correct?

Correct, and that is by design.

How can I manually reset the event store?

kubectl -n fission get po -o name | grep nats | xargs kubectl -n fission delete

To reset workflows completely without reinstalling, simply delete the workflow pod afterwards as well; it will read the now-empty store when it restarts. I use this short script for that while developing:

#!/bin/bash

# Delete the NATS pod(s) in the fission namespace; this resets the event store.
kubectl -n fission get po -o name | grep nats | xargs kubectl -n fission delete

# Delete the workflow pod(s) so the engine restarts and reads the now-empty store.
kubectl -n fission-function get po -o name | grep workflow | xargs kubectl -n fission-function delete

Of course this is still a bug, as the workflow engine should not crash on past data/invocations. Looking at the trace, I think 0.3.0 fixes this issue.
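
Roughly the kind of defensive handling I mean, as a simplified sketch (not the actual 0.3.0 code; the types and event names are just illustrative): events for aggregates the cache no longer knows about should be logged and skipped rather than taking down the pod.

package main

import "log"

type Event struct {
	Type         string
	InvocationID string
}

type Invocation struct {
	Status string
}

func (wi *Invocation) ApplyEvent(ev Event) {
	wi.Status = ev.Type
}

type invocationCache struct {
	invocations map[string]*Invocation
}

// HandleEvent skips events for aggregates the cache does not know about,
// instead of letting a nil dereference crash the engine.
func (c *invocationCache) HandleEvent(ev Event) {
	inv, ok := c.invocations[ev.InvocationID]
	if !ok || inv == nil {
		log.Printf("skipping %s for unknown invocation %s", ev.Type, ev.InvocationID)
		return
	}
	inv.ApplyEvent(ev)
}

func main() {
	c := &invocationCache{invocations: map[string]*Invocation{}}
	// A replayed cancel event for an invocation that is no longer in the
	// cache is logged and skipped rather than crashing the pod.
	c.HandleEvent(Event{Type: "INVOCATION_CANCELED", InvocationID: "adc9c064"})
}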

If you get around to testing it, could you verify whether this bug is still present?

ghost commented 6 years ago

@erwinvaneyk I am on the latest versions of Fission (0.10) and Fission Workflows (0.5), and I get a similar (probably the same) error when I increase the concurrency:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x8347f9]

goroutine 73 [running]:
github.com/fission/fission-workflows/pkg/types/aggregates.(*WorkflowInvocation).ApplyEvent(0xc4202feb00, 0xc4202cc000, 0x16, 0x7f3b40bd76c8)
    /go/src/github.com/fission/fission-workflows/pkg/types/aggregates/invocation.go:92 +0x349
github.com/fission/fission-workflows/pkg/fes.(*SimpleProjector).project(0x1eff3a0, 0x16655c0, 0xc4202feb00, 0xc4202cc000, 0xc420638b58, 0x835a8a)
    /go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:33 +0xe7
github.com/fission/fission-workflows/pkg/fes.(*SimpleProjector).Project(0x1eff3a0, 0x16655c0, 0xc4202feb00, 0xc42054bbe8, 0x1, 0x1, 0xc4202feae0, 0x0)
    /go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:15 +0x6d
github.com/fission/fission-workflows/pkg/fes.Project(0x16655c0, 0xc4202feb00, 0xc420638be8, 0x1, 0x1, 0x0, 0x0)
    /go/src/github.com/fission/fission-workflows/pkg/fes/projectors.go:8 +0x5f
github.com/fission/fission-workflows/pkg/fes.(*SubscribedCache).ApplyEvent(0xc42061e7c0, 0xc4202cc000, 0x1, 0x1)
    /go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:218 +0x40a
github.com/fission/fission-workflows/pkg/fes.NewSubscribedCache.func1(0x16651c0, 0xc4205b8000, 0xc42042c2c0, 0xc42061e7c0)
    /go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:175 +0x276
created by github.com/fission/fission-workflows/pkg/fes.NewSubscribedCache
    /go/src/github.com/fission/fission-workflows/pkg/fes/caches.go:161 +0x16c

erwinvaneyk commented 6 years ago

@thenamly thanks for the update. From your description, your issue sounds unrelated to this one (invalid invocations preventing recovery of the engine). Can you share a few more details about your setup?

ghost commented 6 years ago

This happens pretty rarely. From what I understand, it happens just before an OOM, since it has never happened on bigger nodes.