cschleiden / go-workflows

Embedded durable workflows for Golang similar to DTFx/Cadence/Temporal
https://cschleiden.github.io/go-workflows/
MIT License
233 stars 49 forks source link

Upgrade to `go-redis` v9 #177

Closed zaataylor closed 1 year ago

zaataylor commented 1 year ago

This PR updates the go-redis dependency to v9!

zaataylor commented 1 year ago

Looks like the Redis tests are failing due to a timeout issue 🤔 Not sure if the dependency upgrade is causing this, or if a re-run might yield different results 💭

cschleiden commented 1 year ago

Looks like the Redis tests are failing due to a timeout issue 🤔 Not sure if the dependency upgrade is causing this, or if a re-run might yield different results 💭

2023/03/01 00:19:03 |ERROR| error while polling for workflow task error=reading workflow instance: unmarshaling instance state: unexpected end of JSON input
    e2e.go:156: 
            Error Trace:    /home/runner/work/go-workflows/go-workflows/backend/redis/e2e.go:156
                                        /home/runner/work/go-workflows/go-workflows/backend/redis/e2e.go:588
            Error:          Received unexpected error:
                            workflow did not finish in time: workflow did not finish in specified timeout
            Test:           Test_EndToEndRedisBackend/SideEffect_Simple

That unmarshaling instance state: unexpected end of JSON input looks suspicious

zaataylor commented 1 year ago

It looks suspicious indeed! Thanks for catching this; I hadn't seen the entire error message while skimming through the failures in the logs. I plan to take a closer look at this soon and see if I can resolve it! 😄

EDIT: Taking a closer look now, I see that all the Test_EndToEndRedisBackend tests are failing when trying to read instance state when getting the next workflow task at: https://github.com/cschleiden/go-workflows/blob/d2b463129b7f70ebfb202d3efaf1e34abfa2fa5e/backend/redis/workflow.go#L81-L84

From what I see at the code which is called by the code above here: https://github.com/cschleiden/go-workflows/blob/d2b463129b7f70ebfb202d3efaf1e34abfa2fa5e/backend/redis/instance.go#L181-L197 it seems that somehow, the Redis pipeline GET command is not returning an error, but the result of the command is invalid JSON, most likely the empty string "". I've been able to reproduce this exactly one time locally, and I'm trying to figure out how to reproduce it again so that I can get more insight into what's going on.

One thing I do know that changed between v9 and v8 of go-redis is that Pipeline is no longer thread safe, per what I see at: https://github.com/redis/go-redis/blob/master/CHANGELOG.md#breaking. But, I'm not sure if/how this might be contributing to what I'm seeing, as I think only one pipeline command is being executed by one thread at this time. I'm still not ✨ great ✨ at determining the best ways to examine Redis changes while the tests are running (or in general, Redis noob here), so I could be wrong about this.

zaataylor commented 1 year ago

To get a better idea of what/how many commands are being executed, I'm shortly going to push some temporary changes that add fmt.Printf logging in the read instance code so I can see what's going on. If I'm able to figure out the issue after that, I'll open up a new PR with cleaner history and those changes removed. I can convert this one back to draft while I work on that if you'd like (👍 / 👎 ).

zaataylor commented 1 year ago

Well, it looks as though for some reason, the context is being canceled in the tests while the pipeline command is being executed. Check out the partial log lines output below for one of the tests, Test_EndToEndRedisBackend/SimpleWorkflow_ExpectedHistory. Notice the failure of the second get command right after the first successful command targeting the instance:

2023-03-01T16:36:12.2433283Z 2023/03/01 16:36:12 |DEBUG| Created new workflow instance 
2023-03-01T16:36:12.2434057Z 2023/03/01 16:36:12 |DEBUG| Created workflow instance      instance_id=d39622e3-5342-4a90-b518-7c097210774d execution_id=207545da-f666-4fc9-92c7-87eb048f2235
2023-03-01T16:36:12.2435389Z        Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}]
2023-03-01T16:36:12.2435997Z        Errors while executing commands: <nil>
2023-03-01T16:36:12.2436680Z        Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: ]
2023-03-01T16:36:12.2437031Z        Errors while executing commands: context canceled
2023-03-01T16:36:12.2442377Z 2023/03/01 16:36:12 |ERROR| error while polling for workflow task error=reading workflow instance: unmarshaling instance state: unexpected end of JSON input
2023-03-01T16:36:12.2442985Z value of cmd.Result() creating the error was: 

I think because the context is canceled while the pipeline command is executing, the return value of the command is effectively empty/non-existent? This could help explain why the json.Unmarshal failure is happening.

I've included the full logs from this test below:

Test_EndToEndRedisBackend/SimpleWorkflow_ExpectedHistory Workflow Logs ``` 2023-03-01T16:36:12.2404821Z === RUN Test_EndToEndRedisBackend/SimpleWorkflow_ExpectedHistory 2023-03-01T16:36:12.2432422Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: redis: nil] 2023-03-01T16:36:12.2432984Z Errors while executing commands: redis: nil 2023-03-01T16:36:12.2433283Z 2023/03/01 16:36:12 |DEBUG| Created new workflow instance 2023-03-01T16:36:12.2434057Z 2023/03/01 16:36:12 |DEBUG| Created workflow instance instance_id=d39622e3-5342-4a90-b518-7c097210774d execution_id=207545da-f666-4fc9-92c7-87eb048f2235 2023-03-01T16:36:12.2435389Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2435997Z Errors while executing commands: 2023-03-01T16:36:12.2436680Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: ] 2023-03-01T16:36:12.2437031Z Errors while executing commands: context canceled 2023-03-01T16:36:12.2442377Z 2023/03/01 16:36:12 |ERROR| error while polling for workflow task error=reading workflow instance: unmarshaling instance state: unexpected end of JSON input 2023-03-01T16:36:12.2442985Z value of cmd.Result() creating the error was: 2023-03-01T16:36:12.2444360Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2445141Z Errors while executing commands: 2023-03-01T16:36:12.2461852Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2462516Z Errors while executing commands: 2023-03-01T16:36:12.2478914Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2479392Z Errors while executing commands: 2023-03-01T16:36:12.2519094Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2519627Z Errors while executing commands: 2023-03-01T16:36:12.2556420Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2556895Z Errors while executing commands: 2023-03-01T16:36:12.2614183Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2614804Z Errors while executing commands: 2023-03-01T16:36:12.2685379Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2689882Z Errors while executing commands: 2023-03-01T16:36:12.2833385Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.2834272Z Errors while executing commands: 2023-03-01T16:36:12.3094742Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.3097910Z Errors while executing commands: 2023-03-01T16:36:12.3610442Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.3611001Z Errors while executing commands: 2023-03-01T16:36:12.4017768Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.4018748Z Errors while executing commands: 2023-03-01T16:36:12.4791582Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.4792743Z Errors while executing commands: 2023-03-01T16:36:12.5857694Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.5862567Z Errors while executing commands: 2023-03-01T16:36:12.7754292Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:12.7761000Z Errors while executing commands: 2023-03-01T16:36:13.0046327Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:13.0047266Z Errors while executing commands: 2023-03-01T16:36:13.3513662Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:13.3514236Z Errors while executing commands: 2023-03-01T16:36:14.1269476Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:14.1270632Z Errors while executing commands: 2023-03-01T16:36:14.8357136Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:14.8358162Z Errors while executing commands: 2023-03-01T16:36:15.5398350Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:15.5399295Z Errors while executing commands: 2023-03-01T16:36:16.4011794Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:16.4012689Z Errors while executing commands: 2023-03-01T16:36:17.4719686Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:17.4720249Z Errors while executing commands: 2023-03-01T16:36:18.8361862Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:18.8362531Z Errors while executing commands: 2023-03-01T16:36:19.6299286Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:19.6299849Z Errors while executing commands: 2023-03-01T16:36:20.4275895Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:20.4276528Z Errors while executing commands: 2023-03-01T16:36:21.6800336Z Executed commands: [get instance:d39622e3-5342-4a90-b518-7c097210774d: {"instance":{"instance_id":"d39622e3-5342-4a90-b518-7c097210774d","execution_id":"207545da-f666-4fc9-92c7-87eb048f2235"},"metadata":{},"created_at":"2023-03-01T16:36:12.241931037Z"}] 2023-03-01T16:36:21.6803292Z Errors while executing commands: 2023-03-01T16:36:21.6814345Z e2e.go:51: 2023-03-01T16:36:21.6815158Z Error Trace: /home/runner/work/go-workflows/go-workflows/backend/redis/e2e.go:51 2023-03-01T16:36:21.6816217Z /home/runner/work/go-workflows/go-workflows/backend/redis/e2e.go:588 2023-03-01T16:36:21.6816659Z Error: Received unexpected error: 2023-03-01T16:36:21.6817276Z workflow did not finish in specified timeout 2023-03-01T16:36:21.6817800Z Test: Test_EndToEndRedisBackend/SimpleWorkflow_ExpectedHistory ```
cschleiden commented 1 year ago

🤔 are we that much slower so that we run into the timeout? Or what is canceling the context?

zaataylor commented 1 year ago

🤔 are we that much slower so that we run into the timeout?

I think the timeout could be a side-effect of the original failure to read the instance, possibly? Or the fact that the workflow workflow runPoll timeout (30 seconds) is longer than the time the test waits for the workflow to finish (10 seconds), so maybe if the poll fails the first time, it isn't able to re-poll before test timeout? I could try to reduce the worker timeout to a value lower than the test timeout and see if that makes a difference.

Or what is canceling the context?

I'm still stumped on this, unfortunately ☚ī¸. I thought those pipeline thread-safety changes might have something to do with it, but I'm not seeing a very clear connection yet :/

EDIT The only place I'm seeing a context being canceled that's relevant to the current scenario is in the poll() function at: https://github.com/cschleiden/go-workflows/blob/d2b463129b7f70ebfb202d3efaf1e34abfa2fa5e/internal/worker/workflow.go#L229-L232

zaataylor commented 1 year ago

I could try to reduce the worker timeout to a value lower than the test timeout and see if that makes a difference.

This didn't make a difference, really not sure what the root issue is

cschleiden commented 1 year ago

Superseded by #179