artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Problem: Enduro sometimes can't find the "bundle-activity" #794

Status: Closed (djjuhasz closed 7 months ago)

djjuhasz commented 10 months ago

Describe the bug

Some Enduro workflow executions fail because the worker is unable to find the "bundle-activity".

To Reproduce

Steps to reproduce the behavior:

  1. Start a transfer as usual

Expected behavior

The workflow execution should find the "bundle-activity" on every run.

Error details

{
  "message": "unable to find activityType=bundle-activity. Supported types: [clean-up-activity, check-sip-structure, metadata-validation, start-transfer-activity, poll-transfer-activity, ZipActivity, internalSessionCreationActivity, internalSessionCompletionActivity, UploadTransferActivity, extract-package, sip-creation, download-activity, allowed-file-formats]",
  "source": "GoSDK",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": null,
  "applicationFailureInfo": {
    "type": "ActivityNotRegisteredError",
    "nonRetryable": false,
    "details": null
  }
}

Additional context

:confused:

djjuhasz commented 10 months ago

Full event log

djjuhasz commented 10 months ago

@sevein @jraddaoui do you have any thoughts on this problem? I'm stumped. :confused:

sevein commented 10 months ago

Very strange. I suggest upgrading Temporal; I think we're still using 1.20.x.

djjuhasz commented 10 months ago

@jraddaoui I think your secret generator PR fixed the problem! :) I'll give it a few days to see if it re-occurs, and if not I'll close this issue.

djjuhasz commented 10 months ago

This happened again today, so neither the secret generator PR nor upgrading Temporal has fixed the problem. :(

djjuhasz commented 10 months ago

I just encountered a new error that may be related to this one. I was testing with commit d4fd85dd339d523f8aaca37583967cbddff19e11 and got the following error in the Temporal UI:

{
  "message": "unable to decode the activity function input payload with error: payload item 0: unable to decode: json: cannot unmarshal object into Go value of type string for function name: download-activity",
  "source": "GoSDK",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": {
    "message": "payload item 0: unable to decode: json: cannot unmarshal object into Go value of type string",
    "source": "GoSDK",
    "stackTrace": "",
    "encodedAttributes": null,
    "cause": {
      "message": "unable to decode: json: cannot unmarshal object into Go value of type string",
      "source": "GoSDK",
      "stackTrace": "",
      "encodedAttributes": null,
      "cause": {
        "message": "unable to decode",
        "source": "GoSDK",
        "stackTrace": "",
        "encodedAttributes": null,
        "cause": null,
        "applicationFailureInfo": {
          "type": "",
          "nonRetryable": false,
          "details": null
        }
      },
      "applicationFailureInfo": {
        "type": "wrapError",
        "nonRetryable": false,
        "details": null
      }
    },
    "applicationFailureInfo": {
      "type": "wrapError",
      "nonRetryable": false,
      "details": null
    }
  },
  "applicationFailureInfo": {
    "type": "wrapError",
    "nonRetryable": false,
    "details": null
  }
}

Interestingly, I changed the output of the download activity from a string to a struct in commit 4f0db3d304ea8c8594861313a7e5f1f8a963d949, and I've restarted the "enduro", "enduro-internal", and "enduro-am" Tilt containers since adding that commit.

I'm starting to wonder if Tilt is sometimes loading old, cached Go binaries instead of the most recent ones, or if old workers aren't always shut down when the container is restarted. :thinking:
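
To illustrate why a leftover worker would produce intermittent rather than constant failures, here is a toy model, not Temporal code. It assumes two workers poll the same task queue and that tasks alternate between them (a simplification; a real server makes no such ordering guarantee). The `worker` type, the worker names, and `dispatch` are hypothetical.

```go
package main

import "fmt"

// worker is a hypothetical model of a process polling a task queue.
type worker struct {
	name       string
	activities map[string]bool // activity types this worker registered
}

// dispatch hands each task to the next poller in round-robin order and
// returns one message per task that landed on a worker lacking the activity.
func dispatch(pollers []worker, tasks []string) []string {
	var failures []string
	for i, task := range tasks {
		w := pollers[i%len(pollers)]
		if !w.activities[task] {
			failures = append(failures,
				fmt.Sprintf("%s: unable to find activityType=%s", w.name, task))
		}
	}
	return failures
}

func main() {
	current := worker{
		name:       "current",
		activities: map[string]bool{"download-activity": true, "bundle-activity": true},
	}
	stale := worker{
		name:       "stale",
		activities: map[string]bool{"download-activity": true},
	}
	tasks := []string{"bundle-activity", "bundle-activity", "bundle-activity", "bundle-activity"}

	// Only the current worker polls: every task succeeds.
	fmt.Println(len(dispatch([]worker{current}, tasks)))

	// A stale worker shares the queue: some tasks fail, others succeed.
	fmt.Println(len(dispatch([]worker{current, stale}, tasks)))
}
```

In this model the failure rate depends only on which poller happens to receive the task, which would match a problem that comes and goes between runs.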

djjuhasz commented 10 months ago

Looks like the app version reported in Tilt on the last restart is correct for the three enduro containers.

enduro:

2023-12-07T23:54:13.694Z    V(0)    enduro  enduro/main.go:87   Starting... {"version": "0.1.0-td4fd85dd3", "pid": 1}

enduro-internal:

2023-12-07T23:54:13.692Z    V(0)    enduro  enduro/main.go:87   Starting... {"version": "0.1.0-td4fd85dd3", "pid": 1}

enduro-am:

2023-12-07T23:54:13.701Z    V(0)    enduro-am-worker    enduro-am-worker/main.go:68 Starting... {"version": "0.1.0-td4fd85dd3", "pid": 1}
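
The same check can be scripted. The sketch below assumes each startup line ends with a JSON object containing a "version" field, as in the three lines above, and that the version embeds a short prefix of the commit hash; `versionFromLog` and `builtFrom` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// versionFromLog extracts the "version" field from the JSON object at the
// end of a startup log line.
func versionFromLog(line string) (string, error) {
	i := strings.Index(line, "{")
	if i < 0 {
		return "", fmt.Errorf("no JSON fields in log line")
	}
	var fields struct {
		Version string `json:"version"`
	}
	if err := json.Unmarshal([]byte(line[i:]), &fields); err != nil {
		return "", err
	}
	return fields.Version, nil
}

// builtFrom reports whether version embeds the first n characters of commit.
func builtFrom(version, commit string, n int) bool {
	return len(commit) >= n && strings.Contains(version, commit[:n])
}

func main() {
	line := `2023-12-07T23:54:13.701Z    V(0)    enduro-am-worker    enduro-am-worker/main.go:68 Starting... {"version": "0.1.0-td4fd85dd3", "pid": 1}`
	commit := "d4fd85dd339d523f8aaca37583967cbddff19e11"

	version, err := versionFromLog(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(version, builtFrom(version, commit, 9))
}
```

A match only shows which build each container started with; it does not show which worker processes are currently polling the task queue.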

djjuhasz commented 10 months ago

The enduro container shows the "processing-workflow" scheduling the "download-activity" (which failed):

2023-12-07T23:54:44.322Z    V(0)    enduro.temporal-client  log/with_logger.go:69   ExecuteActivity {"level": "debug", "Namespace": "default", "TaskQueue": "global", "WorkerID": "1@enduro-7574989cbf-fbsr4@", "WorkflowType": "processing-workflow", "WorkflowID": "processing-workflow-150b833a-1528-4709-95fe-5a6741ae5298", "RunID": "50f855c8-a5f2-4712-b519-bedddc7c17e3", "Attempt": 1, "ActivityID": "14", "ActivityType": "download-activity"}

But enduro-am doesn't show any activity. On a successful run, the download activity logs an "Executing DownloadActivity" message (with parameters) in the Tilt enduro-am console.

djjuhasz commented 10 months ago

I just tried a second run without restarting any containers and got the same "download-activity" error. Interestingly, the Temporal UI doesn't show an enduro-am worker registered:

image

djjuhasz commented 10 months ago

Tried restarting containers individually:

  1. Restarted enduro-am, same error. Watcher workflow ran on "enduro-internal", processing workflow ran on "enduro"
  2. Restarted enduro, same error. Watcher workflow ran on "enduro-internal", processing workflow ran on "enduro"
  3. Restarted enduro-internal, SUCCESSFUL run. Watcher workflow ran on "enduro", processing workflow ran on "enduro"

I tried a second transfer (without any further container restarts) and it worked fine too.

sevein commented 7 months ago

@djjuhasz is this still happening?

djjuhasz commented 7 months ago

@sevein no, I haven't had this problem in a while. I'll close the ticket.