conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0
18.15k stars 498 forks source link

workflow_index.update_time NOT NULL contraint breaking upgrade #237

Open lbestatlas opened 3 months ago

lbestatlas commented 3 months ago

Describe the bug This bug has several aspects:

  1. When running the postgres v13 flyway upgrade we have found some workflows without update_time in the json_data column, which causes update_time to be null when the data is backfilled in the v13 script. The subsequent application of the NOT NULL constraint fails
  2. As we are running old and new versions of conductor concurrently as we rollout, workflows executed in the old version have failures indexing workflows as workflow_index.update_time is NULL
  3. The new code does not set the update_time in the case of an update only an insert, so is unlikely to fix the issue it was intended to fix.
  4. The backfill script on the workflow_index can lock up the table for minutes in environments where there is a large number of workflows.

We detected these issues when upgrade our service running v3.20.0 to v3.21.3. We run conductor with the following relevant properties as per the defaults ConductorProperties.asyncIndexingEnabled = false PostgresProperties.onlyIndexOnStatusChange = false

Details Conductor version: 3.21.3 Persistence implementation: Postgres Queue implementation: dynomite Lock: redis Workflow definition: n/a Task definition: n/a Event handler definition: n/a

To Reproduce Run the flyway upgrade with running workflows

Expected behavior The upgrade succeeds and no errors occur during the rollout where new and old versions are executing workflows

Screenshots If applicable, add screenshots to help explain your problem.

Additional context This is an examples of the error from the flyway upgrade

SQL State  : 23502
Error Code : 0
Message    : ERROR: column "update_time" of relation "workflow_index" contains null values
Location   : db/migration_postgres/V13__workflow_index_columns.sql (/opt/service/nested:/opt/service/service.jar/!BOOT-INF/lib/conductor-postgres-persistence-3.21.3.jar!/db/migration_postgres/V13__workflow_index_columns.sql)
Line       : 7
Statement  : Run Flyway with -X option to see the actual statement causing the problem
lbestatlas commented 3 months ago

Our proposed fix is to:

  1. Remove the v13 migration
  2. Add a v14 migration which adds update_time with a default of 'epoch' if the column doesn't exist, and sets the default of 'epoch' if the column does exist
  3. Fix the PostgresIndexDao.indexWorkflow to also set the update_time during an UPDATE
  4. Remove the backfill entirely as this data does not really require to be backfilled for the purpose of this change
dingliu1 commented 2 months ago

Hi, I meet the similar issue when I upgrade from v3.18.0 to v3.21.5 It seems that conductor oss detect the v3.18.0 db schema version is 12 and need do db migration. Then db migration from v12 to v13 failed.

7831 [main] INFO org.flywaydb.core.internal.command.DbValidate [] - Successfully validated 13 migrations (execution time 00:00.060s) 7857 [main] INFO org.flywaydb.core.internal.command.DbMigrate [] - Current version of schema "public": 12 7857 [main] WARN org.flywaydb.core.internal.command.DbMigrate [] - outOfOrder mode is active. Migration of schema "public" may not be reproducible. 7880 [main] INFO org.flywaydb.core.internal.command.DbMigrate [] - Migrating schema "public" to version "13 - workflow index columns" 8703 [main] ERROR org.flywaydb.core.internal.command.DbMigrate [] - Migration of schema "public" to version "13 - workflow index columns" failed! Changes successfully rolled back. 8713 [main] WARN org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext [] - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'taskResource' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-rest.jar!/com/netflix/conductor/rest/controllers/TaskResource.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'taskServiceImpl' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/service/TaskServiceImpl.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'executionService' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/service/ExecutionService.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'workflowExecutorOps' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/core/execution/WorkflowExecutorOps.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'deciderService' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/core/execution/DeciderService.class]: Unsatisfied dependency expressed through constructor parameter 2: Error creating bean with name 'flywayForPrimaryDb' defined in class path resource [com/netflix/conductor/postgres/config/PostgresConfiguration.class]: Migration V13__workflow_index_columns.sql failed

SQL State : 23502 Error Code : 0 Message : ERROR: column "update_time" of relation "workflow_index" contains null values Location : db/migration_postgres/V13__workflow_index_columns.sql (/app/libs/nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-postgres-persistence.jar!/db/migration_postgres/V13__workflow_index_columns.sql) Line : 7 Statement : ALTER TABLE workflow_index ALTER COLUMN update_time SET NOT NULL

8714 [main] INFO com.zaxxer.hikari.HikariDataSource [] - HikariPool-1 - Shutdown initiated...

dingliu1 commented 2 months ago

Found a workaround - we just query all workflows with NULL updateTime and all of them are in running state. We terminate each of the workflow so now they have updateTime. Now upgrade succeeds.