Netflix / conductor-community

Apache License 2.0

Postgresql Eventing is not working #51

Open astelmashenko opened 2 years ago

astelmashenko commented 2 years ago

Describe the bug Startup issue: a database connection is closed, and after that the queues stop working.

Details

  - Conductor version: 3.4.1
  - Persistence implementation: Postgres 11.12 on AWS
  - Queue implementation: Postgres
  - Lock: Redis

To Reproduce Steps to reproduce the behavior:

  1. Deploy (start) the Conductor server.
  2. Make sure workflows have been added and at least one is running (so that data is present).
  3. Perform a blue-green deployment (a rollout update in k8s, e.g. kubectl rollout restart deployment conductor).
  4. While one instance is stopping and another is starting, errors occur.
  5. See the errors in the logs.

Expected behavior The rollout starts up without issues.

Additional context

2022-04-07 13:32:47 
WARN    2022-04-07T10:32:47,366 414546  com.netflix.conductor.contribs.queue.nats.NATSStreamObservableQueue [jnats-callbacks]   onDisconnect. Disconnected for viax_conductor_COMPLETED
2022-04-07 13:32:47 
WARN    2022-04-07T10:32:47,370 414550  com.zaxxer.hikari.pool.ProxyConnection  [pool-23-thread-1]  HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@614aaa83 marked as broken because of SQLSTATE(08006), ErrorCode(0)
2022-04-07 13:32:47 
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
2022-04-07 13:32:47 
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:350) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52) ~[HikariCP-3.4.5.jar!/:?]
2022-04-07 13:32:47 
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java) ~[HikariCP-3.4.5.jar!/:?]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.util.Query.executeQuery(Query.java:304) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.util.Query.executeAndFetch(Query.java:423) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.lambda$peekMessages$26(PostgresQueueDAO.java:300) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresBaseDAO.query(PostgresBaseDAO.java:225) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.peekMessages(PostgresQueueDAO.java:299) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.popMessages(PostgresQueueDAO.java:314) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.lambda$pollMessages$5(PostgresQueueDAO.java:100) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresBaseDAO.getWithTransactionWithOutErrorPropagation(PostgresBaseDAO.java:166) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.pollMessages(PostgresQueueDAO.java:99) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.postgres.dao.PostgresQueueDAO.pop(PostgresQueueDAO.java:81) ~[conductor-postgres-persistence-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.core.execution.tasks.SystemTaskWorker.pollAndExecute(SystemTaskWorker.java:113) ~[conductor-core-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at com.netflix.conductor.core.execution.tasks.SystemTaskWorker.lambda$startPolling$0(SystemTaskWorker.java:79) ~[conductor-core-3.4.1-SNAPSHOT.jar!/:3.4.1-SNAPSHOT]
2022-04-07 13:32:47 
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
2022-04-07 13:32:47 
    at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
2022-04-07 13:32:47 
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
2022-04-07 13:32:47 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
2022-04-07 13:32:47 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
2022-04-07 13:32:47 
    at java.lang.Thread.run(Unknown Source) [?:?]
2022-04-07 13:32:47 
Caused by: java.net.SocketException: Connection reset
2022-04-07 13:32:47 
    at java.net.SocketInputStream.read(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at java.net.SocketInputStream.read(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at sun.security.ssl.SSLSocketInputRecord.read(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source) ~[?:?]
2022-04-07 13:32:47 
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.PGStream.receiveChar(PGStream.java:443) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2057) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323) ~[postgresql-42.2.20.jar!/:42.2.20]
2022-04-07 13:32:47 
    ... 24 more

apanicker-nflx commented 2 years ago

@rickfish @mactaggart Can you please help look into this? Thanks

rickfish commented 2 years ago

Sorry, I am no longer with the customer that used Postgres.

astelmashenko commented 2 years ago

Additional details: this happens only during a rollout update (blue-green deployment). If we stop the application and then start it, it starts without issues. We are currently experimenting with graceful shutdown checks and settings such as:

server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=1m

Also, please see https://docs.spring.io/spring-boot/docs/current/reference/html/deployment.html#deployment.cloud.kubernetes.container-lifecycle, which includes this note:

When Kubernetes sends a SIGTERM signal to the pod, it waits for a specified time called the termination grace period (the default for which is 30 seconds). If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. If the pod takes longer than 30 seconds to shut down, which could be because you have increased spring.lifecycle.timeout-per-shutdown-phase, make sure to increase the termination grace period by setting the terminationGracePeriodSeconds option in the Pod YAML.
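
A minimal sketch of the grace-period change that note describes, written as a fragment to merge into an existing Deployment manifest rather than a complete one. The Deployment name conductor is taken from the kubectl command in the reproduction steps. The value 90 is an assumption, chosen only so that it exceeds the 1m set in spring.lifecycle.timeout-per-shutdown-phase, and it has not been verified against this issue.

```yaml
# Fragment only: merge into the existing Deployment. Selector, containers,
# and other required fields are omitted here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: conductor   # name assumed from "kubectl rollout restart deployment conductor"
spec:
  template:
    spec:
      # Assumed value: must be longer than the Spring graceful shutdown
      # timeout (1m above), otherwise the pod may receive SIGKILL first.
      terminationGracePeriodSeconds: 90
```

Since timeout-per-shutdown-phase applies to each shutdown phase separately, the total shutdown time can be longer than a single phase timeout, so the grace period may need more headroom than this.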

I'll post an update once we dig deeper into it (if we do). It is working now with those Spring Boot settings.