dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Packet drops after restarting Temporal Cluster #203

Open dhiaayachi opened 2 weeks ago

dhiaayachi commented 2 weeks ago

Expected Behavior

Whenever we restart our Temporal clusters (new deployments, releases, etc.), we observe network packet drops with cause stale_or_unroutable_ip. We expect restarts to cause no such drops.

Drops: image

The old IP addresses of the Temporal services are still in the cluster_membership table:

temporal=> SELECT * from cluster_membership;
 membership_partition |              host_id               |  rpc_address  | rpc_port | role |       session_start        |       last_heartbeat       |       record_expiry
----------------------+------------------------------------+---------------+----------+------+----------------------------+----------------------------+----------------------------
                    0 | \x27f0726cc4f511ee8c4a7e886be421e8 | 172.16.36.211 |     6933 |    1 | 2024-02-06 13:39:31.178854 | 2024-02-06 13:54:17.480887 | 2024-02-08 13:54:17.480887
                    0 | \x406ee616c4f511ee9c2c4eff8ed03845 | 172.16.34.153 |     6934 |    2 | 2024-02-06 13:40:12.256704 | 2024-02-06 13:54:18.532144 | 2024-02-08 13:54:18.532144
                    0 | \x9859b450c1a611eeb02a968626816f6d | 172.16.36.100 |     6939 |    4 | 2024-02-02 08:39:36.020235 | 2024-02-06 13:39:21.505523 | 2024-02-08 13:39:21.505523
                    0 | \x8e3abc89c1a611ee860fba5a10a03e3e | 172.16.34.175 |     6933 |    1 | 2024-02-02 08:39:19.077945 | 2024-02-06 13:39:21.516706 | 2024-02-08 13:39:21.516706
                    0 | \x2737ad64c4f511ee9e3dea8b31a7ca50 | 172.16.37.76  |     6935 |    3 | 2024-02-06 13:39:29.943083 | 2024-02-06 13:54:19.267353 | 2024-02-08 13:54:19.267353
                    0 | \x8bf5d329c1a611eeab2d1663815337d5 | 172.16.25.71  |     6934 |    2 | 2024-02-02 08:39:15.230349 | 2024-02-06 13:39:22.505874 | 2024-02-08 13:39:22.505874
                    0 | \x2767ad39c4f511eeb5b4e2f2f4ca461b | 172.16.33.254 |     6934 |    2 | 2024-02-06 13:39:30.276643 | 2024-02-06 13:54:19.615724 | 2024-02-08 13:54:19.615724
                    0 | \x8cf287a6c1a611eea8ee2630adf79dee | 172.16.25.155 |     6933 |    1 | 2024-02-02 08:39:16.920261 | 2024-02-06 13:39:24.229088 | 2024-02-08 13:39:24.229088
                    0 | \x8d2695a8c1a611eeaf43ba9bd9d72b37 | 172.16.34.249 |     6936 |    5 | 2024-02-02 08:39:17.234805 | 2024-02-06 13:39:24.928443 | 2024-02-08 13:39:24.928443
                    0 | \x9b785481c1a611ee8e1b626cb429a928 | 172.16.37.235 |     6933 |    1 | 2024-02-02 08:39:41.263783 | 2024-02-06 13:39:25.923658 | 2024-02-08 13:39:25.923658
                    0 | \x26343fa1c4f511eebb64d26e62cd671b | 172.16.23.217 |     6933 |    1 | 2024-02-06 13:39:28.256783 | 2024-02-06 13:54:19.640885 | 2024-02-08 13:54:19.640885
                    0 | \x2ec9f65fc4f511eebb0c42b810af6adf | 172.16.10.37  |     6933 |    1 | 2024-02-06 13:39:42.645453 | 2024-02-06 13:54:19.924845 | 2024-02-08 13:54:19.924845
                    0 | \x3546754ac4f511eeb0df5e406aef04b7 | 172.16.9.44   |     6939 |    4 | 2024-02-06 13:39:53.52417  | 2024-02-06 13:54:21.815246 | 2024-02-08 13:54:21.815246
                    0 | \x9b761baac1a611ee86861ee633e6ff6e | 172.16.37.57  |     6934 |    2 | 2024-02-02 08:39:41.249319 | 2024-02-06 13:40:03.892467 | 2024-02-08 13:40:03.892467
                    0 | \x2bd60bffc4f511ee8038f2a15c53c361 | 172.16.33.218 |     6933 |    1 | 2024-02-06 13:39:37.706043 | 2024-02-06 13:54:22.096942 | 2024-02-08 13:54:22.096942
                    0 | \x2e034424c4f511ee8b8422fe0b9db541 | 172.16.2.98   |     6939 |    4 | 2024-02-06 13:39:41.342732 | 2024-02-06 13:54:22.630096 | 2024-02-08 13:54:22.630096
                    0 | \x264e5763c4f511eeb6516a80b6635bad | 172.16.8.104  |     6936 |    5 | 2024-02-06 13:39:28.414856 | 2024-02-06 13:54:23.857224 | 2024-02-08 13:54:23.857224
                    0 | \x8d48423fc1a611eeb3267a0ba9253295 | 172.16.33.56  |     6936 |    5 | 2024-02-02 08:39:17.478502 | 2024-02-06 13:39:28.570702 | 2024-02-08 13:39:28.570702
                    0 | \x2a052748c4f511eea9d3be82b1fb36b9 | 172.16.23.57  |     6936 |    5 | 2024-02-06 13:39:34.658886 | 2024-02-06 13:54:25.05269  | 2024-02-08 13:54:25.05269
                    0 | \x30ba0b59c4f511ee9640622359ce980d | 172.16.36.75  |     6935 |    3 | 2024-02-06 13:39:45.89489  | 2024-02-06 13:54:25.22052  | 2024-02-08 13:54:25.22052
                    0 | \x2ef49a0cc4f511ee8241c67c007be2c5 | 172.16.11.190 |     6935 |    3 | 2024-02-06 13:39:42.933887 | 2024-02-06 13:54:25.330191 | 2024-02-08 13:54:25.330191
                    0 | \x44663929c4f511eea6d4c6614722b5d8 | 172.16.37.45  |     6934 |    2 | 2024-02-06 13:40:18.911727 | 2024-02-06 13:54:26.233305 | 2024-02-08 13:54:26.233305
                    0 | \x8fb19965c1a611ee8a3daac84ef7ce0e | 172.16.33.10  |     6935 |    3 | 2024-02-02 08:39:21.508819 | 2024-02-06 13:39:31.954545 | 2024-02-08 13:39:31.954545
                    0 | \x903b6e8bc1a611ee95c81e7a8687d124 | 172.16.25.182 |     6939 |    4 | 2024-02-02 08:39:22.394533 | 2024-02-06 13:39:36.146894 | 2024-02-08 13:39:36.146894
                    0 | \x9b7215a2c1a611ee95cb16a506174482 | 172.16.37.81  |     6939 |    4 | 2024-02-02 08:39:41.208757 | 2024-02-06 13:39:40.488594 | 2024-02-08 13:39:40.488594
                    0 | \x8f9e505bc1a611ee86e66e9c4e8dfa7b | 172.16.34.121 |     6935 |    3 | 2024-02-02 08:39:21.388292 | 2024-02-06 13:39:41.75921  | 2024-02-08 13:39:41.75921
                    0 | \x90305675c1a611eeaf185a86e9283977 | 172.16.25.153 |     6935 |    3 | 2024-02-02 08:39:22.317386 | 2024-02-06 13:39:42.811889 | 2024-02-08 13:39:42.811889
                    0 | \x8c09ae21c1a611ee90916e2ec790e4bc | 172.16.33.9   |     6934 |    2 | 2024-02-02 08:39:15.388081 | 2024-02-06 13:39:04.497511 | 2024-02-08 13:39:04.497511
                    0 | \x8c135f47c1a611ee8f015a97772859d3 | 172.16.33.97  |     6933 |    1 | 2024-02-02 08:39:15.460704 | 2024-02-06 13:39:13.370839 | 2024-02-08 13:39:13.370839
                    0 | \x8d28167fc1a611eeb2f7f225c06c088b | 172.16.34.92  |     6934 |    2 | 2024-02-02 08:39:17.259959 | 2024-02-06 13:39:46.189244 | 2024-02-08 13:39:46.189244
                    0 | \x2633938fc4f511eebef86e735020f6ec | 172.16.23.84  |     6939 |    4 | 2024-02-06 13:39:28.253228 | 2024-02-06 13:54:15.639278 | 2024-02-08 13:54:15.639278
                    0 | \x34a909b4c4f511eebde066f106d6cd31 | 172.16.25.55  |     6934 |    2 | 2024-02-06 13:39:52.50008  | 2024-02-06 13:54:15.744222 | 2024-02-08 13:54:15.744222
(32 rows)

temporal-prod-operator-6b9bf85f75-mp4h4:/$ tctl adm cl d
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.22.4",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.16.10.37:7233"
    },
    "reachableMembers": [
      "172.16.25.55:6934",
      "172.16.8.104:6936",
      "172.16.37.76:6935",
      "172.16.9.44:6939",
      "172.16.23.57:6936",
      "172.16.37.45:6934",
      "172.16.2.98:6939",
      "172.16.36.75:6935",
      "172.16.10.37:6933",
      "172.16.23.217:6933",
      "172.16.33.218:6933",
      "172.16.34.153:6934",
      "172.16.36.211:6933",
      "172.16.11.190:6935",
      "172.16.23.84:6939",
      "172.16.33.254:6934"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 4,
        "members": [
          {
            "identity": "172.16.23.217:7233"
          },
          {
            "identity": "172.16.36.211:7233"
          },
          {
            "identity": "172.16.33.218:7233"
          },
          {
            "identity": "172.16.10.37:7233"
          }
        ]
      },
      {
        "role": "internal-frontend",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.16.8.104:7236"
          },
          {
            "identity": "172.16.23.57:7236"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 4,
        "members": [
          {
            "identity": "172.16.33.254:7234"
          },
          {
            "identity": "172.16.34.153:7234"
          },
          {
            "identity": "172.16.25.55:7234"
          },
          {
            "identity": "172.16.37.45:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 3,
        "members": [
          {
            "identity": "172.16.11.190:7235"
          },
          {
            "identity": "172.16.37.76:7235"
          },
          {
            "identity": "172.16.36.75:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 3,
        "members": [
          {
            "identity": "172.16.23.84:7239"
          },
          {
            "identity": "172.16.2.98:7239"
          },
          {
            "identity": "172.16.9.44:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "25cb4f70-a15f-4ae8-81bd-ddb68242a8eb",
  "clusterName": "active",
  "historyShardCount": 8192,
  "persistenceStore": "postgres",
  "visibilityStore": "postgres",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}
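
Comparing the two outputs above shows which rows are stale: any rpc_address:rpc_port pair in cluster_membership that is missing from reachableMembers belongs to a host from before the restart. A minimal Python sketch of that comparison follows. It uses only a subset of the table rows, and the function name find_stale_rows is hypothetical, not part of Temporal or tctl.

```python
# Subset of (rpc_address, rpc_port, last_heartbeat) rows copied from the
# cluster_membership output above; the full table has 32 rows.
membership_rows = [
    ("172.16.36.211", 6933, "2024-02-06 13:54:17.480887"),
    ("172.16.34.153", 6934, "2024-02-06 13:54:18.532144"),
    ("172.16.36.100", 6939, "2024-02-06 13:39:21.505523"),
    ("172.16.34.175", 6933, "2024-02-06 13:39:21.516706"),
    ("172.16.37.76", 6935, "2024-02-06 13:54:19.267353"),
    ("172.16.25.71", 6934, "2024-02-06 13:39:22.505874"),
]

# reachableMembers copied from the `tctl adm cl d` output above.
reachable_members = {
    "172.16.25.55:6934", "172.16.8.104:6936", "172.16.37.76:6935",
    "172.16.9.44:6939", "172.16.23.57:6936", "172.16.37.45:6934",
    "172.16.2.98:6939", "172.16.36.75:6935", "172.16.10.37:6933",
    "172.16.23.217:6933", "172.16.33.218:6933", "172.16.34.153:6934",
    "172.16.36.211:6933", "172.16.11.190:6935", "172.16.23.84:6939",
    "172.16.33.254:6934",
}


def find_stale_rows(rows, reachable):
    """Return rows whose address:port is not among the reachable members."""
    return [row for row in rows if f"{row[0]}:{row[1]}" not in reachable]


stale = find_stale_rows(membership_rows, reachable_members)
for address, port, last_heartbeat in stale:
    print(f"{address}:{port} last heartbeat {last_heartbeat}")
```

Applied to the full table, the same check flags 16 of the 32 rows: the ones whose session_start is 2024-02-02 and whose last_heartbeat stops around 13:39 to 13:40 on 2024-02-06.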

Actual Behavior

The old IPs remain in the cluster_membership table after a restart. I expect them to be removed.
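
In the cluster_membership rows above, record_expiry sits exactly 48 hours after last_heartbeat. If rows are pruned only once record_expiry has passed (an assumption; the thread does not confirm how pruning works), the old addresses would stay in the table for about two days after a restart. A small Python sketch of that arithmetic, using one of the stale rows; expiry_window is a hypothetical helper name:

```python
from datetime import datetime

# Timestamp layout used in the psql output above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def expiry_window(last_heartbeat, record_expiry):
    """Return how long a row outlives its last heartbeat."""
    heartbeat = datetime.strptime(last_heartbeat, TIMESTAMP_FORMAT)
    expiry = datetime.strptime(record_expiry, TIMESTAMP_FORMAT)
    return expiry - heartbeat


# Values from the stale row for 172.16.36.100 in the table above.
window = expiry_window("2024-02-06 13:39:21.505523", "2024-02-08 13:39:21.505523")
print(window)  # 2 days, 0:00:00
```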

Steps to Reproduce the Problem

  1. Restart temporal services

Specifications

dhiaayachi commented 22 hours ago

Thank you for reporting this issue.

Based on the attached documents, this does not appear to be a known issue with Temporal Server version 1.22.4.

To resolve this issue, you can try deleting the stale rows, i.e. those whose rpc_address no longer appears in reachableMembers:

DELETE FROM cluster_membership WHERE rpc_address IN ('172.16.36.100', '172.16.34.175', ...);
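
The IN list in the statement above has to be filled in by hand. A minimal Python sketch of generating the statement with placeholders instead, so the addresses are passed as parameters rather than pasted into the SQL text. The helper name build_delete is hypothetical, and the %s placeholder style is an assumption that matches drivers such as psycopg. The thread does not confirm that manually deleting membership rows is safe, so this only illustrates how the statement could be built.

```python
def build_delete(addresses):
    """Build a parameterized DELETE for the given rpc_address values."""
    if not addresses:
        raise ValueError("no addresses given")
    placeholders = ", ".join(["%s"] * len(addresses))
    sql = f"DELETE FROM cluster_membership WHERE rpc_address IN ({placeholders});"
    return sql, list(addresses)


# Two addresses that appear in the table but not in reachableMembers.
sql, params = build_delete(["172.16.36.100", "172.16.34.175"])
print(sql)
print(params)
```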

Please provide the following information to help me further troubleshoot this issue:

I will try my best to help you resolve this issue.