[bug] several containers in minnesota restart frequently #388

Closed LavenderQAQ closed 1 year ago

LavenderQAQ commented 1 year ago

🐞 Bug Report

Affected Services [REQUIRED]

The issue is located in: app-service-configurable, command, support-notification, support-scheduler, core-metadata

Is this a regression?

Yes, the previous version in which this bug was not present was: levski and all previous versions

Description and Minimal Reproduction [REQUIRED]

I deployed the latest version using docker-compose with the command:

```shell
make run no-secty
```

🔥 Exception or Error

Many containers are in a state of constant restart:

CONTAINER ID   IMAGE                                         COMMAND                  CREATED        STATUS                  PORTS                                                                        NAMES
9bb8fe964f80   edgexfoundry/app-service-configurable:3.0.0   "/app-service-config…"   14 hours ago   Up 29 seconds           48095/tcp,>59701/tcp                                        edgex-app-rules-engine
e4426ebbda4c   edgexfoundry/device-rest:3.0.0                "/device-rest --cp=c…"   14 hours ago   Up 58 seconds >59986/tcp                                                   edgex-device-rest
0ced9d3f9e2b   edgexfoundry/device-virtual:3.0.0             "/device-virtual --c…"   14 hours ago   Up Less than a second>59900/tcp                                                   edgex-device-virtual
26a67cae8242   edgexfoundry/core-data:3.0.0                  "/core-data -cp=cons…"   14 hours ago   Up 29 seconds >59880/tcp                                                   edgex-core-data
1faa2e105738   edgexfoundry/core-command:3.0.0               "/core-command -cp=c…"   14 hours ago   Up 44 seconds >59882/tcp                                                   edgex-core-command
0a6a22b9a751   edgexfoundry/support-notifications:3.0.0      "/support-notificati…"   14 hours ago   Up 14 seconds >59860/tcp                                                   edgex-support-notifications
402748e680c7   lfedge/ekuiper:1.9.2-alpine                   "/usr/bin/docker-ent…"   14 hours ago   Up 14 hours             9081/tcp, 20498/tcp,>59720/tcp                              edgex-kuiper
12eb1d3e17e4   edgexfoundry/support-scheduler:3.0.0          "/support-scheduler …"   14 hours ago   Up 44 seconds >59861/tcp                                                   edgex-support-scheduler
43300a40e3a6   edgexfoundry/core-metadata:3.0.0              "/core-metadata -cp=…"   14 hours ago   Up 19 seconds >59881/tcp                                                   edgex-core-metadata
12c124bc9809   hashicorp/consul:1.15.2                       "docker-entrypoint.s…"   14 hours ago   Up 14 hours             8300-8302/tcp, 8301-8302/udp, 8600/tcp, 8600/udp,>8500/tcp   edgex-core-consul
c486850d9b07   redis:7.0.11-alpine                           "docker-entrypoint.s…"   14 hours ago   Up 14 hours   >6379/tcp                                                     edgex-redis
55b23383028a   edgexfoundry/edgex-ui:3.0.0                   "./edgex-ui-server -…"   14 hours ago   Up 14 hours   >4000/tcp, :::4000->4000/tcp                
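
A quick way to pick the restarting containers out of a listing like the one above is to filter on the status column. The following is only a sketch: it assumes the listing is produced with `docker ps --format '{{.Names}}|{{.Status}}'`, and the helper name `short_uptime` is made up for illustration.

```shell
# Hypothetical helper: read "<name>|<status>" lines on stdin and print the
# names whose uptime is still measured in seconds, a hint of a restart loop.
short_uptime() {
  awk -F '|' '$2 ~ /^Up (Less than a second|[0-9]+ seconds?)/ { print $1 }'
}

# Against a live daemon (assumes the docker CLI is on PATH):
#   docker ps --format '{{.Names}}|{{.Status}}' | short_uptime
```

Running it a few times in a row should keep returning the same services if they are really stuck in a restart loop, while long-running containers such as Consul and Redis never show up.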

🌍 Your Environment

Deployment Environment:

Linux 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

EdgeX Version [REQUIRED]: minnesota

Anything else relevant?

LavenderQAQ commented 1 year ago

Most components report error:

msg="configuration provider is not available"

cloudxxx8 commented 1 year ago

Everything works fine on my computer. It looks like you are missing the container edgex-core-common-config-bootstrapper. What are your steps? Here are mine:

  1. git clone a clean edgex-compose
  2. git checkout minnesota branch
  3. make run no-secty

LavenderQAQ commented 1 year ago

@cloudxxx8 My steps are exactly the same as yours, and I haven't changed any code. I just checked it out and found that edgex-core-common-config-bootstrapper unexpectedly quit:

0714c01e6b4d   edgexfoundry/core-common-config-bootstrapper:3.0.0   "entrypoint.sh /core…"   About a minute ago   Exited (1) 44 seconds ago   edgex-core-common-config-bootstrapper

Here is its last log:

level=ERROR ts=2023-06-08T00:52:55.640858938Z app=core-common-config-bootstrapper source=main.go:116 msg="failed to determine if common configuration exists in the provider: checking configuration existence from Consul failed: Unexpected response code: 503 (ERROR: The requested URL could not be retrieved - Unable to determine IP address from host name 'edgex-core-consul' - The DNS server returned: Name Error: The domain name does not exist.)"
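
Because an exited container does not appear in a plain `docker ps`, it can help to scan `docker ps -a` for non-zero exit codes before reading logs. This is a minimal sketch under the assumption that the listing is produced with `docker ps -a --format '{{.Names}}|{{.Status}}'`; the helper name `failed_containers` is hypothetical.

```shell
# Hypothetical helper: read "<name>|<status>" lines on stdin and print the
# containers that exited with a non-zero code, as "<name> <code>".
failed_containers() {
  sed -n 's/^\([^|]*\)|Exited (\([1-9][0-9]*\)).*$/\1 \2/p'
}

# Against a live daemon (assumes the docker CLI is on PATH):
#   docker ps -a --format '{{.Names}}|{{.Status}}' | failed_containers
#   docker logs --tail 20 edgex-core-common-config-bootstrapper
```
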

cloudxxx8 commented 1 year ago

According to the error log, the problem is related to Consul, so you need to check the Consul log.

LavenderQAQ commented 1 year ago

@cloudxxx8 It's weird. Consul's logs look fine:

==> Starting Consul agent...
              Version: '1.15.2'
           Build Date: '2023-03-30 17:51:19 +0000 UTC'
              Node ID: '1afb88bb-2185-6dbb-0995-6d92aefa455e'
            Node name: 'edgex-core-consul'
           Datacenter: 'dc1' (Segment: '<all>')
               Server: true (Bootstrap: true)
          Client Addr: [] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: 8503, DNS: 8600)
         Cluster Addr: (LAN: 8301, WAN: 8302)
    Gossip Encryption: false
     Auto-Encrypt-TLS: false
            HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
             gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
     Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2

==> Log data will now stream in as it occurs:

2023-06-08T05:39:58.803Z [WARN]  agent: bootstrap = true: do not enable unless necessary
2023-06-08T05:39:58.809Z [WARN]  agent.auto_config: bootstrap = true: do not enable unless necessary
2023-06-08T05:39:59.041Z [INFO]  agent.server.raft: initial configuration: index=8273 servers="[{Suffrage:Voter ID:1afb88bb-2185-6dbb-0995-6d92aefa455e Address:}]"
2023-06-08T05:39:59.041Z [INFO]  agent.server.raft: entering follower state: follower="Node at [Follower]" leader-address= leader-id=
2023-06-08T05:39:59.042Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: edgex-core-consul.dc1
2023-06-08T05:39:59.042Z [WARN]  agent.server.serf.wan: serf: Failed to re-join any previously known node
2023-06-08T05:39:59.043Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: edgex-core-consul
2023-06-08T05:39:59.043Z [INFO]  agent.router: Initializing LAN area manager
2023-06-08T05:39:59.043Z [WARN]  agent.server.serf.lan: serf: Failed to re-join any previously known node
2023-06-08T05:39:59.043Z [INFO]  agent.server: Adding LAN server: server="edgex-core-consul (Addr: tcp/ (DC: dc1)"
2023-06-08T05:39:59.043Z [INFO]  agent.server.autopilot: reconciliation now disabled
2023-06-08T05:39:59.044Z [INFO]  agent.server: Handled event for server in area: event=member-join server=edgex-core-consul.dc1 area=wan
2023-06-08T05:39:59.046Z [INFO]  agent.server.cert-manager: initialized server certificate management
2023-06-08T05:39:59.047Z [INFO]  agent: Started DNS server: address= network=tcp
2023-06-08T05:39:59.047Z [INFO]  agent: Started DNS server: address= network=udp
2023-06-08T05:39:59.047Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
2023-06-08T05:39:59.048Z [INFO]  agent: Started gRPC listeners: port_name=grpc_tls address=[::]:8503 network=tcp
2023-06-08T05:39:59.048Z [INFO]  agent: started state syncer
2023-06-08T05:39:59.048Z [INFO]  agent: Consul agent running!
2023-06-08T05:40:05.239Z [INFO]  agent: Newer Consul version available: new_version=1.15.3 current_version=1.15.2
2023-06-08T05:40:05.940Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-06-08T05:40:05.940Z [INFO]  agent.server.raft: entering candidate state: node="Node at [Candidate]" term=5
2023-06-08T05:40:05.948Z [INFO]  agent.server.raft: election won: term=5 tally=1
2023-06-08T05:40:05.948Z [INFO]  agent.server.raft: entering leader state: leader="Node at [Leader]"
2023-06-08T05:40:05.948Z [INFO]  agent.server: cluster leadership acquired
2023-06-08T05:40:05.948Z [INFO]  agent.server: New leader elected: payload=edgex-core-consul
2023-06-08T05:40:06.230Z [INFO]  agent.server.autopilot: reconciliation now enabled
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="federation state anti-entropy"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="federation state pruning"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="streaming peering resources"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="metrics for streaming peering resources"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="peering deferred deletion"
2023-06-08T05:40:06.231Z [INFO]  connect.ca: initialized primary datacenter CA from existing CARoot with provider: provider=consul
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="intermediate cert renew watch"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="CA root pruning"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="CA root expiration metric"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="CA signing expiration metric"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="virtual IP version check"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: started routine: routine="config entry controllers"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: stopping routine: routine="virtual IP version check"
2023-06-08T05:40:06.231Z [INFO]  agent.leader: stopped routine: routine="virtual IP version check"
2023-06-08T05:40:06.231Z [INFO]  agent.server.raft: updating configuration: command=AddVoter server-id=1afb88bb-2185-6dbb-0995-6d92aefa455e server-addr= servers="[{Suffrage:Voter ID:1afb88bb-2185-6dbb-0995-6d92aefa455e Address:}]"
2023-06-08T05:40:06.235Z [INFO]  agent.server: member joined, marking health alive: member=edgex-core-consul partition=default
2023-06-08T05:40:17.606Z [INFO]  agent: Synced node info

LavenderQAQ commented 1 year ago

I redeployed after running make clean, and the problem did not recur. It was probably caused by leftover configuration from a previous version.
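
For anyone hitting the same symptom, the clean redeploy described above can be sketched as follows. It assumes the current directory is an edgex-compose checkout on the minnesota branch, and that `make clean` there removes the stopped containers and volumes left over from an earlier release. The `run` and `redeploy_clean` helpers and the `DRY_RUN` variable are illustrative names, not part of edgex-compose.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clean up state from earlier deployments, then start the non-secure stack.
redeploy_clean() {
  run make clean &&
  run make run no-secty
}
```

Calling `redeploy_clean` with `DRY_RUN=1` set only prints the two commands, which is a cheap way to check the sequence before letting it remove anything.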