StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops" https://stackstorm.com/) High Availability fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

Several stackstorm pods status show as CrashLoopBackOff or ERROR #171

Closed srehubot closed 3 years ago

srehubot commented 3 years ago

After deploying StackStorm, several pods show a status of CrashLoopBackOff or Error, as below:

chatops-st2actionrunner-6d5dc6cbd8-9dbbg      0/1     CrashLoopBackOff   6          10m
chatops-st2actionrunner-6d5dc6cbd8-jkkbg      0/1     CrashLoopBackOff   6          10m
chatops-st2actionrunner-6d5dc6cbd8-k95q8      0/1     CrashLoopBackOff   6          10m
chatops-st2actionrunner-6d5dc6cbd8-q27ls      0/1     CrashLoopBackOff   6          10m
chatops-st2actionrunner-6d5dc6cbd8-xmzkf      1/1     Running            6          10m
chatops-st2api-5cd546c87b-9wdsd               0/1     CrashLoopBackOff   6          10m
chatops-st2api-5cd546c87b-f85q8               0/1     Error              6          10m
chatops-st2auth-688d99c477-6gclq              0/1     CrashLoopBackOff   6          10m
chatops-st2auth-688d99c477-hhhhl              1/1     Running            6          10m
chatops-st2chatops-b9f5fb985-gp686            1/1     Running            0          10m
chatops-st2client-c5df8cbcc-pw6j7             1/1     Running            0          10m
chatops-st2garbagecollector-666fdd669-j2lth   0/1     CrashLoopBackOff   6          10m
chatops-st2notifier-6bdc48b6d8-79cnp          0/1     Error              6          10m
chatops-st2notifier-6bdc48b6d8-7qjr5          0/1     CrashLoopBackOff   6          10m
chatops-st2rulesengine-6c56b7f77d-bn54w       0/1     CrashLoopBackOff   6          10m
chatops-st2rulesengine-6c56b7f77d-sw85n       1/1     Running            6          10m
chatops-st2scheduler-fc4679475-7zvm8          0/1     CrashLoopBackOff   6          10m
chatops-st2scheduler-fc4679475-q4svk          0/1     CrashLoopBackOff   6          10m
chatops-st2sensorcontainer-6cccc7959b-gfvfx   1/1     Running            6          10m
chatops-st2stream-6c98b7db66-nqzqg            1/1     Running            6          10m
chatops-st2stream-6c98b7db66-qvw8t            1/1     Running            6          10m
chatops-st2timersengine-6d48d4755-qtdzd       1/1     Running            6          10m
chatops-st2workflowengine-c66cdf8d8-l8skk     0/1     CrashLoopBackOff   6          10m
chatops-st2workflowengine-c66cdf8d8-qtmvt     0/1     CrashLoopBackOff   6          10m

The errors in those pods' logs look similar to the following:

Traceback (most recent call last):
  File "/tmp/.instana/python/instana/__init__.py", line 194, in <module>
    boot_agent_later()
  File "/tmp/.instana/python/instana/__init__.py", line 112, in boot_agent_later
    Timer(2.0, boot_agent).start()
  File "/usr/lib64/python3.6/threading.py", line 851, in start
    self._started.wait()
  File "/usr/lib64/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/semaphore.py", line 115, in acquire
    hubs.get_hub().switch()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 284, in switch
    assert cur is not self.greenlet, 'Cannot switch to MAINLOOP from MAINLOOP'
AssertionError: Cannot switch to MAINLOOP from MAINLOOP

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2actions/cmd/actionrunner.py", line 67, in _run_worker
    action_worker.wait()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/transport/consumers.py", line 178, in wait
    self._consumer_thread.wait()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
    result = hub.switch()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 350, in run
    self.wait(sleep_time)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 80, in wait
    presult = self.do_poll(seconds)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
    return self.poll.poll(seconds)
SystemError: <built-in method poll of select.epoll object at 0x7f63418d02b8> returned a result with an error set

St2 version:

st2 3.2.0, on Python 3.6.8
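
Logs like the above can be pulled from the crashing pods with standard kubectl commands (a rough sketch; the pod name is taken from the listing above, and the namespace placeholder is an assumption):

# List the pods that are crashing for this release (namespace is an assumption)
kubectl get pods -n <namespace> | grep -E 'CrashLoopBackOff|Error'

# Logs from the last failed container of one of the crashing pods
kubectl logs -n <namespace> chatops-st2actionrunner-6d5dc6cbd8-9dbbg --previous

# Restart counts and events for the same pod
kubectl describe pod -n <namespace> chatops-st2actionrunner-6d5dc6cbd8-9dbbg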
arm4b commented 3 years ago

Thanks for the report.

I see it's running st2 3.2.0 on Python 3. We switched to Python 3 with st2 3.3dev, starting in v0.30.0 of the stackstorm-ha Helm chart. Which chart version are you running, and which Docker images are you relying on?

Can you try the latest stackstorm-ha Helm chart (v0.51.0) with st2 v3.4dev, installed from scratch with clean default values, and report back? That would help us understand whether or not it's related to the Python environment.
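
Something like this should give you a clean install from the official Helm repo (a minimal sketch, assuming Helm 3; the repo alias and release name are just examples):

# Add the official chart repo (URL from https://helm.stackstorm.com/) and refresh the index
helm repo add stackstorm https://helm.stackstorm.com/
helm repo update

# Fresh install of the latest chart with default values ("stackstorm-test" is an example release name)
helm install stackstorm-test stackstorm/stackstorm-ha --version 0.51.0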

I'm also wondering about this part:

Traceback (most recent call last):
  File "/tmp/.instana/python/instana/__init__.py", line 194, in <module>
    boot_agent_later()
  File "/tmp/.instana/python/instana/__init__.py", line 112, in boot_agent_later
    Timer(2.0, boot_agent).start()
  File "/usr/lib64/python3.6/threading.py", line 851, in start
    self._started.wait()

What is /tmp/.instana/, and how does it come into play here?
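
If you want to check where that path comes from, you could look for anything injecting the Instana agent into the pods, for example (a rough sketch; the pod name is copied from the listing above):

# Look for Instana-related env vars, volumes, or init containers injected into the pod spec
kubectl get pod chatops-st2actionrunner-6d5dc6cbd8-9dbbg -o yaml | grep -i instana

# Check for mutating admission webhooks that might be doing the injection
kubectl get mutatingwebhookconfigurations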

yypptest commented 3 years ago

The chart version is 0.26.0. I noticed this error in only one environment; we will try the suggested version when it's available.

cognifloyd commented 3 years ago

@yypptest how did your tests go?

cognifloyd commented 3 years ago

There has been no recent activity, so I'm going to assume that this has been resolved through chart, python, or system updates and close it.

Please reopen if you are still experiencing this issue.