datopian / ckan-cloud-operator

CKAN Cloud Operator manages, provisions and configures CKAN Cloud instances and related infrastructure.
MIT License

Solr pods are not stable on aks #123

Closed zelima closed 4 years ago

zelima commented 4 years ago

Solr pods keep restarting from time to time, leading to problems in deployment, e.g. a CKAN instance will fail to connect to Solr if one of the pods is in CrashLoopBackOff. Over the last 26 h, some of them have restarted more than 200 times:

kubectl get pods -n ckan-cloud
NAME                                                              READY   STATUS             RESTARTS   AGE
ckan-cloud-provider-db-proxy-azuresql-proxy-5c67484c4-82kp8       1/1     Running            0          26h
ckan-cloud-provider-solr-solrcloud-sc-3-d7bc4cc9f-nrpb2           1/1     Running            111        26h
ckan-cloud-provider-solr-solrcloud-sc-4-56f96f98c-swvwk           0/1     Running            133        26h
ckan-cloud-provider-solr-solrcloud-sc-5-584bb7ddd4-hgv6q          1/1     Running            136        26h
ckan-cloud-provider-solr-solrcloud-zk-0-8964b55b4-nf6hg           0/1     OOMKilled          177        26h
ckan-cloud-provider-solr-solrcloud-zk-1-55f6744d9f-d9blr          0/1     Running            232        26h
ckan-cloud-provider-solr-solrcloud-zk-2-7d4fd5fb-rv42c            0/1     CrashLoopBackOff   213        26h
ckan-cloud-provider-solr-solrcloud-zoonavigator-7bdd5575bf89ph4   2/2     Running            0          26h
router-traefik-instances-default-55cf88f7cd-29srv                 1/1     Running            0          20h


zelima commented 4 years ago

This is what I see from `kubectl describe`:

Events:
  Type     Reason     Age                    From                             Message
  ----     ------     ----                   ----                             -------
  Warning  Unhealthy  42m (x28 over 6d3h)    kubelet, aks-default-18083181-1  Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "argument list too long": unknown
  Warning  Unhealthy  25m (x2 over 3d)       kubelet, aks-default-18083181-1  Liveness probe failed: /usr/bin/zkOk.sh: line 21: /bin/nc: Cannot allocate memory
  Warning  Unhealthy  25m (x199 over 6d3h)   kubelet, aks-default-18083181-1  Readiness probe failed:
  Warning  Unhealthy  54s (x460 over 6d4h)   kubelet, aks-default-18083181-1  Liveness probe failed:
  Warning  BackOff    52s (x7574 over 6d4h)  kubelet, aks-default-18083181-1  Back-off restarting failed container

Seems like ZooKeeper has very little memory allocated.

@akariv I see resources for SOLR pods are updated in this commit https://github.com/datopian/ckan-cloud-operator/commit/22423feec2806ce6a0465c16e43ed5356cf32348#diff-167428a1f2e6a73b2e24fcc3c27fbd97L275

What do you think about making them configurable (with the current defaults), e.g. in interactive mode? Right now resources for both ZooKeeper and SolrCloud are hardcoded to 200Mi and 8GB respectively.

Something like

diff --git a/interactive.yaml b/interactive.yaml
index abec8fc..381200e 100644
--- a/interactive.yaml
+++ b/interactive.yaml
@@ -12,9 +12,11 @@ default:
       self-hosted: y
       num-shards: "1"
       replication-factor: "1"
+      sc-resources: '{"limits": {"memory": "1Gi"}, "requests": {"memory": "1Gi"}}'
+      zk-resources: '{"limits": {"memory": "1Gi", "cpu": "1"}, "requests": {"memory": "1Gi", "cpu": "0.5"}}'
     ckan-storage-config:
       default-storage-bucket: ckan
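To make these settings take effect, the operator would need to parse the JSON resource strings from the interactive config and merge them into the container spec it generates. A minimal sketch of that idea (function and constant names here are hypothetical illustrations, not the actual ckan-cloud-operator API):

```python
import json

# Hypothetical default matching the ZooKeeper value mentioned in the issue
# (200Mi memory); in practice the defaults would come from the operator's config.
DEFAULT_ZK_RESOURCES = '{"limits": {"memory": "200Mi"}, "requests": {"memory": "200Mi"}}'


def parse_resources(value, default):
    """Parse a JSON resources string from the interactive config,
    falling back to the hardcoded default when the key is unset."""
    return json.loads(value) if value else json.loads(default)


def container_spec(image, resources_json=None, default=DEFAULT_ZK_RESOURCES):
    """Build a Kubernetes container spec dict with the resolved resources."""
    return {
        "name": "zk",
        "image": image,
        "resources": parse_resources(resources_json, default),
    }


# With zk-resources set in interactive config, the limits are raised:
spec = container_spec(
    "zookeeper:3.4",
    '{"limits": {"memory": "1Gi", "cpu": "1"}, "requests": {"memory": "1Gi", "cpu": "0.5"}}',
)
print(spec["resources"]["limits"]["memory"])  # -> 1Gi
```

Validating the JSON at config time (rather than when the deployment is applied) would also surface typos like `requested` vs. the Kubernetes field name `requests` early.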
akariv commented 4 years ago

I hope this is indeed what's causing the restarts; in any case, I see no harm in making these configurable, so we can try.