DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Investigate and resolve metadig outage on production cluster #27

Closed amoeba closed 2 years ago

amoeba commented 2 years ago

Requests for Metadig quality reports, like https://docker-ucsb-4.dataone.org:30443/quality/runs/knb.suite.1/doi:10.5063/F1K9360F?_=1646952626669, have been returning 500s since at least Thursday morning (2022-03-10). Pod logs are reporting connection attempt failures initiated from edu.ucsb.nceas.mdqengine.store.DatabaseStore.

Example stack trace

```
org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
    at org.postgresql.Driver.makeConnection(Driver.java:454)
    at org.postgresql.Driver.connect(Driver.java:256)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:208)
    at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:85)
20220310-23:56:51: [ERROR]: org.postgresql.util.PSQLException: The connection attempt failed. [edu.ucsb.nceas.mdqengine.store.DatabaseStore]
    at edu.ucsb.nceas.mdqengine.store.DatabaseStore.<init>(DatabaseStore.java:54)
    at edu.ucsb.nceas.mdqengine.store.StoreFactory.getStore(StoreFactory.java:16)
    at edu.ucsb.nceas.mdq.rest.RunsResource.getRun(RunsResource.java:66)
    at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:660)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:798)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.postgresql.core.PGStream.<init>(PGStream.java:69)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:152)
    ... 57 more
edu.ucsb.nceas.mdqengine.exception.MetadigStoreException: Unable to create the database store.
    at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:90)
    at edu.ucsb.nceas.mdqengine.store.DatabaseStore.<init>(DatabaseStore.java:54)
    at edu.ucsb.nceas.mdqengine.store.StoreFactory.getStore(StoreFactory.java:16)
    at edu.ucsb.nceas.mdq.rest.RunsResource.getRun(RunsResource.java:66)
    at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:660)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:798)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
    at org.postgresql.Driver.makeConnection(Driver.java:454)
    at org.postgresql.Driver.connect(Driver.java:256)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:208)
    at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:85)
    ... 50 more
Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.postgresql.core.PGStream.<init>(PGStream.java:69)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:152)
    ... 57 more
```

The root cause of the failed connection appears to be the exception at the bottom of the output above:

Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local.

; kubectl get pods --namespace=metadig
NAME                                  READY   STATUS    RESTARTS      AGE
metadig-controller-7978fd8bb7-f94mp   1/1     Running   0             15d
metadig-scheduler-7b97494fc6-tfjr2    1/1     Running   0             2d1h
metadig-scorer-6dc58c6b7d-m68xq       1/1     Running   0             15d
metadig-worker-76c5884885-78n8r       1/1     Running   0             15d
metadig-worker-76c5884885-8zt9j       1/1     Running   0             15d
metadig-worker-76c5884885-bk6mz       1/1     Running   0             15d
metadig-worker-76c5884885-jct4r       1/1     Running   0             15d
metadig-worker-76c5884885-jfmnc       1/1     Running   0             15d
metadig-worker-76c5884885-kcl5t       1/1     Running   0             15d
metadig-worker-76c5884885-kszbs       1/1     Running   0             15d
metadig-worker-76c5884885-m8sfj       1/1     Running   0             15d
metadig-worker-76c5884885-rx566       1/1     Running   2 (15d ago)   15d
metadig-worker-76c5884885-zf5r2       1/1     Running   2 (15d ago)   15d
postgres-78477d4df8-h628z             2/2     Running   0             15d
rabbitmq-fc49dbd56-pbhm9              1/1     Running   0             15d
; kubectl get services --namespace=metadig
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
metadig-controller   ClusterIP   10.103.244.83   <none>        8080/TCP             37d
postgres             ClusterIP   10.109.33.23    <none>        5432/TCP,6432/TCP    65d
rabbitmq             ClusterIP   10.108.196.79   <none>        5672/TCP,15672/TCP   65d
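
The `UnknownHostException` can be checked outside the JVM with a plain name lookup from inside an affected pod. A minimal sketch, assuming a Python interpreter is available in the container (it may not be in the metadig images); `resolve` is a hypothetical helper for illustration, not part of metadig:

```python
import socket


def resolve(hostname, port=5432):
    """Return the sorted IP addresses the system resolver gives for hostname.

    An empty list means the lookup failed, which is the condition the JVM
    reports as java.net.UnknownHostException.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("postgres.metadig.svc.cluster.local")` and `resolve("postgres")` in the pod would show whether the system resolver fails for one or both forms of the name. The default port 5432 is taken from the service listing above.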
amoeba commented 2 years ago

I ran through https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ and found something interesting. When I exec into one of the pods that can't see Postgres and dump its resolv.conf,

; kubectl exec --namespace=metadig -it metadig-controller-7978fd8bb7-f94mp -- /bin/bash

bash-4.4# cat /etc/resolv.conf
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5

I get 10.96.0.10. That does appear to be the right cluster-local IP address,

kubectl get svc --namespace=kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   2y21d

But the above pod can't ping 10.96.0.10:

bash-4.4# ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
...100% packet loss
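
A failed ping to a Service ClusterIP is not conclusive on its own: ClusterIPs are virtual addresses implemented by kube-proxy rules and, depending on the proxy mode, may not answer ICMP even when the service is healthy. A TCP connection to the service port is a more direct test. A minimal sketch, assuming Python is available in the pod; `tcp_reachable` is a hypothetical helper:

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        # create_connection performs the full TCP handshake and raises OSError
        # (refused, unreachable, timed out) if it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_reachable("10.96.0.10", 53)` would test kube-dns over TCP (the service listing shows 53/TCP), and `tcp_reachable("10.109.33.23", 5432)` would test the postgres service directly by IP, bypassing DNS.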
nickatnceas commented 2 years ago

No dig available in that pod, but it looks like it can communicate with the DNS server via the limited nslookup command:

bash-4.4# nslookup postgres 10.96.0.10
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      postgres
Address 1: 10.109.33.23 postgres.metadig.svc.cluster.local
amoeba commented 2 years ago

Thanks for looking @nickatnceas. nslookup was what I ran before but with the name postgres.metadig.svc.cluster.local instead of postgres. It's not really clear to me what the difference is between postgres and postgres.metadig.svc.cluster.local.
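
On the difference between the two names: with `options ndots:5` in resolv.conf, both `postgres` (0 dots) and `postgres.metadig.svc.cluster.local` (4 dots) are below the threshold, so a conventional stub resolver appends the search domains first and only tries the name as given last. The sketch below lists the candidate names in order. It assumes the behaviour described in resolv.conf(5); real resolver libraries (glibc, musl, the JVM) can differ in detail, and `candidate_names` is a hypothetical helper:

```python
def candidate_names(name, search, ndots):
    """List the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as given last.
    return expanded + [absolute]


# Search list and ndots copied from the pod's /etc/resolv.conf shown earlier.
SEARCH = ["metadig.svc.cluster.local", "svc.cluster.local", "cluster.local", "dataone.org"]

for query in ("postgres", "postgres.metadig.svc.cluster.local"):
    print(query)
    for candidate in candidate_names(query, SEARCH, ndots=5):
        print("   ", candidate)
```

Under these assumptions the short name matches on its first candidate, while the long name goes through four suffixed lookups before being queried as given. Both should end at the same record if DNS is healthy.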

gothub commented 2 years ago

This problem is also occurring on the dev cluster. Connections from metadig containers to both the RabbitMQ and Postgres services are failing. It looks like k8s core-dns is providing the full DNS name, but it's not clear to me that the name is being resolved correctly. I ran the dns-utils tests that you mentioned and didn't see errors. k8s was upgraded recently, but calico networking was not, so I upgraded calico to v3.22.1 on dev k8s, which did not resolve the problem. I'll continue to look for problems with calico and/or core-dns.

BTW - we had a problem with calico a while ago related to how it identifies nodes in the k8s cluster. This doesn't seem to be the problem, but if anyone is interested, the issue was written up here: https://github.com/NCEAS/metadig-engine/issues/288.



gothub commented 2 years ago

On prod k8s, if I log in to the metadig-controller pod/container, I'm able to ping the metadig postgres pod/container IP, which is set up by calico:

ping 192.168.181.142

PING 192.168.181.142 (192.168.181.142): 56 data bytes
64 bytes from 192.168.181.142: seq=0 ttl=62 time=1.874 ms
64 bytes from 192.168.181.142: seq=1 ttl=62 time=1.126 ms

This tells me that the calico managed overlay network is working.

If I try to ping anything on the k8s 'service' network, I get no response. For example, pinging the kube-dns service from metadig-controller pod/container:

ping 10.108.92.153

PING 10.108.92.153 (10.108.92.153): 56 data bytes

Also, if I try to ping using the DNS name for the postgres service, the command hangs.

Connectivity was working on Wed. 3/9/22, so what could have changed? I didn't perform any upgrades last week.

Also, here is the /etc/resolv.conf from the metadig-controller container:

/usr/local/tomcat # cat /etc/resolv.conf
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5

Any ideas?
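
Pod IPs are routed by the CNI plugin (calico here), while Service ClusterIPs are virtual and rely on kube-proxy rules on each node. A working pod IP with a dead service IP may therefore point at kube-proxy rather than calico. The sketch below sorts the addresses seen in this thread into the two ranges. The CIDRs are assumptions (kubeadm's default service range and calico's default pod pool); the actual values for this cluster are not stated here, and `classify` is a hypothetical helper:

```python
import ipaddress

# Assumed ranges, not confirmed for this cluster:
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")   # kubeadm default service CIDR
POD_CIDR = ipaddress.ip_network("192.168.0.0/16")     # calico default pod pool


def classify(address):
    """Return 'service', 'pod', or 'other' for an IPv4 address string."""
    ip = ipaddress.ip_address(address)
    if ip in SERVICE_CIDR:
        return "service"
    if ip in POD_CIDR:
        return "pod"
    return "other"


for addr in ("192.168.181.142", "10.96.0.10", "10.109.33.23", "10.108.92.153"):
    print(addr, classify(addr))
```

If the cluster was set up with different ranges, the configured service CIDR and calico IP pool would need to be substituted for the constants above.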

at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:90) at edu.ucsb.nceas.mdqengine.store.DatabaseStore.(DatabaseStore.java:54) at edu.ucsb.nceas.mdqengine.store.StoreFactory.getStore(StoreFactory.java:16) at edu.ucsb.nceas.mdq.rest.RunsResource.getRun(RunsResource.java:66) at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.process(Errors.java:267) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) at 
org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:660) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:798) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808) at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) at org.postgresql.jdbc.PgConnection.(PgConnection.java:195) at org.postgresql.Driver.makeConnection(Driver.java:454) at org.postgresql.Driver.connect(Driver.java:256) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:208) at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:85) ... 50 more Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.postgresql.core.PGStream.(PGStream.java:69) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:152) ... 57 more

The root cause for the failed connection looks like the message at the bottom of the above output,

Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local.

; kubectl get pods --namespace=metadig
NAME                                  READY   STATUS    RESTARTS      AGE
metadig-controller-7978fd8bb7-f94mp   1/1     Running   0             15d
metadig-scheduler-7b97494fc6-tfjr2    1/1     Running   0             2d1h
metadig-scorer-6dc58c6b7d-m68xq       1/1     Running   0             15d
metadig-worker-76c5884885-78n8r       1/1     Running   0             15d
metadig-worker-76c5884885-8zt9j       1/1     Running   0             15d
metadig-worker-76c5884885-bk6mz       1/1     Running   0             15d
metadig-worker-76c5884885-jct4r       1/1     Running   0             15d
metadig-worker-76c5884885-jfmnc       1/1     Running   0             15d
metadig-worker-76c5884885-kcl5t       1/1     Running   0             15d
metadig-worker-76c5884885-kszbs       1/1     Running   0             15d
metadig-worker-76c5884885-m8sfj       1/1     Running   0             15d
metadig-worker-76c5884885-rx566       1/1     Running   2 (15d ago)   15d
metadig-worker-76c5884885-zf5r2       1/1     Running   2 (15d ago)   15d
postgres-78477d4df8-h628z             2/2     Running   0             15d
rabbitmq-fc49dbd56-pbhm9              1/1     Running   0             15d

; kubectl get services --namespace=metadig
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)              AGE
metadig-controller   ClusterIP   10.103.244.83    <none>        8080/TCP             37d
postgres             ClusterIP   10.109.33.23     <none>        5432/TCP,6432/TCP    65d
rabbitmq             ClusterIP   10.108.196.79    <none>        5672/TCP,15672/TCP   65d

-- Peter Slaughter, Software Engineer National Center for Ecological Analysis and Synthesis Santa Barbara, CA 93101

gothub commented 2 years ago

Hey @nick - when you have a moment, could you have a look at the entry above?

In particular, do you have any insight into why k8s DNS may not be resolving names? In June last year, we ran into a problem where the Ceph IP network (10.0.3.x) confused Calico into trying to add those addresses to the k8s overlay network. We reconfigured Calico to ignore them (described here).

Could this new issue be related to a recent change to Ceph? Any additional ideas on how to debug k8s DNS?

gothub commented 2 years ago

This still appears to be a problem with core-dns. If I log in to the metadig-controller pod, the following DNS query works:

# nslookup rabbitmq
nslookup: can't resolve '(null)': Name does not resolve

Name:      rabbitmq
Address 1: 10.109.162.239 rabbitmq.metadig.svc.cluster.local

The names rabbitmq.metadig and rabbitmq.metadig.svc work.

However, the fully qualified name 'rabbitmq.metadig.svc.cluster.local' does not resolve, but should:

# nslookup rabbitmq.metadig.svc.cluster.local
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'rabbitmq.metadig.svc.cluster.local': Name does not resolve

Also, external names such as cn.dataone.org do not resolve.

gothub commented 2 years ago

kubelet writes an /etc/resolv.conf into every pod/container that it creates. For pods in the metadig namespace, the resolv.conf is:

nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5

Removing dataone.org from the search line appears to have fixed the problem. For example, after the edit all of these names resolve, which they did not before:

www.google.com
cn.dataone.org
metadig-controller.metadig.svc.cluster.local
knb.ecoinformatics.org
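
To see why the search list matters here, the following is a minimal sketch of how a glibc-style stub resolver expands a name using the search and ndots settings from the resolv.conf above. The function name `candidate_queries` is hypothetical, no DNS lookups are performed, and other resolvers (musl, for example) differ in the details, so treat it as an illustration and not as a description of what these pods actually do.

```python
def candidate_queries(name, search, ndots):
    """List the query names a glibc-style resolver would try, in order.

    Hypothetical helper for illustration; it performs no DNS lookups.
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as given.
    return expanded + [absolute]


# Search list and ndots value copied from the metadig resolv.conf.
search = [
    "metadig.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "dataone.org",
]
for query in candidate_queries("rabbitmq.metadig.svc.cluster.local", search, ndots=5):
    print(query)
```

The fully qualified service name has only four dots, so with ndots:5 it is tried against every search domain, including dataone.org, before it is tried as given. The same is true for external names such as cn.dataone.org.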

There is a way to tell kubelet how to create the resolv.conf file; I just have to look that up and test it on the dev k8s cluster.

So, it has yet to be determined why this problem is happening now, but at least it should be easy to fix. This fix will be added to the k8s-config docs.

Customizing k8s dns is described here

gothub commented 2 years ago

One way to cleanly do this is to modify the deployment dnsPolicy. Here is the YAML version:

  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 10.96.0.10
    searches:
    - metadig.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    options:
      - name: ndots
        value: "5"

With helm, it should be possible to query and inject the nameserver IP, which is the IP of the core-dns service.
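
Outside of helm, the same fragment could be generated from a script. The sketch below builds a patch with the shape of the YAML above and prints it as JSON. The function `dns_patch` and its parameters are hypothetical, and it assumes the settings belong under the Deployment's pod template (spec.template.spec).

```python
import json


def dns_patch(nameserver, namespace, cluster_domain="cluster.local", ndots=5):
    """Build a Deployment patch that pins pod DNS settings.

    Hypothetical helper; the structure mirrors the dnsPolicy/dnsConfig YAML.
    """
    dns_config = {
        "nameservers": [nameserver],
        "searches": [
            f"{namespace}.svc.{cluster_domain}",
            f"svc.{cluster_domain}",
            cluster_domain,
        ],
        # Option values are given as strings, as in the YAML version.
        "options": [{"name": "ndots", "value": str(ndots)}],
    }
    pod_spec = {"dnsPolicy": "None", "dnsConfig": dns_config}
    return {"spec": {"template": {"spec": pod_spec}}}


# The nameserver IP is the one from the metadig resolv.conf.
print(json.dumps(dns_patch("10.96.0.10", "metadig"), indent=2))
```

Passing the nameserver in as a parameter keeps the hard-coded IP in one place, which is the same idea as injecting it with helm.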

gothub commented 2 years ago

Superseded and resolved in issue https://github.com/NCEAS/metadig-engine/issues/312

amoeba commented 2 years ago

Thanks for tracking this down @gothub. I'm a little concerned about the fix, though.

Mainly, shouldn't our CNI plugin just be handling this without any config? That hard-coded cluster-local IP address worries me and so does having to apply this bit of config to every service we run in the cluster. Am I interpreting your changes right?

gothub commented 2 years ago

Yes, I would expect DNS resolution just to work. So the concern I have is the root cause hasn't been identified yet.

I'm not familiar enough with DNS to know all the components that are involved with name resolution, so this list may not be accurate or complete, but one of these may be involved:

Regarding the k8s DNS service resolution, I'm working on a way to have helm query k8s and inject the DNS server IP into installations for our services. The DNS service IP appears to be determined from the k8s api server:

$ ps -aef | grep kube
kube-apiserver ...  --service-cluster-ip-range=10.96.0.0/12 ...

I don't think that the DNS IP changes unless the kubeadm-configured k8s installation changes. If it did change spontaneously, e.g. during a service restart, then running containers might fail to resolve names.
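
As a rough check of how the DNS IP relates to that range: my understanding is that kubeadm assigns the tenth address of the service CIDR to the cluster DNS service. That convention is an assumption here, not something confirmed in this thread, and the helper name `kubeadm_dns_ip` is hypothetical.

```python
import ipaddress


def kubeadm_dns_ip(service_cidr):
    """Return the tenth address of the service CIDR.

    Assumes the kubeadm convention for the cluster DNS service IP.
    """
    return str(ipaddress.ip_network(service_cidr)[10])


# Service range taken from the kube-apiserver command line above.
print(kubeadm_dns_ip("10.96.0.0/12"))  # prints 10.96.0.10
```

The result matches the nameserver entry in the pods' resolv.conf, which is consistent with the IP staying fixed as long as the service range does.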

If anyone identifies the root cause and/or an alternative fix, then we can implement that.

amoeba commented 2 years ago

I think it's helpful to have some more info here, so I looked at Slinky. Its resolv.conf looks similar to the ones we find in pods in the metadig namespace:

root@worker-default-84894d4549-zpfww:/web# cat /etc/resolv.conf
nameserver 10.96.0.10
search slinky.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5

And DNS resolution via local names works. Here's me resolving the name "redis" with Python since that's all I have on this container:

>>> socket.getaddrinfo("redis", 0)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.97.173.162', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.97.173.162', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.97.173.162', 0))]

Like we're seeing on metadig, resolving the FQDN doesn't work:

>>> socket.getaddrinfo("redis.svc.cluster.local", 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname
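
For repeating this kind of check on containers that only have Python, a small helper along these lines might save some typing. It is only a sketch: `check_names` is a hypothetical name, and the list of names at the bottom is an assumption based on the search list in the resolv.conf above.

```python
import socket


def check_names(names, resolve=socket.getaddrinfo):
    """Map each name to its resolved addresses, or to the error message."""
    results = {}
    for name in names:
        try:
            infos = resolve(name, 0)
        except socket.gaierror as err:
            results[name] = f"FAILED: {err}"
        else:
            # Each entry is (family, type, proto, canonname, sockaddr).
            results[name] = sorted({info[4][0] for info in infos})
    return results


if __name__ == "__main__":
    # Assumed names: short, namespace-qualified, and external.
    names = ["redis", "redis.slinky.svc.cluster.local", "cn.dataone.org"]
    for name, outcome in check_names(names).items():
        print(name, "->", outcome)
```

The resolve parameter exists only so the function can be exercised without a network; by default it calls socket.getaddrinfo exactly as in the session above.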