Closed amoeba closed 2 years ago
I ran through https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ and found something interesting. When I exec into one of the pods that can't see Postgres and dump its resolv.conf,
; kubectl exec --namespace=metadig -it metadig-controller-7978fd8bb7-f94mp -- /bin/bash
bash-4.4# cat /etc/resolv.conf
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5
I get 10.96.0.10. That does appear to be the right cluster-local IP address,
kubectl get svc --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 2y21d
But the above pod can't ping 10.96.0.10:
bash-4.4# ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
...100% packet loss
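A failed ping to a Service ClusterIP is not necessarily conclusive on its own: a ClusterIP is a virtual address, and depending on how kube-proxy is configured it may not answer ICMP even when the service ports themselves work. A TCP connection attempt against a service port can be a more telling check. A minimal sketch, assuming a Python interpreter is available in the pod (not every image has one); `can_connect` is a hypothetical helper name, not an existing tool:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs a full TCP handshake, which exercises
        # the same path an application connection to the service would use.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("10.96.0.10", 53)` would test the TCP port that kube-dns lists above.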
No dig available in that pod, but it looks like it can communicate with the DNS server via the limited nslookup command:
bash-4.4# nslookup postgres 10.96.0.10
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: postgres
Address 1: 10.109.33.23 postgres.metadig.svc.cluster.local
Thanks for looking @nickatnceas. nslookup was what I ran before, but with the name postgres.metadig.svc.cluster.local instead of postgres. It's not really clear to me what the difference is between postgres and postgres.metadig.svc.cluster.local.
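One difference between the two names is how the resolver expands them. With `options ndots:5`, a name containing fewer than five dots is first tried with each `search` suffix appended and only then as written; postgres.metadig.svc.cluster.local has four dots, so it goes through the search list too. The sketch below lists the candidate queries in the order described in resolv.conf(5) for the glibc resolver. Other resolvers (for example musl in Alpine-based images) may behave differently, and `candidate_names` is a hypothetical helper, not part of any library:

```python
SEARCH = ["metadig.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "dataone.org"]
NDOTS = 5


def candidate_names(name, search=SEARCH, ndots=NDOTS):
    """List the fully qualified names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then the name as written.
    return expanded + [absolute]


for query in candidate_names("postgres.metadig.svc.cluster.local"):
    print(query)
```

Under this model the short name postgres matches on its first candidate, while the long name is tried as written only after the four suffixed candidates, including one under dataone.org.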
This problem is also occurring on the dev cluster. Connections from metadig containers to both RabbitMQ and Postgres services are failing. It looks like k8s core-dns is providing the full DNS name, but it's not clear to me that the name is being resolved correctly. I ran the dns-utils tests that you mentioned and didn't see errors. k8s has been upgraded recently, but calico networking was not. So, I upgraded calico to v3.22.1 on dev k8s, which did not resolve the problem. I'll continue to look for problems with calico and/or core-dns.
BTW - we had a problem with calico a while ago related to how it identifies nodes in the k8s cluster. This doesn't seem to be the problem, but if anyone is interested, the issue was written up here: https://github.com/NCEAS/metadig-engine/issues/288.
On Thu, Mar 10, 2022 at 4:37 PM Bryce Mecum @.***> wrote:
Requests for Metadig quality reports, like https://docker-ucsb-4.dataone.org:30443/quality/runs/knb.suite.1/doi:10.5063/F1K9360F?_=1646952626669 are returning 500s since at least Thursday morning (2022-03-10). Pod logs are reporting connection attempt failures initiated from edu.ucsb.nceas.mdqengine.store.DatabaseStore. Example stack trace
org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) at org.postgresql.jdbc.PgConnection.
(PgConnection.java:195) at org.postgresql.Driver.makeConnection(Driver.java:454) at org.postgresql.Driver.connect(Driver.java:256) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:208) at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:85) 20220310-23:56:51: [ERROR]: org.postgresql.util.PSQLException: The connection attempt failed. [edu.ucsb.nceas.mdqengine.store.DatabaseStore] at edu.ucsb.nceas.mdqengine.store.DatabaseStore. (DatabaseStore.java:54) at edu.ucsb.nceas.mdqengine.store.StoreFactory.getStore(StoreFactory.java:16) at edu.ucsb.nceas.mdq.rest.RunsResource.getRun(RunsResource.java:66) at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) at 
org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.process(Errors.java:267) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:660) at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:798) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.postgresql.core.PGStream. (PGStream.java:69) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:152) ... 57 more edu.ucsb.nceas.mdqengine.exception.MetadigStoreException: Unable to create the database store. at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:90) at edu.ucsb.nceas.mdqengine.store.DatabaseStore. 
(DatabaseStore.java:54) at edu.ucsb.nceas.mdqengine.store.StoreFactory.getStore(StoreFactory.java:16) at edu.ucsb.nceas.mdq.rest.RunsResource.getRun(RunsResource.java:66) at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.process(Errors.java:267) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) at 
org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:660) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:798) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) at 
org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) at org.postgresql.jdbc.PgConnection. (PgConnection.java:195) at org.postgresql.Driver.makeConnection(Driver.java:454) at org.postgresql.Driver.connect(Driver.java:256) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:208) at edu.ucsb.nceas.mdqengine.store.DatabaseStore.init(DatabaseStore.java:85) ... 50 more Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.postgresql.core.PGStream. (PGStream.java:69) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:152) ... 57 more The root cause for the failed connection looks like the message at the bottom of the above output,
Caused by: java.net.UnknownHostException: postgres.metadig.svc.cluster.local.
; kubectl get pods --namespace=metadig
NAME READY STATUS RESTARTS AGE
metadig-controller-7978fd8bb7-f94mp 1/1 Running 0 15d
metadig-scheduler-7b97494fc6-tfjr2 1/1 Running 0 2d1h
metadig-scorer-6dc58c6b7d-m68xq 1/1 Running 0 15d
metadig-worker-76c5884885-78n8r 1/1 Running 0 15d
metadig-worker-76c5884885-8zt9j 1/1 Running 0 15d
metadig-worker-76c5884885-bk6mz 1/1 Running 0 15d
metadig-worker-76c5884885-jct4r 1/1 Running 0 15d
metadig-worker-76c5884885-jfmnc 1/1 Running 0 15d
metadig-worker-76c5884885-kcl5t 1/1 Running 0 15d
metadig-worker-76c5884885-kszbs 1/1 Running 0 15d
metadig-worker-76c5884885-m8sfj 1/1 Running 0 15d
metadig-worker-76c5884885-rx566 1/1 Running 2 (15d ago) 15d
metadig-worker-76c5884885-zf5r2 1/1 Running 2 (15d ago) 15d
postgres-78477d4df8-h628z 2/2 Running 0 15d
rabbitmq-fc49dbd56-pbhm9 1/1 Running 0 15d
; kubectl get services --namespace=metadig
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
metadig-controller ClusterIP 10.103.244.83 <none> 8080/TCP 37d
postgres ClusterIP 10.109.33.23 <none> 5432/TCP,6432/TCP 65d
rabbitmq ClusterIP 10.108.196.79 <none> 5672/TCP,15672/TCP 65d
-- Peter Slaughter, Software Engineer National Center for Ecological Analysis and Synthesis Santa Barbara, CA 93101
On prod k8s, if I log in to the metadig-controller pod/container, I'm able to ping the metadig postgres pod/container IP, which is set up by calico:
PING 192.168.181.142 (192.168.181.142): 56 data bytes
64 bytes from 192.168.181.142: seq=0 ttl=62 time=1.874 ms
64 bytes from 192.168.181.142: seq=1 ttl=62 time=1.126 ms
This tells me that the calico-managed overlay network is working.
If I try to ping anything on the k8s 'service' network, I get no response. For example, pinging the kube-dns service from the metadig-controller pod/container:
PING 10.108.92.153 (10.108.92.153): 56 data bytes
Also, if I try to ping using the DNS name for the postgres service, the command hangs.
Connectivity was working on Wed. 3/9/22, so what could have changed? I didn't perform any upgrades last week.
Also, here is the /etc/resolv.conf from the metadig-controller container:
/usr/local/tomcat # cat /etc/resolv.conf
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5
Any ideas?
Hey @nick - when you have a moment, could you have a look at the entry above?
In particular, do you have any insight into why k8s DNS may not be resolving names? In June last year, we ran into a problem where the Ceph IP network (10.0.3.x) was confusing calico into trying to add those addresses to the k8s overlay network. We re-configured calico to ignore those (described here).
Could this new issue be related to a recent change to Ceph? Any additional ideas on how to debug k8s DNS?
This still appears to be a problem with core-dns. If I log in to the metadig-controller pod, the following DNS query works:
# nslookup rabbitmq
nslookup: can't resolve '(null)': Name does not resolve
Name: rabbitmq
Address 1: 10.109.162.239 rabbitmq.metadig.svc.cluster.local
The names rabbitmq.metadig and rabbitmq.metadig.svc work.
However, the following fully qualified DNS name, rabbitmq.metadig.svc.cluster.local, does not resolve, but should:
# nslookup rabbitmq.metadig.svc.cluster.local
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'rabbitmq.metadig.svc.cluster.local': Name does not resolve
Also, external names do not resolve, such as cn.dataone.org.
kubelet creates an /etc/resolv.conf in every pod/container that it creates. For metadig namespace pods, the resolv.conf is:
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5
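To compare this file across pods or namespaces, it can help to parse it rather than read it by eye. A minimal sketch, assuming only the three directives shown above (`nameserver`, `search`, `options`) matter; `parse_resolv_conf` is a hypothetical helper, not a library function:

```python
def parse_resolv_conf(text):
    """Parse resolv.conf text into nameservers, search domains, and options."""
    config = {"nameservers": [], "search": [], "options": {}}
    for raw_line in text.splitlines():
        # Drop anything after '#' or ';' and trim surrounding whitespace.
        line = raw_line.split("#", 1)[0].split(";", 1)[0].strip()
        if not line:
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            # A later search line replaces an earlier one.
            config["search"] = values
        elif keyword == "options":
            for option in values:
                # Options look like "ndots:5"; flags without a value map to "".
                name, _, value = option.partition(":")
                config["options"][name] = value
    return config


RESOLV_CONF = """\
nameserver 10.96.0.10
search metadig.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5
"""

parsed = parse_resolv_conf(RESOLV_CONF)
print(parsed["search"])
```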
Removing dataone.org from the search line of the file appears to have fixed the problem. For example, after the edit, all of these names can be resolved (and could not be before):
www.google.com
cn.dataone.org
metadig-controller.metadig.svc.cluster.local
knb.ecoinformatics.org
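A before/after comparison like this can be scripted so the same list is checked each time. A minimal sketch using Python's `socket.getaddrinfo`; `check_names` and `report` are hypothetical helpers, and the resolver is passed in as a parameter only so the function can be exercised without a cluster:

```python
import socket

NAMES = [
    "www.google.com",
    "cn.dataone.org",
    "metadig-controller.metadig.svc.cluster.local",
    "knb.ecoinformatics.org",
]


def check_names(names, resolve=socket.getaddrinfo):
    """Map each name to its sorted resolved addresses, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            infos = resolve(name, None)
        except socket.gaierror:
            results[name] = None
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        results[name] = sorted({info[4][0] for info in infos})
    return results


def report(results):
    """Print one line per name: the addresses found, or a failure marker."""
    for name, addresses in results.items():
        status = ", ".join(addresses) if addresses else "DOES NOT RESOLVE"
        print(f"{name}: {status}")
```

Running `report(check_names(NAMES))` inside a pod before and after editing /etc/resolv.conf would show which names are affected.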
There is a way to tell kubelet how to create the resolv.conf file; I just have to look that up and test it on the dev k8s cluster.
So, it has yet to be determined why this problem is happening now, but at least it should be easy to fix. This fix will be added to the k8s-config docs.
Customizing k8s dns is described here
One way to cleanly do this is to modify the deployment dnsPolicy. Here is the YAML version:
dnsPolicy: None
dnsConfig:
  nameservers:
    - 10.96.0.10
  searches:
    - metadig.svc.cluster.local
    - svc.cluster.local
    - cluster.local
  options:
    - name: ndots
      value: "5"
With helm, it should be possible to query and inject the nameserver IP, which is the IP of the core-dns service.
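Independently of helm, the same settings could be generated per namespace instead of written by hand. A minimal sketch that builds the pod-template fragment as a Python dict and prints it as JSON; the function name `dns_patch` is hypothetical, the cluster domain is assumed to be cluster.local as in the YAML above, and the nameserver IP is passed in rather than discovered:

```python
import json


def dns_patch(namespace, nameserver, ndots=5):
    """Build a Deployment patch that pins pod DNS settings for one namespace."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # With dnsPolicy "None" the pod's DNS settings come
                    # entirely from dnsConfig.
                    "dnsPolicy": "None",
                    "dnsConfig": {
                        "nameservers": [nameserver],
                        "searches": [
                            f"{namespace}.svc.cluster.local",
                            "svc.cluster.local",
                            "cluster.local",
                        ],
                        # Option values are strings in the pod spec.
                        "options": [{"name": "ndots", "value": str(ndots)}],
                    },
                }
            }
        }
    }


print(json.dumps(dns_patch("metadig", "10.96.0.10"), indent=2))
```

The printed JSON has the shape of a patch against a Deployment's pod template, with the same search list as above and no dataone.org entry.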
Superseded and resolved in issue https://github.com/NCEAS/metadig-engine/issues/312
Thanks for tracking this down @gothub. I'm a little concerned about the fix, though.
Mainly, shouldn't our CNI plugin just be handling this without any config? That hard-coded cluster-local IP address worries me and so does having to apply this bit of config to every service we run in the cluster. Am I interpreting your changes right?
Yes, I would expect DNS resolution just to work. So the concern I have is the root cause hasn't been identified yet.
I'm not familiar enough with DNS to know all the components that are involved with name resolution, so this list may not be accurate or complete, but one of these may be involved:
rabbitmq.metadig.svc.cluster.local
Regarding the k8s DNS service resolution, I'm working on a way to have helm query k8s and inject the DNS server IP into installations for our services. The DNS service IP appears to be determined from the k8s api server:
$ ps -aef | grep kube
kube-apiserver ... --service-cluster-ip-range=10.96.0.0/12 ...
I don't think that the DNS IP changes unless the kubeadm-configured k8s installation changes. If it did change spontaneously, e.g. during a service restart, then all running containers might not resolve correctly.
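On kubeadm-built clusters the DNS service IP appears to be derived from that range: kubeadm assigns the tenth address of the service subnet to the DNS service, which matches 10.96.0.10 for 10.96.0.0/12. A small sketch of that arithmetic with Python's `ipaddress` module; `kubeadm_dns_ip` is a hypothetical helper, and clusters set up by other tools may pick a different address:

```python
import ipaddress


def kubeadm_dns_ip(service_cidr):
    """Return the tenth address of the service subnet (kubeadm's DNS IP convention)."""
    network = ipaddress.ip_network(service_cidr)
    # Indexing a network yields the address at that offset from its base.
    return network[10]


print(kubeadm_dns_ip("10.96.0.0/12"))  # prints 10.96.0.10
```

If that convention holds for this cluster, the address would only move if the service CIDR itself were changed.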
If anyone identifies the root cause and/or an alternative fix, then we can implement that.
I think it's helpful to have some more info here so I looked at Slinky. Its resolv.conf looks similar to the ones we find in pods in the metadig namespace:
root@worker-default-84894d4549-zpfww:/web# cat /etc/resolv.conf
nameserver 10.96.0.10
search slinky.svc.cluster.local svc.cluster.local cluster.local dataone.org
options ndots:5
And DNS resolution via local names works. Here's me resolving the name "redis" with Python since that's all I have on this container:
>>> socket.getaddrinfo("redis", 0)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.97.173.162', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.97.173.162', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.97.173.162', 0))]
Like we're seeing on metadig, resolving the FQDN doesn't work:
>>> socket.getaddrinfo("redis.svc.cluster.local", 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname
Requests for Metadig quality reports, like https://docker-ucsb-4.dataone.org:30443/quality/runs/knb.suite.1/doi:10.5063/F1K9360F?_=1646952626669 are returning 500s since at least Thursday morning (2022-03-10). Pod logs are reporting connection attempt failures initiated from edu.ucsb.nceas.mdqengine.store.DatabaseStore.
Example stack trace
``` org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) at org.postgresql.jdbc.PgConnection.The root cause for the failed connection looks like the message at the bottom of the above output,