VertaAI / modeldb

Open Source ML Model Versioning, Metadata, and Experiment Management
Apache License 2.0

Installing via helm - backend unable to connect to postgres #848

Open yixu34 opened 4 years ago

yixu34 commented 4 years ago

I've just pulled master and I'm on f8780ecace643dcba180d670d2f7dc5d68451a7f (this was after the helm chart split, which I noticed in some of the 2.x tagged versions previously). I tried installing on our k8s cluster via helm, and the backend container of the backend service is giving an error:

{"thread":"main","level":"DEBUG","loggerName":"ai.verta.modeldb.utils.ModelDBHibernateUtil","message":"ModelDBHibernateUtil getSessionFactory() retrying for DB connection after 2560 millisecond ","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","instant":{"epochSecond":1592876745,"nanoOfSecond":869000000},"threadId":1,"threadPriority":5,"hostName":"modeldb-staging-f8780e-backend-0","kubernetes.podIP":""}

{"thread":"main","level":"WARN","loggerName":"ai.verta.modeldb.utils.ModelDBHibernateUtil","message":"ModelDBHibernateUtil checkDBConnection() got error ","thrown":{"commonElementCount":0,"localizedMessage":"The connection attempt failed.","message":"The connection attempt failed.","name":"org.postgresql.util.PSQLException","cause":{"commonElementCount":19,"localizedMessage":"modeldb-postgresql","message":"modeldb-postgresql","name":"java.net.UnknownHostException","extendedStackTrace":"java.net.UnknownHostException: modeldb-postgresql
    at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567) ~[?:?]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:333) ~[?:?]
    at java.net.Socket.connect(Socket.java:648) ~[?:?]
    at org.postgresql.core.PGStream.<init>(PGStream.java:75) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:91) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192) ~[postgresql-42.2.6.jar!/:42.2.6]
"},"extendedStackTrace":"org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:292) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.Driver.makeConnection(Driver.java:458) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.Driver.connect(Driver.java:260) ~[postgresql-42.2.6.jar!/:42.2.6]
    at java.sql.DriverManager.getConnection(DriverManager.java:677) ~[java.sql:?]
    at java.sql.DriverManager.getConnection(DriverManager.java:228) ~[java.sql:?]
    at ai.verta.modeldb.utils.ModelDBHibernateUtil.checkDBConnection(ModelDBHibernateUtil.java:483) [classes!/:1.0-SNAPSHOT]
    at ai.verta.modeldb.utils.ModelDBHibernateUtil.checkDBConnectionInLoop(ModelDBHibernateUtil.java:325) [classes!/:1.0-SNAPSHOT]
    at ai.verta.modeldb.utils.ModelDBHibernateUtil.createOrGetSessionFactory(ModelDBHibernateUtil.java:240) [classes!/:1.0-SNAPSHOT]
    at ai.verta.modeldb.App.initializeServicesBaseOnDataBase(App.java:363) [classes!/:1.0-SNAPSHOT]
    at ai.verta.modeldb.App.main(App.java:260) [classes!/:1.0-SNAPSHOT]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:564) ~[?:?]
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48) [modeldb-1.0-SNAPSHOT-client-build.jar:1.0-SNAPSHOT]
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) [modeldb-1.0-SNAPSHOT-client-build.jar:1.0-SNAPSHOT]
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:50) [modeldb-1.0-SNAPSHOT-client-build.jar:1.0-SNAPSHOT]
    at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:58) [modeldb-1.0-SNAPSHOT-client-build.jar:1.0-SNAPSHOT]
Caused by: java.net.UnknownHostException: modeldb-postgresql
    at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567) ~[?:?]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:333) ~[?:?]
    at java.net.Socket.connect(Socket.java:648) ~[?:?]
    at org.postgresql.core.PGStream.<init>(PGStream.java:75) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:91) ~[postgresql-42.2.6.jar!/:42.2.6]
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192) ~[postgresql-42.2.6.jar!/:42.2.6]
    ... 19 more
"},"endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","instant":{"epochSecond":1592876745,"nanoOfSecond":869000000},"threadId":1,"threadPriority":5,"hostName":"modeldb-staging-f8780e-backend-0","kubernetes.podIP":""}

Is there something that's not working out of the box with the helm charts? Or did I not configure the secrets correctly? All I did was helm install modeldb-staging-f8780e . --namespace <our namespace>. Thanks!

conradoverta commented 4 years ago

Hi, @yixu34! Thanks for reaching out.

According to the logs, modeldb-postgresql does not appear to exist as a service. Could you run kubectl get svc --namespace <your namespace> to check? For modeldb, you should see something like the output below (this is from a fresh install I did this morning with the charts):

$ k get svc
NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
kubernetes                    ClusterIP   10.96.0.1       <none>        443/TCP                      10h
modeldb-backend               ClusterIP   10.96.36.191    <none>        8085/TCP,8086/TCP,3000/TCP   10h
modeldb-graphql               ClusterIP   10.96.212.125   <none>        3000/TCP                     10h
modeldb-postgresql            ClusterIP   10.96.122.16    <none>        5432/TCP                     10h
modeldb-postgresql-headless   ClusterIP   None            <none>        5432/TCP                     10h
modeldb-webapp                ClusterIP   10.96.65.187    <none>        3000/TCP                     10h

I imagine maybe something was off during the installation. If you could share the services you have, I can help you debug what happened.

yixu34 commented 4 years ago

Here are my services:

$ kubectl get svc | grep modeldb
modeldb-staging-f8780e-backend               ClusterIP   100.71.14.248    <none>        8085/TCP,8086/TCP,3000/TCP   150m
modeldb-staging-f8780e-graphql               ClusterIP   100.67.143.195   <none>        3000/TCP                     150m
modeldb-staging-f8780e-postgresql            ClusterIP   100.68.84.251    <none>        5432/TCP                     150m
modeldb-staging-f8780e-postgresql-headless   ClusterIP   None             <none>        5432/TCP                     150m
modeldb-staging-f8780e-webapp                ClusterIP   100.68.70.21     <none>        3000/TCP                     150m

I think I might see what the problem is, then: it looks like modeldb-postgresql is a hardcoded value, by way of the backend values.yaml, on lines 67 and 81. I suppose I can either install the helm chart with --name modeldb, or change those values. Ideally, one would have a dependency on the other, or read from some common place, right?
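
To make the mismatch concrete, here is a minimal Python sketch. It assumes Service names follow the "<release>-<component>" pattern visible in the two kubectl listings above; the helper names are hypothetical and this is not the chart's actual template logic.

```python
# Sketch only: models the "<release>-<component>" naming pattern seen in the
# `kubectl get svc` listings in this thread, not the chart's real templates.

HARDCODED_DB_HOST = "modeldb-postgresql"  # host the backend tried to resolve


def service_name(release: str, component: str) -> str:
    """Hypothetical helper: Service name for one component of a release."""
    return f"{release}-{component}"


def db_host_matches(release: str) -> bool:
    """True when the hardcoded host equals the Service the release creates."""
    return service_name(release, "postgresql") == HARDCODED_DB_HOST


if __name__ == "__main__":
    for release in ("modeldb", "modeldb-staging-f8780e"):
        print(release, service_name(release, "postgresql"), db_host_matches(release))
```

Under that assumption, the hardcoded host only lines up when the release itself is named modeldb, which matches what I'm seeing.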

yixu34 commented 4 years ago

Ok, changing --name modeldb for the release did the trick. I've port forwarded the webapp to my localhost:3000, and I think the only remaining problem is that on the 'Repositories' page, I see a 504 error to http://localhost:3000/api/v1/graphql/query. My guess is that there's a typo here with the double dash. It seems like it should be value: "modeldb-backend:8085" instead of value: "modeldb--backend:8085". I can contribute a PR if that's the case.
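
For what it's worth, this kind of mismatch can be caught by comparing each configured address against the Service names that actually exist. Below is a rough Python sketch of that check; the function names are made up for illustration, and the Service set is simply copied from the listing for a release named modeldb earlier in this thread.

```python
from __future__ import annotations

# Service names taken from the `k get svc` listing for a release named "modeldb".
SERVICES = {
    "modeldb-backend",
    "modeldb-graphql",
    "modeldb-postgresql",
    "modeldb-postgresql-headless",
    "modeldb-webapp",
}


def host_of(address: str) -> str:
    """Return the host part of a "host:port" address (or the input if no port)."""
    host, _, _port = address.rpartition(":")
    return host or address


def unknown_hosts(addresses: dict[str, str], services: set[str]) -> dict[str, str]:
    """Hypothetical check: keep only the settings whose host is not a known Service."""
    return {
        name: addr
        for name, addr in addresses.items()
        if host_of(addr) not in services
    }


if __name__ == "__main__":
    # The double-dash value is reported; the corrected value is not.
    print(unknown_hosts({"MDB_ADDRESS": "modeldb--backend:8085"}, SERVICES))
    print(unknown_hosts({"MDB_ADDRESS": "modeldb-backend:8085"}, SERVICES))
```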

conradoverta commented 4 years ago

Nice catch. That does look like a typo and you are right that some of the names appear to be hardcoded (both in the DB reference and the graphql config). It should be based off the name of the release everywhere to avoid this situation. I'd definitely appreciate a PR with fixes!

yixu34 commented 4 years ago

Ok cool, but let me make sure I have everything working first 😅 In addition to removing the double dash, I had to move the {{- if .Values.env }} on line 36 to below line 41. I noticed that this was preventing the MDB_ADDRESS and QUERY_PATH environment variables from even being set. I then went back to the 'Repositories' page, which then fires off a request to http://localhost:3000/api/v1/graphql/query. I still see a 504, with the error being Error occured while trying to proxy to: modeldb-backend:3000/query. I'm not sure why this is happening, because the webapp redirects all api/v1/graphql/ routes to the graphQL service. The graphQL service then uses MDB_ADDRESS, which I've now (correctly?) set to modeldb-backend:8085. So I'm not sure why it's trying to forward the request to port 3000 instead. Here are the environment variables when I describe the graphQL pod, by the way:

Environment:
      MDB_ADDRESS:  modeldb-backend:8085
      QUERY_PATH:   /api/v1/graphql/query

conradoverta commented 4 years ago

Ok, I think I was able to narrow down what happened.

First, the webapp logs were a bit misleading because BACKEND_API_DOMAIN was misconfigured. https://github.com/VertaAI/modeldb/pull/853 is fixing that. It doesn't affect correctness, but it does affect the logs in the OSS component.

After that, I noticed that the graphql service was serving on port 4000, but the whole setup assumed it was on port 3000. The reason for this mismatch is that internally our services default to port 3000 for the exposed layer, but we had to move to 4000 to avoid a port collision in docker compose and keep things simple for users. So the deployment template for graphql should have

           - name: QUERY_PATH
             value: "/api/v1/graphql/query"
+          - name: SERVER_HTTP_PORT
+            value: "3000"

which will set the correct port. This should resolve the issue you're seeing. I'm not sure how I missed that earlier. Could you double check?
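
As a rough illustration of why the variable matters, the Python sketch below models a server that reads SERVER_HTTP_PORT and otherwise falls back to 4000. The fallback value is inferred from the behavior described above, and the function name is hypothetical; this is not the graphql service's actual code.

```python
DEFAULT_PORT = 4000   # port the graphql service was observed serving on
EXPECTED_PORT = 3000  # port the Service and the rest of the setup assume


def listen_port(env: dict) -> int:
    """Hypothetical: port the server binds to, given its environment variables."""
    return int(env.get("SERVER_HTTP_PORT", DEFAULT_PORT))


if __name__ == "__main__":
    # Without the variable, the server and the Service disagree on the port.
    print(listen_port({}) == EXPECTED_PORT)
    # With the variable from the diff above, they agree.
    print(listen_port({"SERVER_HTTP_PORT": "3000"}) == EXPECTED_PORT)
```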

I appreciate the help debugging while we open more of our platform! Our SaaS runs with a very specific configuration, so we need to reconsolidate progressively as we keep moving new parts to the open world. Our end-to-end CI is not fully compatible with the open version, but it's coming!