cloud-py-api / app_api

Nextcloud AppAPI
https://apps.nextcloud.com/apps/app_api
GNU Affero General Public License v3.0

After deployment of ExApp "Test Deploy" (nc_app_test-deploy) returns/shows: "Heartbeat check failed" and "Healthchecking" #300

Open architectonio opened 3 months ago

architectonio commented 3 months ago

Describe the bug

After deploying the ExApp "Test Deploy", the NextCloud External App Admin Interface shows an infinite "Healthchecking" loop as well as "Heartbeat check failed".

Steps/Code to Reproduce

Deploy the "Test Deploy" on NextCloud

Expected Results

Deployed without any issue

Actual Results

The NextCloud External App Admin Interface shows an infinite "Healthchecking" loop as well as "Heartbeat check failed".

Setup configuration

Software

Hardware

result of: docker logs nc_app_test-deploy

Started
INFO: Started server process [1]
INFO: Waiting for application startup.
TRACE: ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
TRACE: ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE: ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23000 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Waiting for application shutdown.
TRACE: ASGI [1] Receive {'type': 'lifespan.shutdown'}
TRACE: ASGI [1] Send {'type': 'lifespan.shutdown.complete'}
TRACE: ASGI [1] Completed
INFO: Application shutdown complete.
INFO: Finished server process [1]
Started
INFO: Started server process [1]
INFO: Waiting for application startup.
TRACE: ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
TRACE: ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE: ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23000 (Press CTRL+C to quit)

result of: docker volume inspect nc_app_test-deploy_data

[
    {
        "CreatedAt": "2024-05-28T10:36:06+02:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/nc_app_test-deploy_data/_data",
        "Name": "nc_app_test-deploy_data",
        "Options": null,
        "Scope": "local"
    }
]

bigcat88 commented 2 months ago

This is the Public IP Address I got (for today) by my ISP, on which points my domain

I understand that, but those DNS names should resolve to local addresses (Docker should do that), not to your public address. They can resolve to your public address only when "network=host" is set, which should not be done for this type of setup (when Nextcloud and the ExApps are on the same host).
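One way to check the resolution described above is to look up a container name from the Nextcloud host and see whether the answer is a private address. The following is a minimal sketch using only the Python standard library; the helper names `is_private_address` and `check_name` are illustrative and not part of AppAPI.

```python
import ipaddress
import socket


def is_private_address(ip: str) -> bool:
    """Return True if ip lies in a private/local range (e.g. a Docker bridge subnet)."""
    return ipaddress.ip_address(ip).is_private


def check_name(hostname: str) -> str:
    """Resolve hostname and report whether it points to a local or a public address."""
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        return f"{hostname}: cannot be resolved ({exc})"
    kind = "private/local" if is_private_address(ip) else "PUBLIC"
    return f"{hostname} -> {ip} ({kind})"
```

Calling `check_name("nextcloud-appapi-dsp")` on the Nextcloud host would then show at a glance whether the name resolves to something like a Docker bridge address or to the public IP of the domain.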

architectonio commented 2 months ago

VerifyConnection button works in your case only for the reason that you specified "/var/run/docker.sock" in Host

I suggest removing this daemon, deploying this container https://github.com/cloud-py-api/docker-socket-proxy and then creating a daemon with Host: nextcloud-appapi-dsp:2375.

After that "VerifyConnection" button will try to connect to nextcloud-appapi-dsp:2375 which will fail, I guess...

Something is resolving all those DNS names in your system to 84.170.215.125 - you need to find out what that is.

I created the network "nextcloud-aio" and this is what a "docker inspect" gives back

docker network inspect nextcloud-aio

[
    {
        "Name": "nextcloud-aio",
        "Id": "5d26f704ae6c26bd2eb55e8b2389b040d36c19caec5e392e47732d9f795c9e64",
        "Created": "2024-06-14T10:08:31.757855622+02:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }
]

I removed the daemon and redeployed as "nextcloud-appapi-dsp:2375".

"docker ps" shows it up and running

3d02fb1d6b04   ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release   "/bin/bash start.sh"     2 weeks ago   Up 18 minutes (healthy)   0.0.0.0:2375->2375/tcp, :::2375->2375/tcp     nextcloud-appapi-dsp

A "nmap nextcloud-appapi-dsp -p2375" gives back:

PORT     STATE    SERVICE
2375/tcp filtered docker
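For readers who do not have nmap at hand, roughly the same open/closed/filtered distinction can be made with a plain TCP connect. This is a sketch under the assumption that a timeout means the packets are being dropped somewhere (which is what nmap reports as "filtered"); the function name `probe_tcp_port` is hypothetical.

```python
import socket


def probe_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port roughly the way nmap does: open, closed, or filtered."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # handshake completed, something is listening
    except ConnectionRefusedError:
        return "closed"          # host answered, but nothing listens on the port
    except socket.timeout:
        return "filtered"        # no answer at all, typically dropped by a firewall
    except OSError as exc:
        return f"error: {exc}"   # e.g. name resolution failure or unreachable network
```

Running `probe_tcp_port("nextcloud-appapi-dsp", 2375)` and `probe_tcp_port("127.0.0.1", 2375)` side by side helps separate a name-resolution problem from a firewall problem.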

And a ping shows my (currently) public IP Address "ping nextcloud-appapi-dsp"

PING nextcloud-appapi-dsp.architectonio.net (93.224.198.141) 56(84) bytes of data.
64 bytes from p5de0c68d.dip0.t-ipconnect.de (93.224.198.141): icmp_seq=1 ttl=63 time=1.29 ms
64 bytes from p5de0c68d.dip0.t-ipconnect.de (93.224.198.141): icmp_seq=2 ttl=63 time=1.86 ms
^C64 bytes from 93.224.198.141: icmp_seq=3 ttl=63 time=1.28 ms

And NextCloud ExtApp Dashboard shows: "All ExApps are up-to-date. Default Deploy daemon is not accessible "

I do not know if something is wrong with my server's network configuration; however, everything else runs smoothly, without any noticeable issue.

architectonio commented 2 months ago

What I also noticed is that both the "Test Deploy" and "Context Chat Backend" containers have no port exposed, while all other containers expose a port. Is that OK?

docker ps

05fb1eb96a0b   ghcr.io/nextcloud/context_chat_backend:2.1.1   "python3 main.py"   About an hour ago   Up 17 seconds   nc_app_context_chat_backend
c35572f9cd25   ghcr.io/cloud-py-api/test-deploy-cuda:release   "python3 main.py"   11 hours ago   Up 18 seconds (healthy)   nc_app_test-deploy
c9ded5aa33ae   ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release   "/bin/bash start.sh"   12 hours ago   Up 11 seconds (healthy)   0.0.0.0:2375->2375/tcp, :::2375->2375/tcp   nextcloud-appapi-dsp
4d936805fe6e   localai/localai:master-aio-gpu-nvidia-cuda-12   "/aio/entrypoint.sh"   2 weeks ago   Up 14 seconds (health: starting)   0.0.0.0:28890->8080/tcp, :::28890->8080/tcp   local-ai
1b3b3efb8d7f   collabora/code   "/start-collabora-on…"   2 weeks ago   Up 18 seconds   0.0.0.0:9980->9980/tcp, :::9980->9980/tcp   collabora-code

Another point I do not really catch is the "nextcloud-aio" network. I created it and attached it to the "Docker Socket Proxy" container, but it is not clear to me why I cannot use another Docker bridge network, since my Nextcloud installation is not an AIO but a bare installation, and all other containers, including COLLABORA-CODE and LOCAL-AI, work very well.

architectonio commented 2 months ago

Any news on this?

ericmail84 commented 2 months ago

I guess this is a reason: "NetworkMode": "bridge"

Ok, we need to move the note about bridge from here: https://cloud-py-api.github.io/app_api/DeployConfigurations.html to somewhere else to be more visible...

Am I reading this correctly that the docker socket proxy cannot be on a remote host? Test deploy seems to fail for me, much as in the original post here, because it cannot resolve the name http://test-deploy:23000

architectonio commented 2 months ago

I don't know if DSP must run on the local host, however my DSP runs on the local host and I tried with "host", "bridge" and also "nextcloud-aio". The issue remains the same, "Test Deploy" and "Context Chat Backend" are deployed but not reachable. I also noticed that "Context Chat Backend" restarts every few seconds (by observing the results of "watch -n 0.5 docker ps" ).

andrey18106 commented 2 months ago

@architectonio

Another point I do not really catch is the "nextcloud-aio" network.

It was mentioned on the assumption that you are using Nextcloud AIO, which has this custom network created for the AIO containers.

I don't know if DSP must run on the local host

The purpose of the DSP is to give AppAPI secure access to Docker over the network; it can be local or remote.

I also noticed that "Context Chat Backend" restarts every few seconds (by observing the results of "watch -n 0.5 docker ps" ).

Are there any logs or errors that can give us a hint? Are there any related errors in the Docker system logs (journalctl -u docker) or from the Context Chat Backend container?

For now I can't say more than was said before on how to investigate networking issues. Since the daemon connection is fine and the deployment works, the issue is only in the communication between the ExApp and NC, which is likely down to some specifics of a particular system setup. I'll get back to you as soon as I find something.

ericmail84 commented 2 months ago

Well, on my end, for example, health check seems to be looking for http://test-deploy:23000 and http://context-chat-backend:23000 which it will never find because those are not in the same network as Nextcloud (they are on the remote host at the IP address I gave Nextcloud when I set up the socket proxy). So I’m not sure what can be done to have the health check seek the right thing at the right location. It might help to be able to better direct where the health check looks.

architectonio commented 2 months ago

@andrey18106 Thank you for your reply. I mentioned earlier that I do not use and never have used NextCloud AIO. My installation is directly on the server (Debian, with MariaDB, Apache, PHP and so on).

Are there any logs or errors that can give us a hint? Are there any related errors in the Docker system logs (journalctl -u docker) or from the Context Chat Backend container?

As I wrote before, docker works perfectly, with no network or other issue. I currently have about ten applications on docker including LocalAI (CUDA), Collabora CODE, Home Assistant, Libretranslate, SearxNG, and so on.

A "docker ps" gives back:

CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
123456789012   ghcr.io/cloud-py-api/test-deploy-cuda:release   "python3 main.py"   45 seconds ago   Up 43 seconds (healthy)   nc_app_test-deploy
1234567890ab   localai/localai:master-aio-gpu-nvidia-cuda-12   "/aio/entrypoint.sh"   4 days ago   Up 4 days (healthy)   0.0.0.0:28890->8080/tcp, :::28890->8080/tcp   local-ai
1234567890cd   ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release   "/bin/bash start.sh"   4 days ago   Up 4 days (healthy)   0.0.0.0:2375->2375/tcp, :::2375->2375/tcp   nextcloud-appapi-dsp
...................................

Notice that the Test Deploy has no network or port

Here is the "docker network list":

NETWORK ID     NAME                    DRIVER    SCOPE
a9421922f6d8   bridge                  bridge    local
371bcd681096   host                    host      local
c92d51732e32   localai-webui_default   bridge    local
5d26f704ae6c   nextcloud-aio           bridge    local
62ccc5fcbaa2   none                    null      local

architectonio commented 2 months ago

This is the running DSP (I removed Test-Deploy and Context_Chat Backend) which is reachable

ExApps installed: 0
Name: docker_socket_proxy
Protocol: http
Host: 127.0.0.1:2375
Deploy config
Docker network: bridge
Nextcloud URL: https://nextcloud.mydomain.net
HaProxy password: 12345678
GPUs support: true
Compute device: CUDA (NVIDIA

ericmail84 commented 2 months ago

I think they suggested not to use bridge because bridge won't look things up by container name. I set mine to master_default, but no difference in the behavior.

architectonio commented 2 months ago

I tried a lot of possible combinations, all without success. I guess I have invested at least 40 hours in testing. I have now deleted everything and will wait until NextCloud releases documentation that explains clearly what to do, in a way that works.

gitwittidbit commented 2 weeks ago

Same issue here. I have NC running in a VM (direct install, no docker). And I have docker for running this ExApp stuff.

The daemon connection test is successful but the deployment test fails after a long time during the heartbeat check.

No idea what else to try.

architectonio commented 2 weeks ago

Same issue here. I have NC running in a VM (direct install, no docker). And I have docker for running this ExApp stuff.

The daemon connection test is successful but the deployment test fails after a long time during the heartbeat check.

No idea what else to try.

I had exactly the same issue.

bigcat88 commented 2 weeks ago

Same issue here. I have NC running in a VM (direct install, no docker). And I have docker for running this ExApp stuff.

The daemon connection test is successful but the deployment test fails after a long time during the heartbeat check.

No idea what else to try.

Please create a separate issue describing your configuration. Without NC logs, container info/logs and information about the setup we can't do much.

I had exactly the same issue.

Have you tried the latest version, 3.1.0 (where we fixed a critical bug with APCu)? Does the heartbeat still fail?

If you tried and it still didn't work, one option I can offer, if you have the opportunity, is for you to give us VPN access to the test environment where it fails, and we'll try to figure out what the reason might be.

But with version 3.1.0 everything has already worked for most people, I hope that we can help you too.

architectonio commented 2 weeks ago

I had exactly the same issue.

Have you tried the latest version, 3.1.0 (where we fixed a critical bug with APCu)? Does the heartbeat still fail?

If you tried and it still didn't work, one option I can offer, if you have the opportunity, is for you to give us VPN access to the test environment where it fails, and we'll try to figure out what the reason might be.

But with version 3.1.0 everything has already worked for most people, I hope that we can help you too.

Yes I have tried with the latest version 3.1.0, however I just used the already deployed Docker Socket Proxy and I do not know if it affects in any way the AppAPI. Tomorrow I am going to re-deploy the DSP and watch what happens.

Would it be worth deleting the NextCloud Assistant, NextCloud Assistant Context Chat and AppAPI and then reinstalling them?

gitwittidbit commented 2 weeks ago

But with version 3.1.0 everything has already worked for most people, I hope that we can help you too.

Yes, I'm on 3.1.0 (I also updated NC to 29.0.8).

One thing I noticed is that the nextcloud-appapi-dsp container becomes unhealthy relatively quickly. Not sure, if this has anything to do with the issue? (I downloaded the most recent image and updated the container but it still becomes unhealthy a minute after starting or so)

bigcat88 commented 2 weeks ago

You have the address on which docker-socket-proxy listens.

Look in the DB to see which port is assigned to the test-deploy application in the oc_ex_apps table.

Try to do curl 'http://{docker-socket-proxy-address}:{test-deploy-port}/heartbeat' from the Nextcloud instance.

If you use https you need to add authentication to the request with -u app_api_haproxy_user:{your_haproxy_password}

This is literally what AppAPI does on heartbeat.
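The curl check above can also be scripted. This is a minimal sketch in Python using only the standard library; the function names are hypothetical, and the host, port and password are placeholders for your own values (the port comes from the oc_ex_apps table). It mirrors the curl command and is not AppAPI's actual implementation.

```python
import base64
import urllib.error
import urllib.request
from typing import Optional


def build_heartbeat_request(host: str, port: int,
                            haproxy_password: Optional[str] = None,
                            scheme: str = "http") -> urllib.request.Request:
    """Build the GET request for the ExApp /heartbeat endpoint."""
    req = urllib.request.Request(f"{scheme}://{host}:{port}/heartbeat", method="GET")
    if haproxy_password is not None:
        # Equivalent to curl's -u app_api_haproxy_user:{your_haproxy_password}
        creds = f"app_api_haproxy_user:{haproxy_password}".encode()
        req.add_header("Authorization", "Basic " + base64.b64encode(creds).decode())
    return req


def check_heartbeat(req: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return a one-line summary of the outcome."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode(errors="replace")
            return f"HTTP {resp.status}: {body}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}: {exc.reason}"
    except OSError as exc:
        # Covers URLError (refused, DNS failure) and timeouts.
        return f"connection failed: {exc}"
```

Run from the Nextcloud instance, `check_heartbeat(build_heartbeat_request(host, port))` shows whether the ExApp is reachable from where AppAPI would call it.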

To avoid making this issue any longer (it is already 65+ messages), please create a separate issue with your configs posted.

architectonio commented 2 weeks ago

You have the address on which docker-socket-proxy listens.

Look in the DB to see which port is assigned to the test-deploy application in the oc_ex_apps table.

Try to do curl 'http://{docker-socket-proxy-address}:{test-deploy-port}/heartbeat' from the Nextcloud instance.

If you use https you need to add authentication to the request with -u app_api_haproxy_user:{your_haproxy_password}

This is literally what AppAPI does on heartbeat.

To avoid making this issue any longer (it is already 65+ messages), please create a separate issue with your configs posted.

This is what I get with https:

curl https://127.0.0.1:2375/ -u app_api_haproxyuser:mytestpassword
curl: (35) OpenSSL/3.0.13: error:0A00010B:SSL routines::wrong version number

And this with http:

curl http://127.0.0.1:2375/ -u app_api_haproxyuser:mytestpassword

401 Unauthorized

You need a valid user and password to access this content.

The DSP was deployed in this way:

docker run -v /var/run/docker.sock:/var/run/docker.sock -e NC_HAPROXY_PASSWORD="mytestpassword" --restart always --name nextcloud-appapi-dsp -h nextcloud-appapi-dsp --net nextcloud-aio -p 2375:2375 --privileged -d ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release

A docker ps shows:

CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
c011616e7873   ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release   "/bin/bash start.sh"   6 minutes ago   Up 6 minutes (healthy)   0.0.0.0:2375->2375/tcp, :::2375->2375/tcp   nextcloud-appapi-dsp

andrey18106 commented 2 weeks ago

@architectonio Please correct the user name to app_api_haproxy_user and try again. Note: with an HTTPS Docker Socket Proxy you can't use the 127.0.0.1 host. In your case the https error additionally means that the Docker Socket Proxy wasn't set up with SSL enabled (SSL is enabled if /certs/cert.pem is mounted in the container during startup).
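The "wrong version number" error is what a TLS client typically reports when the other end answers in plain HTTP. A small probe can confirm which protocol a port actually speaks. This is a sketch under the assumption that any handshake failure means "not TLS"; the function name `speaks_tls` is hypothetical and not part of AppAPI or the DSP.

```python
import socket
import ssl


def speaks_tls(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake completes on host:port, False if it fails."""
    ctx = ssl.create_default_context()
    # Only the protocol is probed here, so certificate checks are switched off.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # A failure to connect at all is raised to the caller on purpose.
    with socket.create_connection((host, port), timeout=timeout) as raw:
        try:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
        except OSError:
            # ssl.SSLError (e.g. "wrong version number") is a subclass of OSError;
            # resets and handshake timeouts land here as well.
            return False
```

If `speaks_tls("127.0.0.1", 2375)` returns False while the port accepts connections, the daemon should be registered with the http protocol, not https.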

architectonio commented 2 weeks ago

curl http://127.0.0.1:2375/ -u app_api_haproxy_user:mytestpassword

403 Forbidden

Request forbidden by administrative rules.

andrey18106 commented 2 weeks ago

curl http://127.0.0.1:2375/ -u app_api_haproxy_user:mytestpassword

403 Forbidden

Request forbidden by administrative rules.

Your request has no route, which is not allowed, so the 403 response is correct. It also shows that authentication passed.
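The two responses seen in this exchange can be summarised as a small lookup. This sketch only encodes what was observed in this thread (401 with the misspelled user name, 403 for a request without a route); the function name `interpret_dsp_status` is hypothetical, and other status codes are not covered.

```python
def interpret_dsp_status(status: int) -> str:
    """Map an HTTP status returned by the Docker Socket Proxy to its likely meaning."""
    if status == 401:
        return "authentication failed: check the user name (app_api_haproxy_user) and password"
    if status == 403:
        return "authentication passed, but the requested route is not allowed"
    if 200 <= status < 300:
        return "request accepted"
    return f"unexpected status {status}"
```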

architectonio commented 2 weeks ago

I assume this means that the AppAPI and everything that is deployed should work...

architectonio commented 2 weeks ago

Unfortunately the issue persists. Both the "Context Chat Backend" and "Test Deploy" apps are stuck at Healthchecking.

A docker ps shows:

CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
110c19421d84   ghcr.io/nextcloud/context_chat_backend:2.2.1   "python3 main.py"   About a minute ago   Up About a minute   nc_app_context_chat_backend
cf74179d5b31   ghcr.io/cloud-py-api/test-deploy:release-cuda   "python3 main.py"   8 minutes ago   Up 7 minutes (healthy)   nc_app_test-deploy

And the Nextcloud "Your apps" dashboard shows both apps in a Healthchecking loop.