Deployment stuck at 0% - Githubissues

red3333 commented 6 months ago

I followed the tutorial on my Nextcloud 28: installed exApp, created a daemon worker, and added the AiImageGeneratorBot app. After several hours, the progress was still stuck at "0% deploying", and I eventually got a heartbeat failure. I checked my docker install:

a container has been created and started: "nc_app_ai_image_generator_bot"
there doesn't seem to be any activity in it (no memory or disk usage increase, no CPU usage)
logs seem to be ok:

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling transformers.utils.move_cache().
0it [00:00, ?it/s]
INFO: Started server process [1]
INFO: Waiting for application startup.
TRACE: ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
TRACE: ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE: ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:23000 (Press CTRL+C to quit)

Not sure if it is important, but I don't have any GPU on that server, only plenty of RAM and CPU cores. Does anyone have an idea?

bigcat88 commented 6 months ago

Can you describe your configuration? Where is Nextcloud installed, and where is docker located? Also, can you ping the Nextcloud instance from the docker container?

Do you use Docker Socket Proxy or not?

red3333 commented 6 months ago

All machines are on an ESXi hypervisor:

Nextcloud running on apache on 1st machine (connected to Internet)
All containers running on a 2nd machine behind the 1st one (local network) (64GB ram, 28 Cores, no GPU).
I can ping Nextcloud machine from docker
I can curl my Nextcloud main page from docker and from inside the nc_app_ai_image_generator_bot container
I'm using an HTTP connection between exApp and its docker_socket_proxy container worker; it was created with the following command:

docker run -e NC_HAPROXY_PASSWORD="$password" \
  -p 2375:2375 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --name nextcloud-appapi-dsp -h nextcloud-appapi-dsp \
  --restart unless-stopped --privileged -d ghcr.io/cloud-py-api/nextcloud-appapi-dsp:release

bigcat88 commented 6 months ago

I can curl my Nextcloud main page from docker and from inside the nc_app_ai_image_generator_bot container

Strange, then it should work. I assume you can also curl the URL that is set in the Daemon config when you create one? Can you show part of that URL? Maybe it is not a valid one.

red3333 commented 6 months ago

I switched to wget for my tests as the URL is HTTPS: wget https://my.domain.name/index.php gives me the main page from the docker machine, the Daemon container and the nc_app_ai_image_generator_bot container. The same URL is configured in the exApp Daemon configuration (and is put in the NEXTCLOUD_URL= env variable of the generator bot).

I also tried the HTTPS proxy daemon, with the same results as before.

red3333 commented 6 months ago

So I found a problem with cloud-py-api/docker-socket-proxy not forwarding requests to the app (e.g. /heartbeat was seen in the docker-socket-proxy logs, but not in the nc_app_ai_image_generator_bot container). I don't know the exact reason, but replacing localhost with 127.0.0.1 in haproxy_ex_apps.cfg solved the problem.
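One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that localhost resolved to the IPv6 address ::1 inside the proxy container while the app listens only on 127.0.0.1 (as the Uvicorn startup log shows). A minimal Python sketch of the workaround and of a quick resolver check follows. The function names and the sample config line are hypothetical, and the actual location of haproxy_ex_apps.cfg inside the container is not given here.

```python
import re
import socket


def pin_localhost_to_ipv4(cfg_text: str) -> str:
    """Replace the standalone hostname 'localhost' with '127.0.0.1'."""
    # The lookarounds skip longer names that merely contain 'localhost',
    # such as 'mylocalhost' or 'localhost.example'.
    return re.sub(r"(?<![\w.-])localhost(?![\w.-])", "127.0.0.1", cfg_text)


def resolved_addresses(host: str = "localhost") -> list[str]:
    """Return the distinct addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})
```

Calling pin_localhost_to_ipv4 on the text of the config file returns the rewritten text, which can then be written back. Running resolved_addresses() inside the proxy container would show whether ::1 is among the results for localhost.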

Now, my nc_app_ai_image_generator_bot has received the /heartbeat and the /init requests:

TRACE: 127.0.0.1:54674 - HTTP connection made
TRACE: 127.0.0.1:54674 - ASGI [2] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'server': ('127.0.0.1', 23000), 'client': ('127.0.0.1', 54674), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'GET', 'path': '/heartbeat', 'raw_path': b'/heartbeat', 'query_string': b''}
TRACE: 127.0.0.1:54674 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
INFO: 127.0.0.1:54674 - "GET /heartbeat HTTP/1.1" 200 OK
TRACE: 127.0.0.1:54674 - ASGI [2] Send {'type': 'http.response.body', 'body': '<15 bytes>'}
TRACE: 127.0.0.1:54674 - ASGI [2] Completed
TRACE: 127.0.0.1:54674 - ASGI [3] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'server': ('127.0.0.1', 23000), 'client': ('127.0.0.1', 54674), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/init', 'raw_path': b'/init', 'query_string': b''}
TRACE: 127.0.0.1:54674 - ASGI [3] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
INFO: 127.0.0.1:54674 - "POST /init HTTP/1.1" 200 OK
TRACE: 127.0.0.1:54674 - ASGI [3] Send {'type': 'http.response.body', 'body': '<2 bytes>'}
TRACE: 127.0.0.1:54674 - HTTP connection lost
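To confirm at a glance which requests actually reached the app, the access-log entries in the log above can be pulled out of the container output. This is only a sketch: the regular expression is written against the INFO lines shown in this thread, and the helper name requests_seen is hypothetical.

```python
import re

# Matches Uvicorn access-log entries such as:
#   INFO: 127.0.0.1:54674 - "GET /heartbeat HTTP/1.1" 200 OK
ACCESS_LINE = re.compile(
    r'INFO:\s+(?P<client>\S+) - '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def requests_seen(log_text: str) -> list[tuple[str, str, int]]:
    """Return (method, path, status) for every access-log entry found."""
    return [
        (m["method"], m["path"], int(m["status"]))
        for m in ACCESS_LINE.finditer(log_text)
    ]
```

The input could be the saved output of docker logs nc_app_ai_image_generator_bot. If /heartbeat shows up in the proxy logs but not in this list, the request was lost between the proxy and the app.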

... but still no visible activity on nc_app_ai_image_generator_bot : no CPU usage, no RAM usage (actually, some RAM usage, but in OS cache, may be related to other containers), no other message. In Nextcloud, the app is now at "0% initialization", status: "initialization timed out".

red3333 commented 6 months ago

I tracked the error a bit further:

"/heartbeat" works fine.

The "/init" request returns a successful "OK". But then set_init_status generates a traceback:

ERROR: Exception in ASGI application
Traceback (most recent call last):
[...]
nc_py_api._exceptions.NextcloudException: [401] Unauthorized <request: PUT /ocs/v1.php/apps/app_api/apps/status/ai_image_generator_bot>

models--stabilityai--sdxl-turbo is therefore not downloaded. I then managed to download the models from outside the container and put the result in the persistent storage of the container.
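The thread does not say how the models were fetched outside the container. One way to do it, sketched here under assumptions, is the snapshot_download function from the huggingface_hub package, which writes a models--<org>--<name> folder under the given cache directory. The repo ID stabilityai/sdxl-turbo is inferred from the folder name above, and the helper names and target directory are hypothetical.

```python
from pathlib import Path


def hf_cache_dir_name(repo_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repo."""
    return "models--" + repo_id.replace("/", "--")


def download_model(repo_id: str, cache_dir: Path) -> Path:
    """Download a full model snapshot into cache_dir (needs network access)."""
    # Imported lazily so hf_cache_dir_name works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, cache_dir=str(cache_dir)))
```

For example, download_model("stabilityai/sdxl-turbo", Path("./models")) would create ./models/models--stabilityai--sdxl-turbo, which could then be copied into the container's persistent storage. Where the app expects that folder is not stated in this thread.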

After a (long) time, the app status becomes "Initialization timeout 0%" and it keeps waiting. However, the Nextcloud app_api then sends a "/enabled?enabled=1" request, which returns a successful "OK". In Nextcloud, the app remains at "Initialization timeout", but the bot can be added to the Talk app, and requests (e.g. @image cinematic portrait of fluffy cat with black eyes) successfully generate images.

cloud-py-api / ai_image_generator_bot

Deployment stuck at 0% #6