amir20 / dozzle

Realtime log viewer for docker containers.
https://dozzle.dev/
MIT License

Infinite load for a specific container #3385

Open FrancYescO opened 1 week ago

FrancYescO commented 1 week ago

🔍 Check for existing issues

How is Dozzle deployed?

Standalone Deployment

📦 Dozzle version

8.7.5

✅ Command used to run Dozzle

As per the docs

🐛 Describe the bug / provide steps to reproduce it

I have a specific container that does not load logs. It's the same infinite load as when you set a log filter that has no results, or when a container has 0 B of logs (can we also fix these two cases and show the loading bar instead of the spinner once the /stream endpoint connection is fulfilled? :D)... but in this case the logs are populated, it's just that the /stream API returns nothing.

Going backward through versions, I noticed the bug pops up in >=v7.0.5, while v7.0.4 correctly loads the logs of this container.

For reference, this is what I posted here: https://github.com/amir20/dozzle/issues/3222#issuecomment-2471564111

All other containers on the same host show no issue, so I assume something in the log content breaks things (it's a pretty standard nginx log output)... this is how it currently ends:

...
2023-06-01T13:42:08.858573533Z 172.18.0.13 - - [01/Jun/2023:13:42:08 +0000] "GET /api/rpc/command/get-file-libraries?file-id=3b37f927-c632-809f-8002-67f1ddd6fbf5&features=storage/pointer-map HTTP/1.1" 200 113778 "-" "Mozilla/5.0 (X11; Linux x86_64) Chrome/113.0.0.0 Safari/537.36"
2023-09-21T10:01:32.969509513Z 2023/09/21 10:01:26 [emerg] 1#1: host not found in upstream "penpot-backend" in /etc/nginx/nginx.conf:85
2023-09-21T10:01:33.372310975Z nginx: [emerg] host not found in upstream "penpot-backend" in /etc/nginx/nginx.conf:85
2023-10-03T09:02:23.994639918Z 2023/10/03 09:02:23 [emerg] 1#1: host not found in upstream "penpot-backend" in /etc/nginx/nginx.conf:85
2023-10-03T09:02:24.415120539Z nginx: [emerg] host not found in upstream "penpot-backend" in /etc/nginx/nginx.conf:85

💻 Environment

-

📸 If applicable, add screenshots to help explain your bug

No response

📜 If applicable, attach your Dozzle logs. You may need to enable debug mode. See https://dozzle.dev/guide/debugging.

No response

amir20 commented 1 week ago

I reviewed the differences between versions 7.0.4 and 7.0.5 at https://github.com/amir20/dozzle/compare/v7.0.4...v7.0.5. One thing stood out: in version 7.0.5, I follow containers and only stream logs since the container started. This change prevents old logs from appearing when containers are restarted. I doubt this is the source of the problem.
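
For context, this is roughly what "only stream logs since the container started" looks like with the Docker Go SDK. It is a minimal sketch, not Dozzle's actual implementation, and it assumes a recent SDK where the log options type is container.LogsOptions:

package main

import (
    "context"
    "io"
    "os"

    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

// streamSince follows a container's logs, but only since the given start time,
// mirroring the v7.0.5 behavior described above. Not Dozzle's actual code.
func streamSince(ctx context.Context, cli *client.Client, id, startedAt string) error {
    // startedAt would normally come from ContainerInspect(...).State.StartedAt.
    rc, err := cli.ContainerLogs(ctx, id, container.LogsOptions{
        ShowStdout: true,
        ShowStderr: true,
        Follow:     true,
        Timestamps: true,
        Since:      startedAt, // anything logged before this timestamp is skipped
    })
    if err != nil {
        return err
    }
    defer rc.Close()
    // The returned stream is multiplexed; a real reader would demux it with stdcopy.StdCopy.
    _, err = io.Copy(os.Stdout, rc)
    return err
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }
    // Placeholder container ID and timestamp, for illustration only.
    if err := streamSince(context.Background(), cli, "e9fbd40f2b29", "2024-11-22T14:35:19Z"); err != nil {
        panic(err)
    }
}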

I need a way to reproduce this issue on my end.

  1. How are you starting Nginx?
  2. If you create a dummy Nginx instance, does the same issue occur?
  3. Have you confirmed that no filters, log levels, or stdout/stderr toggles are set?
  4. What does the API endpoint /stream return?

I can't think of anything else.

FrancYescO commented 1 week ago

1) It's not a vanilla nginx; it's the one embedded in the frontend of this stack: https://github.com/penpot/penpot/blob/develop/docker/images/docker-compose.yaml, run via Portainer and with restart: unless-stopped instead of always

2) No. I have ~30 services on this host; only this container has the issue.

3) Sure.

4) After ~6 s, it returns just pings (:ping)

amir20 commented 1 week ago

I am stumped on this one. It seems like your frontend container is based on nginxinc/nginx-unprivileged, and docker run nginxinc/nginx-unprivileged seems to work with Dozzle.

Not sure what can be done to debug. Maybe completely destroy the container and recreate it.

FrancYescO commented 1 week ago

I've recreated the container and the bug (but also the older logs :D) disappeared, so now I have no way to replicate it. We don't know the culprit, but maybe we can just close this issue...

FrancYescO commented 1 week ago

I'm reusing this issue, but I'm not sure this is the same bug... Today with 8.8.0 I was trying to see the logs of a specific container and figured out that it was searching for the logs of the wrong, already-destroyed container ID:

I had my container e9fbd40f2b29 running, then deployed a new image via Portainer, so container e9fbd40f2b29 got stopped and recreated with a new ID, 20c05b0abb96. Dozzle did not switch to the new container, and even after refreshing the page/clearing the cache it looked like it was still seeing the old ID e9fbd40f2b29 as running (but with no logs).

Each click on the container in the sidebar results in the Dozzle logs saying that the container ID was not found, but the "Container not found" message did not show in the web page (it just loaded for a few seconds and then showed no logs, which is not the same as putting a non-existent ID in the URL).

Recreating the container did not help this time; instead, restarting the Dozzle container fixed the issue... Is there some sort of cache somewhere?

amir20 commented 1 week ago

Hey @FrancYescO,

I haven't seen this issue personally. Someone else created it a while back, but I can't find it right now. I'll need your help to debug. If I can't reproduce the issue, it will be difficult to fix.

Yes, Dozzle has an internal "store" that caches and stores all statistics for containers. It usually works well. I listen for specific Docker events to update the internal store.

It seems like there might be an issue with Portainer, as the other case I remember was similar.

Can you try to reproduce this state again? If you can consistently get into this state, it would be helpful to enable DEBUG logs for Dozzle and also keep docker system events open while this is happening.

Most likely, some unusual event is occurring, and Dozzle isn't refreshing its own store.

Thanks!

FrancYescO commented 1 week ago

I'll keep an eye on it to find a way to easily reproduce it... What I can say for sure is that this seems to be the first time I've seen the issue, and my deployment has always been done the same way.

listen for specific Docker events

What events are you listening to for destroy and recreate? ... Maybe there is some edge-case event that is getting ignored.

amir20 commented 1 week ago

Dozzle listens to start, destroy, die, rename and a few other ones.

You can find it at https://github.com/amir20/dozzle/blob/7ccc3121904e11b3230e1cd2fae9d2b08f863d52/internal/docker/container_store.go#L180-L234
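
For reference, here is a simplified sketch of subscribing to Docker container events with the Go SDK, in the spirit of the linked container_store.go. Which actions are handled versus ignored below is illustrative only, and it assumes a recent SDK where the options type is events.ListOptions:

package main

import (
    "context"
    "log"

    "github.com/docker/docker/api/types/events"
    "github.com/docker/docker/api/types/filters"
    "github.com/docker/docker/client"
)

// watchEvents subscribes to container events and reacts to the ones a store
// would care about. See the linked container_store.go for the real logic.
func watchEvents(ctx context.Context, cli *client.Client) {
    f := filters.NewArgs()
    f.Add("type", "container")

    msgs, errs := cli.Events(ctx, events.ListOptions{Filters: f})
    for {
        select {
        case msg := <-msgs:
            id := msg.Actor.ID
            switch msg.Action {
            case "start":
                log.Printf("start %s: add or refresh it in the store", id)
            case "die":
                log.Printf("die %s: mark it as exited", id)
            case "destroy":
                log.Printf("destroy %s: remove it from the store", id)
            case "rename":
                log.Printf("rename %s: update its name", id)
            default:
                // create, kill, stop, ... are ignored in this sketch
            }
        case err := <-errs:
            log.Printf("event stream error, would need to resubscribe: %v", err)
            return
        }
    }
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        log.Fatal(err)
    }
    watchEvents(context.Background(), cli)
}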

FrancYescO commented 1 week ago

Btw just happened again...

I'll keep trying to isolate the issue, but I'm pretty sure it's something with Portainer's webhook redeploy (or recreate?) methods.

FrancYescO commented 6 days ago

I'm not 100% sure, but this should be an extract from when the bug happened... Maybe we miss kill? Or does the rename after kill-stop-die-destroy do some magic?

...
2024-11-20T10:41:57.305370996+01:00 image pull registry.org/image:latest
2024-11-20T10:41:57.418400610+01:00 container create 6e5d2f7db6f5a8b2d1b0ae6a42aed70dbb7f039fb51a79599e274e2d4b5f98b4
2024-11-20T10:41:57.436166302+01:00 container kill 21a85032fee12773a36b060f769dc7b5c7ad72af0342333abb7642d2d9eef6a9
2024-11-20T10:41:58.120718193+01:00 container stop 21a85032fee12773a36b060f769dc7b5c7ad72af0342333abb7642d2d9eef6a9
2024-11-20T10:41:58.125265685+01:00 container die 21a85032fee12773a36b060f769dc7b5c7ad72af0342333abb7642d2d9eef6a9
2024-11-20T10:41:58.136485321+01:00 container destroy 21a85032fee12773a36b060f769dc7b5c7ad72af0342333abb7642d2d9eef6a9
2024-11-20T10:41:58.145454536+01:00 container rename 6e5d2f7db6f5a8b2d1b0ae6a42aed70dbb7f039fb51a79599e274e2d4b5f98b4
2024-11-20T10:41:58.804224198+01:00 container start 6e5d2f7db6f5a8b2d1b0ae6a42aed70dbb7f039fb51a79599e274e2d4b5f98b4
...

amir20 commented 6 days ago

Hmmm....

I don't see anything wrong here. I see two containers: one starting with 21 and the other with 6e.

  1. 6e gets created. This event is ignored by Dozzle.
  2. Then renamed, which gets picked up but ignored because Dozzle never picked up the create event.
  3. Then started, which should be handled correctly.

And then the other container:

  1. Container 21 gets killed, which is ignored. Then stopped, which is also ignored.
  2. die should be picked up.
  3. And finally destroy should remove it.

One thing you can do is run Dozzle with --level trace. You should see entries like "event=.... received container event", which output all the events Dozzle sees.

Perhaps compare those with docker system events. There might be something else happening. How are you getting into this state? I tried setting up Portainer a while ago, but it just seemed more complicated and I couldn't figure out how to use it properly.

FrancYescO commented 6 days ago

It happened two times during a stack update via the Portainer webhook (you have to register at least a free account for the Business Edition).

But this happens various times during the day and everything keeps working correctly, so I'm still not able to find an easily reproducible path...

amir20 commented 6 days ago

But this happens various times during the day and everything keeps working correctly, so I'm still not able to find an easily reproducible path.

I wonder if this is a race issue, as it occurs sometimes but not always.

  1. What specific action are you taking on Portainer? Are you updating a container?
  2. And I know you said this earlier, but what is the broken behavior?
  3. Is it that the IDs and names are incorrect? Or something else?

FrancYescO commented 6 days ago

1) CI pushes an image > calls the Portainer stack webhook > Portainer pulls and redeploys the stack (with the newly pushed image)
2) The container is shown in the sidebar but does not load logs, as it points to a wrong container ID
3) The ID (the one also shown in the browser URL)

amir20 commented 6 days ago

I see. So wrong ID, correct name. I didn't realize this is part of a stack. I can try it out.

amir20 commented 5 days ago

@FrancYescO, is dozzle part of your stack, or is it always running separately, and the stack is updated independently?

FrancYescO commented 5 days ago

@FrancYescO, is dozzle part of your stack, or is it always running separately, and the stack is updated independently?

I have various stacks, and Dozzle is part of another stack.

amir20 commented 5 days ago

@FrancYescO, no luck. I set up Portainer Community Edition and created a new stack like this:

services:
  echo:
    image: amir20/echo:latest
    command: -r
    deploy:
      replicas: 3 # Specifies three replicas
      update_config:
        parallelism: 1 # Update one container at a time
        delay: 10s # Wait 10s between updating each container
      restart_policy:
        condition: on-failure
        max_attempts: 3

Everything was working well in Dozzle. I redeployed the stack by changing amir20/echo:latest to amir20/echo:v2. I checked the logs, and everything looked fine. In Dozzle, all the IDs matched those in docker ps, and everything functioned properly.

I wonder if the issue is related to the webhooks. Unfortunately, that's beyond the time I have available. I can't seem to reproduce it easily. If you find a reliable way to reproduce it without setting up the entire pipeline, that would be helpful.

FrancYescO commented 5 days ago

In my case the image always has the same "latest" tag... but I don't think that matters... This happened 2 more times today (while the stack was probably redeployed 20+ more times with no Dozzle issue), but I was still not tracing the events properly. I'll update here if I find something more.

FrancYescO commented 5 days ago

Ah, what I should definitely mention is that during the Portainer redeploy you will NEVER be redirected to the new container (but this may be a separate issue related to https://github.com/amir20/dozzle/issues/3036).

FrancYescO commented 4 days ago

Here we are again, and now I'm sure. Here is the result of the Dozzle log filtered with grep -E "e4c57f33f6d8|c8cabeb803ed|Dozzle":

e4c57f33f6d8 > the container that remained orphaned after the stack update
c8cabeb803ed > the container that replaced e4c57f33f6d8
Dozzle > just to show when I restarted Dozzle

"Error response from daemon: No such container: e4c57f33f6d8" is me clicking on the cointainer in the sidebar and getting no logs...

It looks like Dozzle never received events from c8cabeb803ed; the only thing I see before the Dozzle restart is "starting to stream stats"...

2024-11-22T09:01:50.058746494Z {"level":"info","time":"2024-11-22T09:01:50Z","message":"Dozzle version v8.8.1"}
2024-11-22T11:04:52.017508387Z {"level":"info","time":"2024-11-22T11:04:52Z","message":"Dozzle version v8.8.1"}
2024-11-22T14:03:57.154746555Z {"level":"trace","event":"create","id":"e4c57f33f6d8","time":"2024-11-22T14:03:57Z","message":"received container event"}
2024-11-22T14:03:57.899557141Z {"level":"trace","event":"rename","id":"e4c57f33f6d8","time":"2024-11-22T14:03:57Z","message":"received container event"}
2024-11-22T14:03:57.899577988Z {"level":"debug","action":"rename","id":"e4c57f33f6d8","time":"2024-11-22T14:03:57Z","message":"container event"}
2024-11-22T14:03:57.899582791Z {"level":"debug","action":"rename","id":"e4c57f33f6d8","time":"2024-11-22T14:03:57Z","message":"container event"}
2024-11-22T14:03:58.380355811Z {"level":"debug","container":"e4c57f33f6d8","host":"sasasa","time":"2024-11-22T14:03:58Z","message":"starting to stream stats"}
2024-11-22T14:03:58.380398024Z {"level":"trace","event":"start","id":"e4c57f33f6d8","time":"2024-11-22T14:03:58Z","message":"received container event"}
2024-11-22T14:03:58.380429289Z {"level":"debug","id":"e4c57f33f6d8","time":"2024-11-22T14:03:58Z","message":"Finding container"}
2024-11-22T14:03:58.402801270Z {"level":"debug","id":"e4c57f33f6d8","time":"2024-11-22T14:03:58Z","message":"container started"}
2024-11-22T14:03:58.402823770Z {"level":"debug","action":"start","id":"e4c57f33f6d8","time":"2024-11-22T14:03:58Z","message":"container event"}
2024-11-22T14:03:58.402845560Z {"level":"debug","action":"start","id":"e4c57f33f6d8","time":"2024-11-22T14:03:58Z","message":"container event"}
2024-11-22T14:18:38.038759596Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:18:38Z","message":"Streaming logs for container"}
2024-11-22T14:28:53.477980308Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:28:53Z","message":"Streaming logs for container"}
2024-11-22T14:34:27.811490031Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:34:27Z","message":"Streaming logs for container"}
2024-11-22T14:35:18.459489416Z {"level":"debug","container":"e4c57f33f6d8","time":"2024-11-22T14:35:18Z","message":"streaming ended"}
2024-11-22T14:35:18.459538628Z {"level":"debug","event":"container-stopped","container":"e4c57f33f6d8","time":"2024-11-22T14:35:18Z","message":"received event"}
2024-11-22T14:35:19.570267074Z {"level":"debug","container":"c8cabeb803ed","host":"sasasa","time":"2024-11-22T14:35:19Z","message":"starting to stream stats"}
2024-11-22T14:38:30.085802175Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:38:30Z","message":"Streaming logs for container"}
2024-11-22T14:38:30.086887681Z {"level":"error","error":"Error response from daemon: No such container: e4c57f33f6d8","container":"e4c57f33f6d8","time":"2024-11-22T14:38:30Z","message":"unknown error while streaming logs"}
2024-11-22T14:39:16.178559669Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:39:16Z","message":"Streaming logs for container"}
2024-11-22T14:39:16.180725431Z {"level":"error","error":"Error response from daemon: No such container: e4c57f33f6d8","container":"e4c57f33f6d8","time":"2024-11-22T14:39:16Z","message":"unknown error while streaming logs"}
2024-11-22T14:39:26.350330671Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:39:26Z","message":"Streaming logs for container"}
2024-11-22T14:39:26.351109664Z {"level":"error","error":"Error response from daemon: No such container: e4c57f33f6d8","container":"e4c57f33f6d8","time":"2024-11-22T14:39:26Z","message":"unknown error while streaming logs"}
2024-11-22T14:42:49.928094353Z {"level":"debug","id":"e4c57f33f6d8","since":"2024-11-22T14:03:58Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:42:49Z","message":"Streaming logs for container"}
2024-11-22T14:42:49.929305570Z {"level":"error","error":"Error response from daemon: No such container: e4c57f33f6d8","container":"e4c57f33f6d8","time":"2024-11-22T14:42:49Z","message":"unknown error while streaming logs"}
2024-11-22T14:43:40.137630195Z {"level":"info","time":"2024-11-22T14:43:40Z","message":"Dozzle version v8.8.1"}
2024-11-22T14:43:40.209060944Z {"level":"debug","id":"c8cabeb803ed","time":"2024-11-22T14:43:40Z","message":"Finding container"}
2024-11-22T14:43:43.596250130Z {"level":"debug","container":"c8cabeb803ed","host":"sasasa","time":"2024-11-22T14:43:43Z","message":"starting to stream stats"}
2024-11-22T14:43:55.880136993Z {"level":"debug","id":"c8cabeb803ed","since":"2024-11-22T14:35:19Z","stdType":"all","host":"sasasa","time":"2024-11-22T14:43:55Z","message":"Streaming logs for container"}
amir20 commented 4 days ago

Can you attach the whole logs? I am looking for other errors such as "docker store unexpectedly disconnected from docker events". It doesn't make sense that there is a bunch of "received container event" entries and then they suddenly stop.

FrancYescO commented 4 days ago

dozzle-2024-11-22T14-39-12_.log.zip

amir20 commented 4 days ago

OK it's not what I thought. But I have a wild guess. Give me a little time. Let me check something and provide a PR.

amir20 commented 4 days ago

https://github.com/amir20/dozzle/pull/3417 I am just taking a guess here. I noticed that after e4c57f33f6d8 started, no other events were received. So I thought maybe it's blocking for some reason.

I don't know why it would be but this would be a fix. Can you try pr-3417?

FrancYescO commented 4 days ago

ok i'll leave this pr running, i'll update if happens again...

amir20 commented 4 days ago

If it works, it will drive me crazy figuring out why, because currently it shouldn't block.

By the way, how many people are using Dozzle in a browser? Is it just you or is it a few?

FrancYescO commented 4 days ago

A few, but often non-concurrent.

amir20 commented 4 days ago

Hmm, okay. That could explain what I am seeing. I'll add more debugging logs later. If my intuition is correct, a deadlock occurs after a new container is found because multiple sessions are trying to write to a channel.
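
A hypothetical, minimal illustration of that kind of deadlock (not Dozzle's actual code): if the event fan-out loop does a blocking send to a per-session channel and one session has stopped reading, the whole loop stalls, and later events, such as the new container's start, are never processed.

package main

import (
    "fmt"
    "time"
)

type event struct{ action, id string }

// fanOut delivers each incoming event to every subscriber with a blocking send.
// If any subscriber stops reading, delivery to everyone else stops too.
func fanOut(in <-chan event, subscribers []chan event) {
    for ev := range in {
        for _, sub := range subscribers {
            sub <- ev // blocks forever if sub is never read

            // A common mitigation is a non-blocking send, so a stuck
            // subscriber only loses events instead of stalling the loop:
            //
            //   select {
            //   case sub <- ev:
            //   default: // drop or log instead of blocking
            //   }
        }
    }
}

func main() {
    in := make(chan event)
    active := make(chan event, 1)
    stuck := make(chan event) // unbuffered and never read: simulates a dead session

    go fanOut(in, []chan event{stuck, active})

    in <- event{"start", "c8cabeb803ed"} // fanOut blocks on `stuck` while handling this
    select {
    case ev := <-active:
        fmt.Println("active subscriber got", ev)
    case <-time.After(time.Second):
        fmt.Println("active subscriber got nothing: the fan-out loop is stalled")
    }
}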

amir20 commented 3 days ago

I am guessing that didn't solve your problem. I didn't realize you had originally grepped the logs, so only lines with e4c57f33f6d8 were being displayed. Looking at the full log, it looks fine.

It's crazy that I don't see c8cabeb803ed at all in the logs except for stats. I am pretty stumped right now.

amir20 commented 3 days ago

Alright, I think we need to do some deeper investigation. I have enabled a stack trace dump in https://github.com/amir20/dozzle/pull/3418. You can deploy with pr-3418.

Next time this happens, go to https://<path-to-dozzle>/debug. You should see a page with a link to goroutines. Send me that page. I want to understand why the container store is not receiving any events. This will tell me if that thread is stuck somewhere. In a good state, we should see something like /dozzle/internal/docker/container_store.go:176 and container_store.go:56.
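
For what it's worth, Go's standard net/http/pprof exposes the same kind of goroutine stack dump out of the box. Whether Dozzle's /debug page is built on it is an assumption, but a minimal sketch of such an endpoint looks like this:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Visiting /debug/pprof/goroutine?debug=2 dumps a full stack for every
    // goroutine, which is exactly the kind of output that shows whether the
    // container store's event loop is parked where it should be or stuck.
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

Grepping that dump for container_store.go would show where the store's goroutine is parked.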

Let me know!

amir20 commented 1 day ago

I am going to merge the debug API to master. I originally felt that adding an extra 500 KB wasn't worth it, but whatever. If it had been merged earlier, we could have troubleshot this issue sooner.

So you can try master if you want.

FrancYescO commented 1 day ago

I've deleted my comments; they were not related to this issue... Still waiting for this to happen again.

amir20 commented 5 hours ago

So has this not happened yet?

FrancYescO commented 4 hours ago

Now that I'm waiting for it, no 😅

Unfortunately the stack updates are now fewer per day, so there are fewer chances to retrigger it... and I was just wondering if this can only happen when there are concurrent users using Dozzle on the same container.

amir20 commented 4 hours ago

It would increase the chances, since Dozzle has a complicated backend with all kinds of multi-threading and semaphores. Once it happens, the stack trace will show me whether any of the routines are stuck anywhere. The worst thing that could happen is that you say it didn't work but the stack trace looks right, because then I wouldn't know how to go about investigating.

Let's see how the next few days go.