Background CPU usage while not used

zvirja commented 9 months ago

Describe the bug This is a continuation of #2765.

This is constant CPU usage I observe on my Raspberry PI 4 while having Dozzle UP with around 60 containers on version 6.2.4:

With Dozzle DOWN - just around 1-2% for both dockerd and containerd

@amir20 You mentioned earlier the possibility of turning off background processing, and I think we should revive that conversation.

My main usage of Dozzle is to diagnose things once in a while in case anything goes wrong. Like once or twice a month, sometimes even more rarely. But for those cases I still want to have service available so I can quickly check instead of having to SSH and start it.

For me Dozzle is that kind of software which you use only on-demand in very specific cases, it's not for everyday constant usage, like PiHole DNS or Nginx. So it should only consume CPU while it's being used and we should not pay penalty for just having it hanging around for that "one day".

You mentioned earlier the complications around design, and as a developer myself I can totally see them. But we should look at the broader picture when making a decision; simpler is not necessarily best in every single case. Dozzle is quite a popular product with around 4k stars, so many people are using it. Most of them are probably just like me: they run it and forget about it. Slightly higher CPU usage is not necessarily an issue for any single installation, but in total it's a huge waste of electricity, with an environmental impact in the end. I am too lazy to calculate how much it costs in power consumption for my RPI running 24/7 at around 30% of a single CPU core all the time. But I would imagine that at scale those numbers become significant. And if we have a theoretical design possibility to fix it, I believe we definitely should.

I recently stumbled upon IKEA discontinuing the TRADFRI smart switch, which I use and love a lot, in favor of the bulkier RODRET. I was curious what drove the decision, as the previous design was definitely neater. Then I found a discussion saying it was about the battery: AAA can be rechargeable, while CR2032 is not. One would think it doesn't matter in every single case; most people are probably not using rechargeable batteries anyway and will just throw away AAAs the same way. The new design is definitely worse: uglier, bulkier, more visible on a wall, and I imagine IKEA designers/engineers were not necessarily happy about it compared to the old one. But they still rolled it out, because at huge numbers the impact is visible. When you are used a lot across the planet, you can feel it.

Of course the impact of Dozzle is negligible compared to CR2032 battery pollution. But this project is still popular enough to start thinking about it, and it makes a difference. Wasting CPU and electricity in many data centers or people's apartments just because one design is simpler than another could be OK if you are a niche product, but not when you are used by hundreds of thousands.

Don't get me wrong. I didn't come here to preach to you about environmental impact or how to write software - not at all 😄 It's rather a perspective you should definitely consider, and it's why I believe we should fix CPU usage in the background. 30% might not be a lot, but it could be enough for some people to e.g. throw away their old RPIs and upgrade to newer ones, wasting resources and increasing demand/price, even though the old one could still have been used for some time. And end users cannot change much, only use or not use the software, while we as developers have the power to make this world a bit better 😊

To Reproduce Start Dozzle container, don't use it in browser and just monitor CPU usage

Desktop (please complete the following information):

OS: Debian 12
Docker version: 25.0.3
Version 6.2.4

amir20 commented 9 months ago

@zvirja wow that's a lot to read haha. I was actually in the energy space before too. A couple of thoughts:

#2765 was to minimize Dozzle's CPU usage. I knew I had to come back to containerd.
For me, I have observed that with 100 containers, the virtual machine service on my Mac goes from 20% to 50% CPU with Dozzle. So I think it is worth improving this, as that is a lot for some users.
This is an unfortunate bug with Docker where their API doesn't create a single connection for all stats. Each stat is an API call. You can see this with their CLI too, using docker stats. For me, running docker stats makes CPU jump to 50%. So I am essentially doing the same thing.

Now some possible solutions. I am not a big fan of making users think about flags, configurations, etc. It would be great if Dozzle was smart enough to stop running background processes when unused. Most people just use the default flags with Dozzle, so I don't want to spend too much time on providing flags that don't get used often. Here are some ideas:

Dozzle can be smart enough to stop processing stats after 24 hours? 48 hours? The pro would be that it stops using CPU once idle; the con would be that it still wouldn't satisfy my need when I am trying to identify why a container died.
Another option: only store stats for containers that have been viewed in the UI. The con would be that if a container never dies, it would never stop streaming the stats.
Another option would be to only track "starred" containers. This would be harder since the UI doesn't communicate back with the server. But it could be a good idea if it did.
And obviously, make it a flag to just stop the streaming of stats. I am not a big fan of this because not only does it introduce new flags, but no one will read the docs and I will still get issues asking why CPU is high.

Did I miss any options? What do you think is the best option? I think maybe the simplest would be option 1, to limit to 24 hours.

zvirja commented 9 months ago

Thanks for the fast and very verbose reply! I appreciate your time and the effort you put 😊

Before we go into the options discussion, could you please elaborate a bit on why exactly we have to store stats in the background? Sure, while the page is open, you need that to show stats. But for the remaining logs and status, I thought that could be requested on demand. Like here it says that you can run docker logs on a stopped container, so you can see why it crashed (if you are showing stopped containers in the UI).

I apologize in advance if I miss something obvious, maybe I am not using all the Dozzle features.. Thanks!

amir20 commented 9 months ago

I use Dozzle across all my production instances to check the logs. But even more importantly, I check to see if there are any memory leaks. There are a few major use cases that are missing:

docker logs only returns the logs. But docker stats is needed to understand the "health" of the container. Once a container dies, the stats are gone. Without background processing, Dozzle can only see the stats while the browser is open. But if a container dies while Dozzle is closed, you cannot see the history of CPU/MEM. In fact, no tool does that, unless you want to pay for things like Datadog. The logs don't say OOM. But the graphs show a consistent incline to 99% memory until it exits.
Second, in the future, I want to be able to set up alerting. If a particular container starts using a lot of memory, then I want to know while I am not at my computer. (This is not currently implemented.)
Last, I always felt it was a broken experience that when refreshing tabs, all the historical stats would disappear.

To me, Dozzle should be more than a logging viewer. It should help people debug issues even while they don't have Dozzle open.

amir20 commented 9 months ago

I think it's also worth noting that while testing with ~20 containers on Ubuntu running in AWS, I only see 1% - 2% CPU. So possibly another option is to just limit the number of containers to the most used containers or something else that will use historical data.

On average people are using 35 containers, with the 75th percentile <30. So perhaps this won't really impact a lot of people.

amir20 commented 8 months ago

Any thoughts? I can close if nothing needs to be done.

zvirja commented 8 months ago

Good evening! Sorry for the delay, I was a bit busy with other things.

Thanks for the detailed reply. Now I am starting to understand what the issue is. So Dozzle consists of the following major parts:

Container enumeration + health info
Container log viewer
Container resource usage display + monitoring (MEM + CPU)
Future plans for alerts

It's only the advanced features like the resource usage view or alerts that require the background Docker stats processing, which causes high CPU consumption. The basic container enumeration and log viewing do not require it and would work perfectly fine by just calling the Docker API on demand. Correct me if I am wrong about that.

Your vision is that Dozzle is not just a "docker log viewer", but something more advanced. That is where I believe we could start the first part of the discussion.

I don't know if you have detailed analytics around how many people actually use CPU/RAM stats and if we even have a way to measure that accurately. That would help you with proper analysis on how to position and develop this product in future and which trade-offs to make.

I will describe my personal use case. I have a couple of VPS instances (plus a Raspberry PI at home) where I host services for private usage. 99% of the services are hosted as Docker containers. The restart policies are set to always or unless-stopped, so when a crash happens I might not even notice it (and generally I don't care, as long as the services work). I typically don't care about logs as long as everything works and usually only check them when:

I set up a new service or re-configure the existing ones
Some service does not work as intended, so I check their logs (e.g. some of my custom Telegram bots I host fail, so I check logs to see the detailed exception)
Sometimes to review whether I configured things properly and services like Watchtower do their job.

I don't have a need to monitor memory and CPU, as for my self-hosted scenario I don't have power-intense containers and most of my services are idle for most of the time. When I had a need to monitor CPU/MEM usage (e.g. for Immich), I used Dockprom and that worked perfectly fine. Dockprom also constantly consumed resources (especially cAdvisor), so I turned it off after some time when I confirmed that things are fine.

For me personally Dozzle is just a log viewer and I don't need anything else from it. Mem/CPU graphs are a nice addition, but that's definitely not something I would pay for with constant 30% CPU usage. I am curious how many people are actually using Dozzle exactly as I do.

Currently Dozzle does not do any persistence (at least with example docker-compose.yml), so I don't see this product as a reliable monitoring tool:

If you restart the Dozzle container (or server is rebooted; or tool crashes; or tool is OOM evicted), the stats are lost.
The graph viewer does not show timestamps, and the window seems to cover only around 15-30 mins.
The amount of recorded stats is way smaller than what you collect with Prometheus + cAdvisor (e.g. when using Dockprom, which I mentioned above). For proper diagnostics MEM alone is usually not enough, as there are many different memory stats (committed, virtual, resident, shared, etc.)
Currently the tool cannot compete with the Grafana + Prometheus + cAdvisor experience, and I don't think it's realistic to re-implement that.

If I had a production environment, I would definitely use something like Dockprom and would not rely on Dozzle alone to monitor resource usage. Moreover, I most likely would use Kubernetes, use Prometheus to track stats and collect log messages with appropriate solutions (like Seq) - that would be an entirely different story.

From this perspective I would say that the historical stats view should be an optional thing which should be possible to turn off. If one got it for free or nearly for free, that would be one story. But 30% of CPU is too high a price. Even 1-2% of constant CPU usage (which you measured for ~20 containers, not the average 30) is too much for a feature you don't use. Especially when you have tens or hundreds of thousands of installations (I don't know the real usage stats).

If you ask my opinion, I would be completely fine with recording history only while it's used:

Define the time window size (i.e. how long CPU/MEM graph represents). E.g. 10-15 mins
If you have tabs open - periodically send heartbeat
Track only for [time window size] since last heartbeat

So this way charts data will not disappear on refresh and it will be OK-ish experience.

As for seeing stats for crashed containers right before a crash or future alerts - I would make it an optional module which should be possible to turn off. As I am quite sure that the vast majority of users would never use this tool for that.

But saying all of that, I have to also say the following. You definitely have your personal use case which you develop this product for, I see it. And from what you say, the tool in its current shape solves the problem you have. You build it primarily for your own use case, not for ours. Neither I nor any other random guy from the Internet can dictate to you what to do, as you spend your personal time and it is a free product (at least as of today). It is completely your personal decision where you want to invest your time and how you develop and envision this product. And if for your personal use case 30% of constant CPU is OK (or you simply don't have 60 containers), then it's understandable if you decide to keep it as it is. It would simply be unethical for me to come and say "hey, please remove the feature you use, because it does not work for me". I see all of that.

Probably what I am asking about is to give us a choice to opt out. If we cannot solve the problem nicely to fit us all, then let's have a config flag to turn off the feature. Keep it convenient for yourself, but allow guys like me to tweak your tool. Mention that in the documentation somewhere, so people will make their own choice. Or e.g. introduce a setting like BACKGROUND_MONITORING_CONTAINERS_LIMIT with a reasonable default, so it works OK OOTB and other guys could tweak it to e.g. 0 to turn it off entirely.
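A minimal sketch of how such a setting could be read, assuming it arrives as an environment variable. The variable name is the one proposed above; the default of 50, the function names, and the "first N containers" selection rule are all assumptions made for illustration, and Dozzle has no such option as far as this thread shows.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	envName      = "BACKGROUND_MONITORING_CONTAINERS_LIMIT" // name proposed in this thread
	defaultLimit = 50                                       // assumed default, not a real Dozzle value
)

// parseLimit turns the raw setting into a container limit.
// An empty value falls back to def; 0 disables background monitoring.
func parseLimit(raw string, def int) (int, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %q is not an integer", envName, raw)
	}
	if n < 0 {
		return 0, fmt.Errorf("%s: must be >= 0, got %d", envName, n)
	}
	return n, nil
}

// monitored returns the container IDs that would get background stats.
// Taking the first limit entries is a placeholder selection rule.
func monitored(ids []string, limit int) []string {
	if limit > len(ids) {
		limit = len(ids)
	}
	return ids[:limit]
}

func main() {
	limit, err := parseLimit(os.Getenv(envName), defaultLimit)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ids := []string{"nginx", "pihole", "immich"}
	fmt.Println("monitoring in background:", monitored(ids, limit))
}
```

With the variable set to 0, `monitored` returns an empty slice, which is the "turn it off entirely" case.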

P.S. Sorry if it's a bit too long to read. I am also a bit tired after work and writing such a long message, so if some parts are a bit dumb - sorry for that.

zvirja commented 8 months ago

After thinking a bit more, I feel it's completely OK if you turn off background stats processing entirely via a flag. So you lose data if you refresh the tab - that would still work ideally for my use case and looks like a trivial solution. Users will also be able to clearly see how it works.

That is if you cannot invent anything smarter and are about to give up - then please implement it at least like this.

amir20 commented 8 months ago

Hi @zvirja,

But saying all of that, I have to also say the following. You definitely have your personal use case which you develop this product for, I see it. And from what you say, the tool in its current shape solves the problem you have

That's right, I mostly build for myself. I have gotten an overwhelming number of feature requests though. Yesterday I tried implementing some ideas to see if there would be any complexity. As it turns out, I had missed a big part of why I implemented this background thread. The old behavior was that Dozzle would stream stats for all containers per user. So if there were 100 containers and 3 people using Dozzle, it would be 300 live threads to Docker. With the new background thread, there will always be at most 100 threads.
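The difference between the two models is a fan-out: one collector reads each container's stats once and copies them to every connected user, so upstream load stays at one stream per container regardless of user count. A rough Go sketch of that pattern follows; `Hub`, `Stat`, and the buffer size are hypothetical, and this is not how Dozzle is actually structured.

```go
package main

import (
	"fmt"
	"sync"
)

// Stat is one sample for one container (hypothetical shape).
type Stat struct {
	ContainerID string
	CPUPercent  float64
}

// Hub fans out stats from a single collector to any number of users.
type Hub struct {
	mu   sync.Mutex
	subs map[int]chan Stat
	next int
}

func NewHub() *Hub { return &Hub{subs: make(map[int]chan Stat)} }

// Subscribe registers a user and returns its ID and receive channel.
func (h *Hub) Subscribe() (int, <-chan Stat) {
	h.mu.Lock()
	defer h.mu.Unlock()
	id := h.next
	h.next++
	ch := make(chan Stat, 16) // small buffer so Publish never blocks
	h.subs[id] = ch
	return id, ch
}

// Unsubscribe removes a user and closes its channel.
func (h *Hub) Unsubscribe(id int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if ch, ok := h.subs[id]; ok {
		delete(h.subs, id)
		close(ch)
	}
}

// Publish copies one sample to every subscriber, dropping it for slow ones.
func (h *Hub) Publish(s Stat) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		select {
		case ch <- s:
		default: // subscriber buffer is full; skip this sample
		}
	}
}

// Subscribers reports how many users are currently connected.
func (h *Hub) Subscribers() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.subs)
}

func main() {
	hub := NewHub()
	_, alice := hub.Subscribe()
	_, bob := hub.Subscribe()

	// One upstream read, delivered to both users.
	hub.Publish(Stat{ContainerID: "web", CPUPercent: 12.5})
	fmt.Println((<-alice).ContainerID, (<-bob).ContainerID)
}
```

`Subscribers()` dropping to zero is also the natural signal for any "nobody is listening" shutdown logic.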

So now, based on your feedback, I have to think about balancing multi-user use cases vs a single user with a lot of containers. One of my goals for Dozzle is to be usable by bigger teams, which is why I introduced multi-user authentication and provided other kinds of authentication using a proxy.

Even in the case of introducing a flag, Dozzle would incur a lot of CPU when two or more people are using it. A little bit like a tankless water heater when everybody in the house is taking a shower. :)

That is if you cannot invent anything smarter and are about to give up - then please implement it at least like this.

So I keep coming back to my original hypothesis, where there would be one background process powering all users, but after some time of inactivity it would turn itself off. This way, no flags are needed and it would optimize for all situations. A little bit like a water heater with a tank that we turn off when no one is home.

zvirja commented 8 months ago

Yes, I think using a shared background worker to collect stats and share them with all the clients is definitely the right thing to do. And if that's a given, then it would indeed make sense to have some "inertia", so that you keep collecting stats for some time after nobody is listening anymore. That covers cases when e.g. people refresh their pages and the network is slow.

All of it will still not solve your original problem of diagnostics of stopped containers, as you will not collect stats continuously. I guess that's something you'll have to accept. Or implement it using that idea of "trackable/favorite" containers. Or something else.

Thank you for investing your effort into solving this problem, I do appreciate it a lot! 😊

P.S. I love your analogies to physical things 😊 I am curious what the shared worker thread model would look like with your shower example - is it like everybody washing with the same water, re-used from each other? 🤔😅

amir20 commented 8 months ago

All of it will still not solve your original problem of diagnostics of stopped containers, as you will not collect stats continuously. I guess that's something you'll have to accept. Or implement it using that idea of "trackable/favorite" containers. Or something else.

For me personally, I only use Dozzle in the morning and then throughout the day check for any errors. So I think I can do a hybrid approach.

After a user disconnects, trigger an event.
If there are more than 40 containers, then schedule a shutdown in 12 hours. In your case, this should work perfectly, as you have more than 40 and check once or twice a month. So you should see the usage go down.
In those 12 hours, anybody who refreshes the page should see all stats reloaded.
For anybody who has fewer than 40 containers, I would let Dozzle just keep doing its thing. This works out for me since I have fewer containers, and when <40, it usually doesn't use more than 5%.

Of course, 12 hours and 40 containers were chosen somewhat arbitrarily, based on data.

My analogy would still be reusing a water heater :) In this case, maybe a tank water heater is more efficient with more people showering.

amir20 commented 8 months ago

Created #2780. I realized I should just stop the stats collection, but keep listening for new containers as it keeps the design cleaner.

Can you try amir20/dozzle:pr-2780?

You should be able to go to Dozzle, and after 12 hours see a message like "stopped collecting container stats". This only works if you have more than 50 containers, which after looking at a bunch of data, I think is the right threshold for a hybrid approach.

Not many people have 50+ containers so I think I'll have to rely on you to know if it works.

amir20 commented 8 months ago

Did you try the PR?

amir20 commented 8 months ago

@zvirja I haven't heard back. It looks like this issue doesn't impact a lot of people, so I think I am going to close it. I don't want to add a lot of complexity if not needed.

I think for your use case, since you don't use Dozzle a lot, you might want to consider using https://github.com/acouvreur/sablier where it automatically starts the containers when first used. I was thinking this might actually be a better solution than adding all the complexity to Dozzle for shutting down.

zvirja commented 8 months ago

@amir20 You give me very tough deadlines for responses 😉 I was on a retreat and just came back, so obviously I didn't have time to respond properly. Sorry for making you wait.

I might give sablier a try. If it does what it promises, then it could indeed be a good solution for me personally. It's a heavy gun though. For the Traefik I am using, I would have to install plugins and use static configuration instead of one based on labels. That is painful and inconvenient, as the setup would no longer be self-contained in the Dozzle deployment as it is today.

I still think that generating 30% CPU load (or even a constant 1-2%) for a feature most likely not needed by most people is a bad design choice. I would not do that for the products I develop myself. But OTOH I don't develop products used by such a big audience, so who am I to tell you this (though I co-maintain big libraries like NSubstitute and AutoFixture, but those are for testing, not prod usage).

Would it be possible to just introduce a setting to not run background processing at all? Having all the issues you described above with stats flickering, multi-tab load, etc? Just a kill-switch, would it be possible to do with a simple effort?

amir20 commented 8 months ago

Did you try the PR?

You can still try the PR 🤔

Would it be possible to just introduce a setting to not run background processing at all? Having all the issues you described above with stats flickering, multi-tab load, etc? Just a kill-switch, would it be possible to do with a simple effort?

No, as mentioned earlier, trying to optimize for single thread for many users would be super complicated. I think the only option would be to just disable stats.

But in reality, no one else has complained about this. I think it would make sense to spend time on it if more people have this issue. I can't optimize Dozzle for everybody. So what I do is look for most common features.

zvirja commented 8 months ago

But in reality, no one else has complained about this. I think it would make sense to spend time on it if more people have this issue. I can't optimize Dozzle for everybody. So what I do is look for most common features.

Well, the fact that nobody noticed and reported it (because it's not obvious at all that the dockerd/containerd processes are taking CPU because of Dozzle) does not change the fact that we are wasting electricity for nothing. The premise I described in my initial message still holds. We could ignore it and pretend it's not an issue, but the reality is that it exists and we both know it. Even 2% of constant CPU load for thousands of people is a HUGE waste for an unused feature.

Having said all that, I can still understand that you accept it and don't want to complicate the code (even though implementing that background worker is not that complicated IMO). These are the trade-offs, philosophy, practicalities and decisions you feel are right for you, and I can totally see it. I just want us to be very explicit about the resource waste this decision implies, as the problem is there and will be there tomorrow and in a month, even if nobody else mentions it.

Thanks for your fast responses and the time you spent on this. I really appreciate it! I hope that in the future new circumstances will come up, or your views will change, or something else will happen and this issue will be fixed.

amir20 / dozzle

Background CPU usage while not used #2774