amir20 / dozzle

Realtime log viewer for Docker containers.
https://dozzle.dev/
MIT License

unable to load old logs if there was a log burst #3122

Closed · FrancYescO closed 1 month ago

FrancYescO commented 1 month ago

How is Dozzle deployed?

Standalone Deployment

Dozzle version

8.0.5

Command used to run Dozzle

just the README example.

Describe the performance issue you are seeing

ref to https://github.com/amir20/dozzle/issues/3113#issuecomment-2230974496

If a container gets a log burst (normally the container prints 100 messages/minute, but for some timeframe it bursts to 10,000 per second), you become unable to load the older logs.

I just got a container that printed 40k messages within the same second, and trying to scroll up to load the past just crashes the web UI, as in the referenced comment.

PS: the exceeded call stack is the edge case; normally you just get a BIG freeze of the web UI if the loadMore response contains too many messages (a 100 kB response from /logs is enough to notice a nice freeze).

How many containers, hosts, agents, etc. are you using?

30

Screenshots, videos, or logs

No response

amir20 commented 1 month ago

Seems like this works as designed only when the volume of logs is linear. With a big spike like yours, it breaks. I have to rethink this.

The big issue is that the Docker API doesn't allow asking for the last X logs before a certain date, so pagination has to be handled by Dozzle.

Here are some ideas I had:

  1. Load more on the API side until there is enough to return to the UI.
  2. Stream to the UI. Feels a bit hacky, as the page would jump around while scrolling.
  3. Guess how much to load (similar to now), but only return the last X logs. There is still some performance cost to loading the logs, but it would be a big improvement since the excess would never actually be returned to the UI.

Any other ideas?

I am leaning towards option 3, as there is no clean way to do pagination with the Docker API. It would only incur a big hit with bursts of logs, but the UI would never freeze since it wouldn't receive a large payload.
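
In rough pseudocode (a hypothetical TypeScript sketch with made-up names, not actual Dozzle code), option 3 would look something like this:

```ts
interface LogEntry {
  timestamp: string; // RFC3339 with nanoseconds, as Docker emits
  message: string;
}

// Hypothetical sketch of option 3: over-fetch by a guessed time window,
// but cap what is actually returned to the UI.
function loadOlderLogs(
  fetchWindow: (since: string, until: string) => LogEntry[], // wraps the Docker API
  since: string,
  until: string,
  maxReturned: number,
): LogEntry[] {
  // Docker can only filter by a time window, so a burst may land here
  // in full; that cost is unavoidable.
  const all = fetchWindow(since, until);
  // The UI only ever receives the last `maxReturned` entries, so a
  // 40k burst can no longer freeze the page.
  return all.slice(-maxReturned);
}
```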

FrancYescO commented 1 month ago

1) I didn't get this one right at first, but it would be worse: if you try to load 40k messages on first load (even per line rather than in bulk), you will instantly break the UI with too much data.
2) We can probably find a solution for the scroll, but I see the same limitation as 1)... by the way, how is the number of messages to load on the first stream calculated now?
3) Maybe this is the simplest implementation: just add a &max=Y to the /logs API, and if you are going to return more than Y messages, crop to the last Y. Obviously there are two downsides, as you noticed.

I've also just noticed that 3) is what Portainer is actually doing: it adds a &tail=100 value to the request, which can be edited at runtime in the UI to fetch more logs (maybe this could be put in the Dozzle UI settings page).

PS: as already noted in https://github.com/amir20/dozzle/issues/3113#issuecomment-2230197694 for the /logs endpoint, in both cases, whether using /stream (I have never worked with EventSource, so I may be saying something wrong) or the /logs endpoint, I see different ways of generating/handling the payloads. I think just avoiding mixing text and JSON in the payloads could give a big performance improvement on the UI side.

[screenshots: container logs payload and general info]
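
To illustrate the point (a hypothetical sketch, not Dozzle's actual parsing code): if every line of a payload is guaranteed to be JSON, the client needs a single predictable code path, whereas a mixed text/JSON payload forces a per-line guess:

```ts
interface LogEvent {
  ts: string;
  m: string;
}

// Uniform payload: every line is JSON, one predictable code path.
function parseUniform(line: string): LogEvent {
  return JSON.parse(line) as LogEvent;
}

// Mixed payload: every line must be sniffed, and the exception path
// is comparatively expensive when thousands of lines arrive at once.
function parseMixed(line: string): LogEvent | string {
  try {
    return JSON.parse(line) as LogEvent;
  } catch {
    return line; // plain-text line
  }
}
```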

FrancYescO commented 1 month ago

Just another point: in my case the 40k log spam in the same second was exactly the same message. Surely this won't cover all cases, but maybe we can think of something that says "message repeated X times" instead of printing them all (I have surely seen this kind of implementation in some other log tool...).

amir20 commented 1 month ago

I've also just noticed that 3) is what Portainer is actually doing: it adds a &tail=100 value to the request, which can be edited at runtime in the UI to fetch more logs (maybe this could be put in the Dozzle UI settings page).

That's good info, thanks! That's what I was thinking too. And I just confirmed that --until and --tail don't work together: docker logs --until 42m --tail 10 -t <id> returns nothing. So this would have to be handled by Dozzle.

I see different ways of generating/handling the payloads. I think just avoiding mixing text and JSON in the payloads could give a big performance improvement on the UI side.

I didn't understand this point. Can you try again? EventSource is just a long poll over HTTP; each line can be anything. In streaming mode, each line is generated in real time, which explains what you see there. With /logs, it is fetching a period of time in the logs, which is why each line is valid JSON.

What were you suggesting would be better? I don't think I understood that.

Just another point: in my case the 40k log spam in the same second was exactly the same message. Surely this won't cover all cases, but maybe we can think of something that says "message repeated X times" instead of printing them all (I have surely seen this kind of implementation in some other log tool...).

I didn't know they were the same. Are your logs identical? Can you share an example? I like the idea of saying "message repeated X times". I think it's a small fraction of people who would see it, though. But now that I know your use case, I worry that the timestamps are all so close to each other. All the pagination is done via timestamps with nanosecond precision. For any of this pagination to work, each log would need to differ by at least one nanosecond, otherwise entries could be skipped while scrolling.
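
To make that concrete (a hypothetical sketch with illustrative names, not Dozzle internals): the last rendered timestamp acts as an exclusive cursor, so any entries sharing that exact nanosecond would silently vanish:

```ts
// JS Date has only millisecond precision, so nanosecond timestamps are
// compared as BigInt here.
function toNanos(rfc3339: string): bigint {
  // e.g. "2024-07-15T22:34:52.999572303Z"
  const frac = rfc3339.split(".")[1] ?? "Z";
  const wholeSeconds = BigInt(Math.floor(Date.parse(rfc3339.replace(/\.\d+Z$/, "Z")) / 1000));
  const nanos = BigInt((frac.replace("Z", "") + "000000000").slice(0, 9));
  return wholeSeconds * 1_000_000_000n + nanos;
}

// "Give me entries strictly older than the cursor": anything sharing the
// cursor's exact nanosecond timestamp is excluded, which is why two logs
// in the same nanosecond could be skipped while scrolling.
function olderThan(entries: { timestamp: string }[], cursor: string) {
  const limit = toNanos(cursor);
  return entries.filter((e) => toNanos(e.timestamp) < limit);
}
```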

amir20 commented 1 month ago

On a side note, I am going to implement option 3 next week and see how it looks.

FrancYescO commented 1 month ago

Here is a sample:

[...]
2024-07-15T22:34:52.999572303Z Error, skipping parsing row
2024-07-15T22:34:52.999579798Z Error, skipping parsing row
2024-07-15T22:34:52.999606715Z Error, skipping parsing row
2024-07-15T22:34:52.999620796Z Error, skipping parsing row
2024-07-15T22:34:52.999628883Z Error, skipping parsing row
2024-07-15T22:34:52.999636702Z Error, skipping parsing row
2024-07-15T22:34:52.999644630Z Error, skipping parsing row
2024-07-15T22:34:52.999652293Z Error, skipping parsing row
2024-07-15T22:34:52.999661593Z Error, skipping parsing row
2024-07-15T22:34:52.999670313Z Error, skipping parsing row
2024-07-15T22:34:52.999678201Z Error, skipping parsing row
2024-07-15T22:34:52.999686081Z Error, skipping parsing row
2024-07-15T22:34:52.999695177Z Error, skipping parsing row
2024-07-15T22:34:52.999705001Z Error, skipping parsing row
2024-07-15T22:34:52.999712662Z Error, skipping parsing row
2024-07-15T22:34:52.999720363Z Error, skipping parsing row
2024-07-15T22:34:52.999736864Z Error, skipping parsing row
2024-07-15T22:34:52.999747717Z Error, skipping parsing row
2024-07-15T22:34:52.999755778Z Error, skipping parsing row
2024-07-15T22:34:52.999763733Z Error, skipping parsing row
2024-07-15T22:34:52.999771456Z Error, skipping parsing row
2024-07-15T22:34:52.999779524Z Error, skipping parsing row
2024-07-15T22:34:52.999791064Z Error, skipping parsing row
2024-07-15T22:34:52.999799920Z Error, skipping parsing row
2024-07-15T22:34:52.999807428Z Error, skipping parsing row
2024-07-15T22:34:52.999815844Z Error, skipping parsing row
2024-07-15T22:34:52.999823579Z Error, skipping parsing row
2024-07-15T22:34:52.999831506Z Error, skipping parsing row
2024-07-15T22:34:52.999841101Z Error, skipping parsing row
2024-07-15T22:34:52.999849561Z Error, skipping parsing row
2024-07-15T22:34:52.999856576Z Error, skipping parsing row
2024-07-15T22:34:52.999873656Z Error, skipping parsing row
2024-07-15T22:34:52.999882150Z Error, skipping parsing row
2024-07-15T22:34:55.813275038Z process complete
2024-07-15T22:34:56.890768483Z Started new process
2024-07-15T22:34:57.001045216Z process complete
2024-07-15T22:34:57.259727996Z Started new process
2024-07-15T22:34:58.448882218Z Error, skipping parsing row
2024-07-15T22:34:59.362132074Z Error, skipping parsing row
2024-07-15T22:34:59.489544057Z Error, skipping parsing row
2024-07-15T22:34:59.878302292Z Error, skipping parsing row
2024-07-15T22:34:59.969021277Z Error, skipping parsing row
2024-07-15T22:35:00.145349798Z Error, skipping parsing row
2024-07-15T22:35:00.264244440Z Error, skipping parsing row
2024-07-15T22:35:00.370022338Z Error, skipping parsing row
2024-07-15T22:35:00.687207332Z Error, skipping parsing row
2024-07-15T22:35:00.770923882Z Error, skipping parsing row
2024-07-15T22:35:01.032213603Z process complete

On Dozzle this could be converted to something like this (obviously you lose the timestamps of all logs after the first one):

2024-07-15T22:34:52.999572303Z Error, skipping parsing row ** badge saying row repeated 40000 times **
2024-07-15T22:34:55.813275038Z process complete
2024-07-15T22:34:56.890768483Z Started new process
2024-07-15T22:34:57.001045216Z process complete
2024-07-15T22:34:57.259727996Z Started new process
2024-07-15T22:34:58.448882218Z Error, skipping parsing row ** badge saying row repeated 10 times **
2024-07-15T22:35:01.032213603Z process complete
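
A rough sketch of that collapsing (hypothetical code, not anything in Dozzle today):

```ts
interface LogEntry {
  timestamp: string;
  message: string;
}
interface CollapsedEntry extends LogEntry {
  repeated: number;
}

// Collapse runs of consecutive identical messages into one entry with a
// repeat count. Only the first timestamp of each run survives, which is
// exactly the trade-off noted above.
function collapseRepeats(entries: LogEntry[]): CollapsedEntry[] {
  const out: CollapsedEntry[] = [];
  for (const e of entries) {
    const last = out[out.length - 1];
    if (last && last.message === e.message) {
      last.repeated += 1;
    } else {
      out.push({ ...e, repeated: 1 });
    }
  }
  return out;
}
```

The renderer would then show a "repeated N times" badge whenever repeated > 1.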

amir20 commented 1 month ago

That's actually a good idea. Probably do both: remove duplicates first, then trim. This would only work with simple logs, though. Removing duplicates can be done as a follow-up.

amir20 commented 1 month ago

I tried playing around with a burst of 10k lines, and it's horrible 🫠. I'll see if I can optimize it. This isn't a common use case, so it will take me a while to figure out where to optimize.

amir20 commented 1 month ago

@FrancYescO https://github.com/amir20/dozzle/pull/3151 should fix it partially: scrolling up now limits the size of the loaded logs. I'll merge this into master. There are a few other improvements for agents and performance.

I probably won't do deduping for a while as that requires a bit more work.

FrancYescO commented 1 month ago

Nice, going to test soon. I just have to say that 500 is a pretty important value to leave hardcoded for anyone managing a lot of "previous" logs.

amir20 commented 1 month ago

Nice, going to test soon.

👍🏼

I just have to say that 500 is a pretty important value to leave hardcoded for anyone managing a lot of "previous" logs.

That's right. I think the number will be dynamic in the future. I am pretty opinionated that this number should be hidden from users; it might change depending on resources or configuration. But if you remember, this number used to be configurable, and I saw people set it to 10,000, which didn't make sense because it caused Dozzle to load too much data. So I'm going to keep this setting private for now. Dozzle should be smart enough to just adjust.

FrancYescO commented 1 month ago

Just to confirm that the workaround is working: now I can happily scroll up indefinitely through my 40k identical log rows 👯

PS: maybe it would be useful to add a "you've reached the start of logs" marker. Currently, when you reach the top and scroll up again, it tries to load something but the API keeps returning an empty response ;)
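
For what it's worth, the UI-side check could be as simple as this (a hypothetical sketch; fetchOlder stands in for the real loadMore request):

```ts
// Hypothetical sketch: stop issuing loadMore requests once the API has
// returned an empty page, and render a "start of logs" marker instead.
let reachedStart = false;

async function loadMore(fetchOlder: () => Promise<unknown[]>): Promise<void> {
  if (reachedStart) return; // nothing older exists; skip the request
  const older = await fetchOlder();
  if (older.length === 0) {
    reachedStart = true; // show "you've reached the start of logs"
    return;
  }
  // ...prepend `older` to the rendered log list...
}
```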

amir20 commented 1 month ago

That's great. I was hoping it would work.

I am still fixing perf issues, so once everything is done maybe I can look at little things like "you've reached the start".

amir20 commented 1 month ago

I think this has been fixed now. I'll probably come back eventually and do other fixes but they will be new features.

FrancYescO commented 3 weeks ago

Hello, just want to avoid opening a new issue, but this problem is much worse when the log burst comes from a live container that you are following in realtime. To give a real case: I'm using sftpgo with logs at the debug level. When a client connects to the SFTP server and downloads folders, a lot of logs are generated and the Dozzle UI freezes completely... I was going crazy because Dozzle kept freezing "randomly", then I realized it was happening only on that container, which has somewhat complex JSON logs.

I'm pretty sure the culprit is the parseMessage func, but I don't know how it can be optimized to avoid locking the main presentation thread and freezing the whole browser tab while parsing.

amir20 commented 3 weeks ago

Thanks for the follow up.

How many logs does sftpgo produce? I think regardless of whether it is parseMessage, at some point, with any solution, 40,000 logs will freeze the browser.

Perhaps a better solution is to think outside the box. For example, before we were thinking of saying "repeated (40000) times". Is it the same case here that they are the same message?

I wonder if there is a way to make parseMessage run outside of the UI thread, e.g. a service worker or something else.
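
For example, a dedicated Web Worker could take the parsing off the main thread (a hypothetical sketch; the file name, message shape, and renderLogs are all invented):

```ts
// parse.worker.ts (hypothetical file): JSON.parse runs off the main
// thread, so a burst of lines can no longer freeze the tab.
self.onmessage = (event: MessageEvent<string[]>) => {
  const parsed = event.data.map((line) => JSON.parse(line));
  postMessage(parsed); // structured-clone the result back to the UI thread
};
```

```ts
// Main thread: batch raw lines to the worker and render on reply.
declare function renderLogs(entries: unknown[]): void; // hypothetical UI hook
declare const rawLines: string[];                      // lines from EventSource

const worker = new Worker(new URL("./parse.worker.ts", import.meta.url));
worker.onmessage = (event: MessageEvent<unknown[]>) => renderLogs(event.data);
worker.postMessage(rawLines);
```

The caveat is that the structured clone back to the UI thread isn't free either, so this moves the stall off the main thread rather than eliminating the work.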

FrancYescO commented 3 weeks ago

Is it the same case here that they are the same message?

Obviously no, the 40k repeated rows were just another specific case of mine; here we just have some verbose "genuine" logs. Here is a sample:

[screenshot: sample of verbose sftpgo debug logs]

amir20 commented 3 weeks ago

About how many of those are you getting? It looks like you are getting a lot in a second. Can you take a guess on the volume?

FrancYescO commented 3 weeks ago

Time             | Line count
2024-08-22T20:52 | 2377
2024-08-22T20:53 | 3484
2024-08-22T20:54 | 124
2024-08-22T20:55 | 3608
2024-08-22T20:56 | 3608
2024-08-22T20:57 | 3608
2024-08-22T20:58 | 2293
2024-08-22T20:59 | 1320
2024-08-22T21:00 | 3633
2024-08-22T21:01 | 3608
2024-08-22T21:02 | 3608
2024-08-22T21:04 | 3608
2024-08-22T21:05 | 3608
2024-08-22T21:06 | 3608
2024-08-22T21:07 | 3484
2024-08-22T21:08 | 129
2024-08-22T21:09 | 3608
2024-08-22T21:10 | 3608
2024-08-22T21:11 | 3608
2024-08-22T21:12 | 1636
2024-08-22T21:13 | 1972
2024-08-22T21:14 | 3608
2024-08-22T21:15 | 3608
2024-08-22T21:16 | 3608
2024-08-22T21:18 | 3613
2024-08-22T21:19 | 3608
2024-08-22T21:20 | 3608
2024-08-22T21:21 | 3484
2024-08-22T21:22 | 124
2024-08-22T21:23 | 3608
2024-08-22T21:24 | 3608
2024-08-22T21:25 | 3608
2024-08-22T21:26 | 2672
2024-08-22T21:27 | 936
2024-08-22T21:28 | 3613
2024-08-22T21:29 | 3608
2024-08-22T21:30 | 3608
2024-08-22T21:31 | 276
2024-08-22T21:32 | 3332
2024-08-22T21:33 | 3608
2024-08-22T21:34 | 3608
2024-08-22T21:35 | 3484
2024-08-22T21:36 | 124
2024-08-22T21:37 | 3608
2024-08-22T21:38 | 3613
2024-08-22T21:39 | 3608
2024-08-22T21:40 | 2896
2024-08-22T21:41 | 712
2024-08-22T21:42 | 3608

amir20 commented 3 weeks ago

So almost 3.5k a minute. That's a lot. Have to think about this one.

FrancYescO commented 3 weeks ago

Average Lines Per Minute: 2921
Average Lines Per Second: 351