Closed Tester98 closed 7 years ago
That does look like a "bug" of sorts, although it would probably be in the lumberjack library: https://github.com/natefinch/lumberjack
I haven't had a chance to look into the code but my guess is that it's not properly synchronizing rotations if concurrent requests hit at just the right moment. /cc @natefinch - have you seen this before?
Bugs in lumberjack!? unpossible! :)
But seriously... lumberjack has a lock around every call to write, so multiple goroutines isn't a problem.
The only way this could happen is if there are two processes writing to the same log file. Since keeping track of the log file size is done in-process (by just counting the number of bytes written to it), if you had two processes writing to the same log file, then they wouldn't know about each other's writes, and could go over the max.
A single copy of Caddy was running on the node. NOTE: `tail` should not be affecting it, AFAIK.

@natefinch What about multiple instances of lumberjack.Logger? @Tester98 didn't post the entire Caddyfile (tsk, tsk) so I'm just guessing... would multiple instances with the same output file configuration act the same as running it from two processes?
A separate error log is there, but it writes to a different file:

```
errors {
    log error.log {
        size 100 # Rotate after 100 MB
        age 30   # Keep rotated files for 30 days
        keep 10  # Keep at most 5 log files
    }
}
```
Yes, multiple lumberjack.Loggers pointing at the same log file would have the same problem, since the size tracking is internal to each Logger value.
@Tester98 Do you only have those two logs in your whole Caddyfile? Do you have multiple site addresses for the server block? Can I see your entire Caddyfile please?
two Loggers pointing at different logs should be no problem... there's no global state in the package.
Yes, only 2 log lines are there. I cannot give the entire config file, as I am just the webserver admin (with 2 more web admins in different shifts) maintaining the thing, and anything that I share goes through at least 2 managers' approval. Hope you can understand. BTW, a `tail` process was running.
The only way lumberjack should go over its quota is if something else writes to the file. Tail would be fine, since it doesn't write.
The server config was also reloaded by sending the USR1 signal, if that helps.
How many site addresses are defined in the Caddyfile for the server block that uses the log directive?
4 site addresses in 1 block: xyz.com, www.xyz.com, https://xyz.com and https://www.xyz.com
@Tester98 Bingo. Strip that down to just 1 address and hit it again (the https and http variants are redundant anyway if you're using auto HTTPS). (See, this is why the whole Caddyfile is necessary.)
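For the suggested test, the idea would be to reduce the block to a single address so only one log writer can exist. A sketch, using the placeholder xyz.com addresses from this thread and mirroring the rotation settings shown in the errors block above (the exact rotation sub-directives may differ in your Caddy version):

```
# Before: four addresses share one server block, so multiple log
# writers may all point at access.log (the hypothesis under test):
xyz.com, www.xyz.com, https://xyz.com, https://www.xyz.com {
    log access.log {
        size 100
        age 30
        keep 10
    }
}

# After (for the test only): a single address, hence a single writer:
xyz.com {
    log access.log {
        size 100
        age 30
        keep 10
    }
}
```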
> anything that I share goes through at least 2 managers' approval
Sorry about that. :-/ This is one reason why I'm considering a paid option for companies who want private support so you won't have to jump through these hoops.
By the way, the point is that if stripping the Caddyfile down to just one site address (just for your test -- hit it with a lot of requests in this timeframe) fixes the problem consistently, then I have a pretty good idea for a fix.
@natefinch Thanks for chiming in. :+1:
This is a production system, can't strip down.
@Tester98 You could test it locally? ;)
Shift change is underway. I had time to talk to the other admin, and he informed me that goaccess (https://goaccess.io/) was also running at that point in time, so it may not be Caddy's fault. I have gone through the goaccess source code at https://github.com/allinurl/goaccess/blob/e42ac72965b419dafe736d7a4c75226471209a47/src/error.c line 100, and it does seem to open the log file in append or write mode, so 2 writers are now confirmed. This can be closed. I will also report it to goaccess so they can fix the file mode, since goaccess is an analysis tool and should never write to the access log, and thus should never open the file in write mode.
Nice detective work. That could certainly be the problem. Thanks for looking into it!
@Tester98 goaccess does not open the log in write mode nor append mode. It does it in read mode. Here's where it attempts to open it.
I have a hunch about what is happening if it is a Caddy issue, but I need @Tester98 to try an experiment locally before I invest the time and complexity in a fix.
Since we are out of ideas, we can keep it closed, but will reopen if it occurs again or someone else reports a similar issue. We will also test it internally (by setting the log file to rotate at 1 MB and firing queries) and will report back if something fails.
> Since we are out of ideas
What? I'm not out of ideas! Please try it locally (slam it with requests; you can use a tool like vegeta or wrk), then remove all but one site address from the Caddyfile and do it again. If it doesn't happen, I know a fix.
We use sniper (https://github.com/btfak/sniper) for quick load testing; however, the bug never occurred. A discussion with the webmaster did give this information: he was on the console (the server is located on-premises), monitoring access.log. Suddenly the access log stopped scrolling, and he found the log file had rotated. He tailed the new file, but saw no logs scrolling for roughly a minute; surprised, he then saw another access.log created. I have dumped all this so the authors can dig into the event. Maybe the logs could not be added to the newly created file and were instead appended to the old file. I hope the devs can figure something out.
@Tester98 I've pushed a possible fix in 04da9c7374ea07dc19f9a42cd2696ed5c45cfffb to the combinedrollers branch. Can you please build from that branch and see if it resolves the rotation issue?
@Tester98 Ping. :) Does the patch I submitted work for you?
We tested the patch. The bug has not repeated itself in the last 10 days, so the bug can be closed.
Excellent, thank you! We'll merge it in for the next release.
1. What version of Caddy are you running (`caddy -version`)? Current release
2. What are you trying to do?
Access logs rotation
3. What is your entire Caddyfile?
4. How did you run Caddy (give the full command and describe the execution environment)?
5. What did you expect to see?
Log file created, trimmed at the proper size, and new log files created. However, a log file of greater size was created, then a small file was created. If this bug triggers in production, previous access files could randomly get deleted as the keep limit is reached.
6. What did you see instead (give full error messages and/or log)?
Wrong-sized file; see `ls -lh access*`:

```
-rw-r--r--  1 root  wheel  100M Jan 18 06:56 access-2017-01-17T09-01-49.164.log
-rw-r--r--  1 root  wheel  106M Jan 18 17:58 access-2017-01-18T17-39-41.106.log
-rw-r--r--  1 root  wheel  344B Jan 18 17:40 access-2017-01-18T17-45-42.329.log
-rw-r--r--  1 root  wheel   70K Jan 18 17:41 access.log
```
7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?
Firing a lot of requests