JulioQc opened this issue 8 years ago
The error looks like the journal was corrupted when it ran out of disk space. It might be possible to delete the latest journal segment file while Graylog is stopped, but I'm afraid the messages in that segment cannot be recovered.
From the code side, I don't think we can sensibly recover from this. It is really important not to run out of disk space for journaling, just as it is with databases.
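Finding the latest segment file mentioned above can be sketched as follows. This is a sketch, not an official procedure: the real journal path (`/var/opt/graylog/data/journal` per this thread) and the zero-padded segment file names are assumptions, and the demo runs against a scratch directory rather than a live journal.

```shell
# Demo on a scratch directory standing in for the Graylog journal;
# on a real system, stop Graylog first and point JOURNAL at the actual path.
JOURNAL=$(mktemp -d)
touch -d '1 hour ago' "$JOURNAL"/00000000000000000000.log   # older segment
touch "$JOURNAL"/00000000000001000000.log                   # newest segment
# Newest-first sort by mtime; the first entry is the latest segment,
# i.e. the one most likely truncated when the disk filled up.
LATEST=$(ls -t "$JOURNAL"/*.log | head -1)
echo "$LATEST"
```

Because Kafka-style segment names are zero-padded offsets, sorting by name would give the same answer as sorting by mtime here.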
Yes, I agree with you, and I've also noticed the warning Graylog shows when the disk gets near maximum capacity. However, a mechanism to recover from such events would be very helpful (although clearing "/var/opt/graylog/data/journal/*" and restarting Graylog isn't that hard either).
I have just run into the same issue.
Sorry if the following question is silly, but:
I had noticed Graylog wrote a bunch of messages into the journal after I cleaned up some space, so I don't know which segment was faulty.
I have moved the journal files out of the way instead of deleting them. Now my question is:
Can I stop Graylog and put the files back in place one by one to see which one is the culprit?
From my understanding of the journal, it won't allow this, since the order in which messages arrive is important. (See slide 5 here: http://www.slideshare.net/Graylog/graylog-engineering-design-your-architecture)
You basically have to flush it all out and restart it.
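Flushing the journal entirely, as suggested above, can be sketched like this; moving the segments aside (as the earlier commenter did) is less destructive than deleting them. The journal path is an assumption from this thread, and the demo uses a scratch directory instead of a live install.

```shell
# Demo on scratch directories; the real journal on an Omnibus install
# lives under /var/opt/graylog/data/journal.
JOURNAL=$(mktemp -d)
ASIDE=$(mktemp -d)
touch "$JOURNAL"/00000000000000000000.log "$JOURNAL"/00000000000000000000.index
# (On a real system, stop Graylog before touching these files.)
mv "$JOURNAL"/* "$ASIDE"/   # move everything aside rather than deleting
ls -A "$JOURNAL"            # now empty; Graylog creates a fresh journal on start
```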
Try stopping the Graylog services, deleting just the .index files while keeping the .log files, and restarting Graylog. Worked for me.
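The tip above (delete only the .index files, keep the .log segments) can be sketched as below. This is a demo against a scratch directory; the real path and service commands are assumptions and are only shown in comments.

```shell
# Scratch directory standing in for the journal directory.
JOURNAL=$(mktemp -d)
touch "$JOURNAL"/00000000000000000000.log "$JOURNAL"/00000000000000000000.index
# On a real system (paths/commands assumed, adjust to your install):
#   sudo systemctl stop graylog-server
#   sudo rm /var/opt/graylog/data/journal/messagejournal-0/*.index
#   sudo systemctl start graylog-server
rm "$JOURNAL"/*.index   # delete only the index files
ls "$JOURNAL"           # the .log segments survive; indexes get rebuilt
```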
Same situation. In my case, I found success by:

1. stopping Graylog
2. backing up the `journal/` directory
3. deleting the `.index` files
4. deleting the single oldest `.log` file, since this would have been present when the corruption occurred

@jimbocoder's solution did it for me. Thank you.
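The recovery steps above (back up the journal, delete the .index files, delete the single oldest .log segment) can be sketched as follows. The segment names are assumptions, and the demo runs on scratch directories rather than a live journal.

```shell
# Scratch directory standing in for the journal; two segments, one older.
JOURNAL=$(mktemp -d)
touch -d '2 hours ago' "$JOURNAL"/00000000000000000000.log
touch "$JOURNAL"/00000000000001000000.log "$JOURNAL"/00000000000001000000.index
# Step 2: back up the journal directory before touching anything.
BACKUP=$(mktemp -d)
cp -a "$JOURNAL"/. "$BACKUP"/
# Step 3: delete the .index files.
rm "$JOURNAL"/*.index
# Step 4: delete the single oldest .log file (ls -t sorts newest first,
# so the last entry is the oldest segment).
OLDEST=$(ls -t "$JOURNAL"/*.log | tail -1)
rm "$OLDEST"
```

Keeping the backup around means the whole journal can be restored if Graylog still refuses to start afterwards.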
@jimbocoder should we delete the files graylog2-committed-read-offset and recovery-point-offset-checkpoint? And should we delete all .log files?
@kieulam141 I can't say for sure. Whatever you do, make sure you do the backup step, and it should be okay in the end.
@kieulam141 did you have to delete them to get it working?
@BrijToSuccess I don't recall at this point, but I'm pretty sure the least destructive strategy is in step 4:

4. **delete the single oldest .log file**, since this would have been present when the corruption occurred

(instead of deleting all the .log files.)
Expected Behavior
Disk journal should resume processing queued messages
Current Behavior
Processing was paused and messages kept queuing in disk journal.
Possible Solution
unknown
Steps to Reproduce (for bugs)
Context
After the disk filled to 100% (a misconfiguration on my part) and the issue was fixed, the server was restarted, but disk journal message processing did not resume. I should add that the web and API interfaces stopped responding while the disk was full because MongoDB could not launch.
By looking at the server logs, I get this error:
Your Environment