Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Disk journal stopped working after disk filled #2613

Open · JulioQc opened this issue 8 years ago

JulioQc commented 8 years ago

Expected Behavior

Disk journal should resume processing queued messages

Current Behavior

Processing was paused and messages kept queuing in the disk journal.

Possible Solution

unknown

Steps to Reproduce (for bugs)

  1. Let the disk fill to 100%
  2. Clear some space manually via SSH terminal
  3. Restart server
  4. Check Processing Status and Disk Journal under System -> Node -> Details and check the server logs (a scripted status check is sketched after this list).
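
Not part of the original report, but for anyone who would rather check this from a script than through the web UI, here is a minimal sketch against the Graylog REST API. The base URL, credentials, the /system/journal endpoint, and the response field names are assumptions based on the 2.x API and may differ on your installation.

```python
# Minimal journal status check, assuming a Graylog 2.x REST API.
# API_URL and AUTH are placeholders; adjust them for your node.
import requests

API_URL = "http://127.0.0.1:12900"  # assumption: default Graylog 2.x API address
AUTH = ("admin", "admin")           # assumption: replace with real credentials

resp = requests.get(
    f"{API_URL}/system/journal",
    auth=AUTH,
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()
journal = resp.json()

# Field names come from the 2.x API and may vary; .get() avoids KeyErrors.
print("journal enabled:     ", journal.get("enabled"))
print("uncommitted entries: ", journal.get("uncommitted_journal_entries"))
print("journal size (bytes):", journal.get("journal_size"))
print("number of segments:  ", journal.get("number_of_segments"))
```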

Context

After the disk filled to 100% (a misconfiguration on my part) and the underlying problem was fixed, the server was restarted, but disk journal message processing did not resume. I should add that the web and API interfaces also stopped responding while the disk was full, because MongoDB could not start.

Looking at the server logs, I see this error:

2016-08-04_14:20:54.49828 2016-08-04 10:20:54,494 ERROR: com.google.common.util.concurrent.ServiceManager - Service JournalReader [FAILED] has failed in the RUNNING state.
2016-08-04_14:20:54.49830 java.lang.IllegalStateException: Invalid message size: -897035486
2016-08-04_14:20:54.49830       at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:127) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.translateOffset(LogSegment.scala:105) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.read(LogSegment.scala:147) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at kafka.log.Log.read(Log.scala:443) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:462) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:435) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.JournalReader.run(JournalReader.java:136) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:60) [graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.Callables$3.run(Callables.java:100) [graylog.jar:?]
2016-08-04_14:20:54.49832       at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

Your Environment

kroepke commented 8 years ago

The error looks like the journal was corrupted when the disk ran out of space. It might be possible to delete the latest journal segment file while Graylog is stopped, but I'm afraid the messages in that segment cannot be recovered.

From the code side, I don't think we can sensibly recover from this. As with databases, it is really important not to run out of disk space for journaling.
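
Not kroepke's code, just a rough sketch of that suggestion: with graylog-server stopped, move the latest journal segment (and its matching .index file) out of the way instead of deleting it outright. The journal path and quarantine directory are assumptions; adjust them to whatever message_journal_dir points to on your node.

```python
# Rough sketch: quarantine the newest journal segment.
# Run only while graylog-server is stopped. Paths below are assumptions.
import shutil
from pathlib import Path

JOURNAL_DIR = Path("/var/opt/graylog/data/journal")       # assumption: message_journal_dir
QUARANTINE = Path("/var/tmp/graylog-journal-quarantine")  # assumption: any scratch location
QUARANTINE.mkdir(parents=True, exist_ok=True)

# Segment file names are zero-padded base offsets, so a name sort gives
# chronological order; the last one is the latest (active) segment.
segments = sorted(JOURNAL_DIR.rglob("*.log"), key=lambda p: p.name)
if not segments:
    raise SystemExit("no journal segment files found")

newest = segments[-1]
for path in (newest, newest.with_suffix(".index")):
    if path.exists():
        print(f"moving {path} -> {QUARANTINE / path.name}")
        shutil.move(str(path), str(QUARANTINE / path.name))
```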

JulioQc commented 8 years ago

Yes, I agree, and I also noticed the warning Graylog shows when the disk reaches near-maximum capacity. However, a mechanism to recover from such events would be very helpful (although clearing "/var/opt/graylog/data/journal/*" and restarting Graylog isn't that hard either).

AVGP commented 8 years ago

I have just run into the same issue.

Sorry if the following question is silly, but:

I had noticed Graylog wrote a bunch of messages into the journal after I cleaned up some space, so I don't know which segment was faulty.

I have moved the journal files out of the way instead of deleting them. Now my question is:

Can I stop Graylog and put the files back in place one by one to see which one is the culprit?

JulioQc commented 8 years ago

From my understanding of the journal, it won't allow this, since the order in which messages arrive is important (see slide 5 here: http://www.slideshare.net/Graylog/graylog-engineering-design-your-architecture).

You basically have to flush it all out and restart it.

giz83 commented 7 years ago

Try stopping the Graylog services, deleting just the .index files (keep the .log files), and restarting Graylog. Worked for me.
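
That tip, sketched as a script (mine, not giz83's). The journal path is an assumption and graylog-server must already be stopped; only the Kafka .index files are removed, and the indexes should be rebuilt from the remaining .log segments on the next start.

```python
# Sketch of the .index-only cleanup; run with graylog-server stopped.
from pathlib import Path

JOURNAL_DIR = Path("/var/opt/graylog/data/journal")  # assumption: message_journal_dir

for index_file in sorted(JOURNAL_DIR.rglob("*.index")):
    print(f"deleting {index_file}")
    index_file.unlink()
```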

jimbocoder commented 7 years ago

Same situation. In my case, I found success by:

  1. stop graylog-server
  2. backup the journal/ directory
  3. delete all the .index files
  4. delete the single oldest .log file, since this would have been present when the corruption occurred
  5. start the server and have fun watching your cluster churn through a bajillion queued messages (steps 2-4 are sketched in code after this list)
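
A scripted sketch of steps 2-4 (my code, not jimbocoder's). The journal and backup paths are assumptions, graylog-server must already be stopped, and the backup directory must not exist yet (shutil.copytree creates it).

```python
# Sketch of steps 2-4 above; run only while graylog-server is stopped.
import shutil
from pathlib import Path

JOURNAL_DIR = Path("/var/opt/graylog/data/journal")   # assumption: message_journal_dir
BACKUP_DIR = Path("/var/tmp/graylog-journal-backup")  # assumption: must not exist yet

# Step 2: back up the whole journal directory before touching anything.
shutil.copytree(JOURNAL_DIR, BACKUP_DIR)

# Step 3: delete all the .index files (same idea as the previous comment).
for index_file in JOURNAL_DIR.rglob("*.index"):
    index_file.unlink()

# Step 4: delete the single oldest .log segment, the one that was already
# present when the disk filled up and the corruption occurred.
segments = sorted(JOURNAL_DIR.rglob("*.log"), key=lambda p: p.name)
if segments:
    print(f"deleting oldest segment {segments[0]}")
    segments[0].unlink()
```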

raphaelsalomao3 commented 7 years ago

@jimbocoder 's solution did it for me. Thank you.

kieulam141 commented 6 years ago

@jimbocoder should we also delete the graylog2-committed-read-offset and recovery-point-offset-checkpoint files, and should we delete all of the .log files?

jimbocoder commented 6 years ago

@kieulam141 I can't say for sure. Whatever you do, make sure you do the backup step, and it should be okay in the end.

BrijToSuccess commented 6 years ago

@kieulam141 did you have to delete them to get it working?

jimbocoder commented 6 years ago

@BrijToSuccess I don't recall at this point, but I'm pretty sure the least destructive strategy is in step 4:

4. **delete the single oldest .log file**, since this would have been present when the corruption occurred

(instead of deleting all the .log files.)