Closed by mrdavidlaing 11 years ago
This issue triggers and thus supersedes #122.
I've confirmed that there are three copies of the same message in some of the early morning hours from log replays.
Well, at least this is a known issue... https://logstash.jira.com/browse/LOGSTASH-429
Moved to Icebox due to upstream dependency and low priority (which might need to be increased again once the number of shippers and handled environments increases as well).
Re-prioritised for next milestone; reboots happen, and with this issue effectively causing double logs to end up in ElasticSearch, reports become unusable.
We're going to have to figure out a way to escalate this issue; either by fixing ourselves or paying a core maintainer to fix it.
I'm not familiar with cygwin, but would running logstash under cygwin accidentally work around any issues, given the way cygwin handles network paths and virtualizes the file system?
Hard to say, since we're monitoring log files on windows file shares.
This seems potentially relevant: https://github.com/jordansissel/ruby-filewatch/pull/19
I'll see if I can compile a JAR file with it for testing.
@dpb587 - That looks very promising.
Let us know how the JAR recompilation works.
Not sure if I understand the functionality correctly, but the `uuid` function that's in logstash 1.2.0 might help prevent duplicates in these kinds of scenarios - http://logstash.net/docs/1.2.0.beta2/filters/uuid
@mrdavidlaing - I think the purpose of the uuid function is the opposite of what's desired here: it doesn't deduplicate inserted duplicates by folding them into one (or skipping them); instead it adds a UUID to each message, so every copy becomes unique.
The docs are a bit confusing indeed, but looking at the underlying commit https://github.com/bobtfish/logstash/commit/1c85e2e79fe19680bd998aaf354b524ff04309c6#L0R44 seems to confirm that it lives up to its name and just replaces or amends the configured fields with `SecureRandom.uuid` in order to make them unique in comparison to existing sibling messages.
@sopel - thanks for tracking down the internals of uuid - I agree, it's not what we want. If anything, we probably want an MD5-style hash of all the data to be used for the _id.
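To illustrate the idea (a hypothetical Java sketch, not anything that exists in logstash), a deterministic _id derived from the raw line would make re-shipped duplicates overwrite the same Elasticsearch document instead of creating new ones:

```java
// Hypothetical sketch: hash the raw log line into a stable document _id so that
// shipping the same line twice targets the same Elasticsearch document.
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class EventId {
    public static String idFor(String rawLogLine) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(rawLogLine.getBytes(StandardCharsets.UTF_8));
        // Hex-encode the 128-bit digest; identical lines always map to the same _id.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        // Duplicate shipments of this line would all resolve to one document.
        System.out.println(idFor("2013-08-27 03:12:05 GET /healthcheck 200"));
    }
}
```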
I spent several hours on this yesterday trying to get this working, but I haven't had any success yet. My inexperience in Ruby, Java, and JRuby is a factor. It's been fairly easy to get a re-compiled logstash JAR with custom code, but the submitted patch seems to be missing some code.
I'm going to send the guy a message to see if I can get further pointers.
Good to hear you can recompile LogStash.
Agree that getting some feedback from bobtfish is the right approach at this stage.
@mrdavidlaing, when you get a chance, would you mind giving the following a try: logstash-1.1.13-monolithic-pr19b1.jar. I tested this on Windows 2008 R2 SP1 with Java SE 1.7.0_25-b17 using a file on a shared network path exported by the local host. This may not fix the nightly log rotation issues, but let's first confirm it better supports us restarting mid-day.
Where the official `logstash-1.1.13-flatjar.jar` writes `.sincedb` lines like:

```
0 0 2 806
```

the patched `logstash-1.1.13-monolithic-pr19b1.jar` writes `.sincedb` lines like:

```
3773355946-103005-65536 0 2 946
```
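For what it's worth, our reading of those lines (an interpretation of the filewatch sincedb layout, not an official spec) is `<file identity> <dev major> <dev minor> <byte offset>`; the patch replaces the first column - a useless `0` inode on Windows in the stock jar - with a real per-file identifier, which is what should let logstash resume rather than re-read:

```java
// Sketch only: parse a .sincedb line under the assumed layout
// "<file identity> <dev major> <dev minor> <byte offset>".
public class SincedbLine {
    final String fileIdentity; // "0" from the stock jar; "3773355946-103005-65536" from the patched jar
    final int devMajor;
    final int devMinor;
    final long byteOffset;     // how far into the file logstash has already shipped

    SincedbLine(String raw) {
        String[] parts = raw.trim().split("\\s+");
        this.fileIdentity = parts[0];
        this.devMajor = Integer.parseInt(parts[1]);
        this.devMinor = Integer.parseInt(parts[2]);
        this.byteOffset = Long.parseLong(parts[3]);
    }

    public static void main(String[] args) {
        SincedbLine stock = new SincedbLine("0 0 2 806");
        SincedbLine patched = new SincedbLine("3773355946-103005-65536 0 2 946");
        System.out.println(stock.fileIdentity + " -> " + patched.fileIdentity);
    }
}
```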
Current repro steps are:
```
wget -O jdk-7u25-linux-x64.tar.gz 'http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.tar.gz'
tar -xzf jdk-7u25-linux-x64.tar.gz
export PATH="/home/ubuntu/jdk1.7.0_25/bin:$PATH"
curl -L https://get.rvm.io | bash
. /home/ubuntu/.rvm/scripts/rvm
rvm install jruby-1.7.3
rvm use jruby-1.7.3
sudo apt-get install -y git
git clone https://github.com/logstash/logstash.git
cd logstash/
git checkout v1.1.13
sudo apt-get install -y make
ruby gembag.rb logstash.gemspec
cd vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/
wget -O - 'https://github.com/jordansissel/ruby-filewatch/pull/19.diff' | patch -p1
cd ../../../../../../
make jar jar-test
aws s3api put-object --bucket ci-elasticsearch-development-flow --key logstash-1.1.13-monolithic-pr19b1.jar --acl public-read --body build/logstash-1.1.13-monolithic.jar
```
@dpb587 - I've switched to using the patched jar for https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other.
The `.sincedb` files definitely have more info in them, as you describe above.
:+1: As far as I can tell, after stopping and restarting the service, we no longer get duplicate logs shipped.
I'll leave the service running overnight, and we can see how it deals with an IIS log rotate (which creates a new filename)
Worth noting that the "Windows 2012 limitation" only applies to Windows 2012 ReFS volumes, not to the standard NTFS volumes that are still default on Windows 2012 (and used in Windows 2008 etc)
@mrdavidlaing, here's another build for you to try: logstash-1.1.13-monolithic-pr19b2.jar. This should include the more Windows-native handling referenced in the gist.
When I tested it, `b1` would error when I tried to rotate files:

```
$ mv test.log test.log-$$ && echo `date` hello >> test.log
Device or resource is busy
```
Whereas this proposed `b2` allowed the file rotation, and logstash successfully picked up the new line in the new file immediately. In my mind, that at least confirms I was able to patch something differently and that it is, potentially, respecting files in a more Windows-native way.
Additional patch notes (you can probably easily make a mockery of my Java understanding)...
```
wget -O - 'https://github.com/edwinf/ruby-filewatch/commit/5e676dc709297602c6fdfe1e5a23acfecaef8646.diff' | patch -Rp1
...
mkdir /tmp/logstash-patch
cp build/logstash-1.1.13-monolithic.jar /tmp/logstash-patch/
cd /tmp/logstash-patch/
cd ./META-INF/jruby.home/lib/ruby/shared/
wget 'https://gist.github.com/edwinf/6329520/raw/10eb5344362ba816245e9492b4840e17f5813b9d/RubyFileExt.java'
javac -classpath ../../../../../logstash-1.1.13-monolithic.jar RubyFileExt.java
jar cvf JRubyFileExtension.jar RubyFileExt.class
cd ../../../../../
jar uf logstash-1.1.13-monolithic.jar META-INF/jruby.home/lib/ruby/shared/JRubyFileExtension.jar
```
I've switched

- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other
- https://github.com/cityindex/logsearch-on-aws/tree/master/live-cluster/shippers/diagnostics

to `logstash-1.1.13-monolithic-pr19b2.jar`.
Both seem to be working correctly.
All shippers are now running `logstash-1.1.13-monolithic-pr19b2.jar`:

- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/log4net
- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other
- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/diagnostics
- https://github.com/cityindex/logsearch-on-aws/tree/master/live-cluster/shippers/diagnostics
Will run these overnight to test whether log rotation works.
As I'm looking through logstash/Windows/file bugs, I've rediscovered LOGSTASH-394. It has to do with the `Device or resource is busy` error I was seeing when trying to rotate files during testing. The fix is included with `b2`; this is just good reference and background knowledge.
A more recent, open ticket covering the same problem as 394... LOGSTASH-1123
Should've just waited; these issues turn out to be commonly reported...
From LOGSTASH-1123:

> The correct fix would be to open the file in read-only / shared mode, which doesn't seem to be a concept available in ruby.
Is that actually true? I'd be kinda sh... - let's say surprised.
Well, at least Ruby 2.0.0 features `flock(locking_constant)`, which offers `LOCK_SH`:

> Shared lock. Multiple processes may each hold a shared lock for a given file at the same time.
And read-only has definitely been available with Ruby 1.8+ already, as expected - everything else would have been more than weird in fact ...
... and `LOCK_SH` is also Ruby 1.8+ at least, so the issue offers a slightly odd description/interpretation of the encountered problem.
I don't quite know the boundaries of JRuby vs Ruby vs Java, but also note the Java issue referenced by the patch we're using: http://bugs.sun.com/view_bug.do?bug_id=6357433.
> ... opening a file on Windows using a FileInputStream causes other processes to be unable to delete the file. This is intended behavior as the code is currently written, as the sharing flags specified during the opening of the file are FILE_SHARE_READ and FILE_SHARE_WRITE ...
Very good point, I entirely forgot about that aspect - ironically, the JRuby implementation used to be significantly ahead of native Ruby in various areas due to building on the mature, high-performance JVM (though that gap has narrowed in the meantime, I think). However, that bug surely affects JRuby on Windows only indeed - have you encountered the listed workaround for JDK 7 being used in any of the various issues/patches?
> For jdk7, the workaround is to use new file system API. So instead of new FileInputStream(f) and new FileOutputStream(f), use f.toPath().newInputStream() and f.toPath().newOutputStream().
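A minimal sketch of that workaround, assuming the released java.nio.file API (where the equivalent calls are `Files.newInputStream(path)` / `Files.newOutputStream(path)` rather than `path.newInputStream()`); the intent is that the NIO open on Windows uses sharing flags that also let other processes delete or rename the file:

```java
// Sketch of the JDK 7 workaround: read via java.nio.file.Files instead of
// FileInputStream, so holding the file open should not block rotation/deletion
// by other processes on Windows.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SharedRead {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args.length > 0 ? args[0] : "test.log");

        // Files.newInputStream() goes through the NIO channel machinery rather than
        // FileInputStream, so a concurrent "mv test.log test.log-1" should no longer
        // fail with "Device or resource is busy" while we hold the stream open.
        try (InputStream in = Files.newInputStream(log);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```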
I know the gist has fixed the issue as of `b2`, but I believe it uses a slightly different approach than the listed workaround.
With `b2`, https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other -> specifically for IIS logs

It's possible that this fix: https://github.com/cityindex/logsearch-development-flow/issues/174#issuecomment-24123708 might also resolve this issue.
@dpb587 - I will upgrade everything to logstash-1.1.13-monolithic-pr19b3.jar and start testing again
Log file rotation where the new log file has a new name doesn't appear to be working ...
This could be due to a permissions problem which has now been fixed.
All shippers updated to logstash-1.1.13-monolithic-pr19b5.jar.
Initial tests seem positive - will have to run overnight to check that log rotation works.
Considered done by @mrdavidlaing and @dpb587, who are going to test this during the upcoming milestone - upstream integration is going to be handled via #178 if needed.
Shippers running logstash-1.1.13-monolithic-pr19b5.jar seem to handle log rotation correctly.
When restarting the LogStash shipper service we get a larger volume of messages sent than would be expected if we were just catching up from the last log event sent.
It's almost as if the sincedb data is being ignored and all messages from the beginning of the log files are resent.
At this stage this is just speculation as to the cause of the surge, so the first step will be to isolate what the actual surge of messages looks like.