Closed by mrdavidlaing 11 years ago
This issue triggers and thus supersedes #122.
I've confirmed that there are three copies of the same message in some of the early morning hours from log replays.
Well, at least this is a known issue... https://logstash.jira.com/browse/LOGSTASH-429
Moved to Icebox due to upstream dependency and low priority (which might need to be increased again once the number of shippers and handled environments increases as well).
Re-prioritised for next milestone; reboots happen, and with this issue effectively causing double logs to end up in ElasticSearch, reports become unusable.
We're going to have to figure out a way to escalate this issue; either by fixing ourselves or paying a core maintainer to fix it.
I'm not familiar with cygwin, but would running logstash under cygwin accidentally work around any issues, given the way cygwin handles network paths and virtualizes the file system?
Hard to say, since we're monitoring log files on windows file shares.
This seems potentially relevant: https://github.com/jordansissel/ruby-filewatch/pull/19
I'll see if I can compile a JAR file with it for testing.
@dpb587 - That looks very promising.
Let us know how the JAR recompilation works.
Not sure if I understand the functionality correctly, but the `uuid` function that's in logstash 1.2.0 might help prevent duplicates in these kinds of scenarios - http://logstash.net/docs/1.2.0.beta2/filters/uuid
@mrdavidlaing - I think the purpose of the uuid function is the opposite of what's desired here: it doesn't deduplicate inserted duplicates by folding them into one (or skipping them); instead it adds a UUID to each message, so every copy becomes unique.
The docs are a bit confusing indeed, but looking at the underlying commit https://github.com/bobtfish/logstash/commit/1c85e2e79fe19680bd998aaf354b524ff04309c6#L0R44 seems to confirm that it lives up to its name and just replaces or amends the configured fields with `SecureRandom.uuid` in order to make them unique in comparison to existing sibling messages.
@sopel - thanks for tracking down the internals of uuid - I agree, it's not what we want. If anything, we probably want an MD5-style hash of all the data to be used for the _id.
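To illustrate the idea (a hypothetical Java sketch, not anything that exists in logstash), a deterministic _id derived from the raw line would make re-shipped duplicates overwrite the same Elasticsearch document instead of creating new ones:

```java
// Hypothetical sketch: hash the raw log line into a stable document _id so that
// shipping the same line twice targets the same Elasticsearch document.
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class EventId {
    public static String idFor(String rawLogLine) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(rawLogLine.getBytes(StandardCharsets.UTF_8));
        // Hex-encode the 128-bit digest; identical lines always map to the same _id.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        // Duplicate shipments of this line would all resolve to one document.
        System.out.println(idFor("2013-08-27 03:12:05 GET /healthcheck 200"));
    }
}
```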
I spent several hours on this yesterday trying to get this working, but I haven't had any success yet. My inexperience in Ruby, Java, and JRuby is a factor. It's been fairly easy to get a re-compiled logstash JAR with custom code, but the submitted patch seems to be missing some code.
I'm going to send the guy a message to see if I can get further pointers.
Good to hear you can recompile LogStash.
Agree that getting some feedback from bobtfish is the right approach at this stage.
@mrdavidlaing, when you get a chance, would you mind giving the following a try: logstash-1.1.13-monolithic-pr19b1.jar. I tested this on Windows 2008 R2 SP1 with Java SE 1.7.0_25-b17 using a file on a shared network path exported by the local host. This may not fix the nightly log rotation issues, but let's first confirm it better supports us restarting mid-day.
Where the official `logstash-1.1.13-flatjar.jar` writes `.sincedb` lines like:

```
0 0 2 806
```

the patched `logstash-1.1.13-monolithic-pr19b1.jar` writes `.sincedb` lines like:

```
3773355946-103005-65536 0 2 946
```
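For what it's worth, our reading of those lines (an interpretation of the filewatch sincedb layout, not an official spec) is `<file identity> <dev major> <dev minor> <byte offset>`; the patch replaces the first column - a useless `0` inode on Windows in the stock jar - with a real per-file identifier, which is what should let logstash resume rather than re-read:

```java
// Sketch only: parse a .sincedb line under the assumed layout
// "<file identity> <dev major> <dev minor> <byte offset>".
public class SincedbLine {
    final String fileIdentity; // "0" from the stock jar; "3773355946-103005-65536" from the patched jar
    final int devMajor;
    final int devMinor;
    final long byteOffset;     // how far into the file logstash has already shipped

    SincedbLine(String raw) {
        String[] parts = raw.trim().split("\\s+");
        this.fileIdentity = parts[0];
        this.devMajor = Integer.parseInt(parts[1]);
        this.devMinor = Integer.parseInt(parts[2]);
        this.byteOffset = Long.parseLong(parts[3]);
    }

    public static void main(String[] args) {
        SincedbLine stock = new SincedbLine("0 0 2 806");
        SincedbLine patched = new SincedbLine("3773355946-103005-65536 0 2 946");
        System.out.println(stock.fileIdentity + " -> " + patched.fileIdentity);
    }
}
```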
Current repro steps are:
```
wget -O jdk-7u25-linux-x64.tar.gz 'http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.tar.gz'
tar -xzf jdk-7u25-linux-x64.tar.gz
export PATH="/home/ubuntu/jdk1.7.0_25/bin:$PATH"
curl -L https://get.rvm.io | bash
. /home/ubuntu/.rvm/scripts/rvm
rvm install jruby-1.7.3
rvm use jruby-1.7.3
sudo apt-get install -y git
git clone https://github.com/logstash/logstash.git
cd logstash/
git checkout v1.1.13
sudo apt-get install -y make
ruby gembag.rb logstash.gemspec
cd vendor/bundle/jruby/1.9/gems/filewatch-0.5.1/
wget -O - 'https://github.com/jordansissel/ruby-filewatch/pull/19.diff' | patch -p1
cd ../../../../../../
make jar jar-test
aws s3api put-object --bucket ci-elasticsearch-development-flow --key logstash-1.1.13-monolithic-pr19b1.jar --acl public-read --body build/logstash-1.1.13-monolithic.jar
```
@dpb587 - I've switched to using the patched jar for https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other.
The `.sincedb` files definitely have more info in them, as you describe above.
:+1: As far as I can tell, after stopping and restarting the service, we no longer get duplicate logs shipped.
I'll leave the service running overnight, and we can see how it deals with an IIS log rotate (which creates a new filename)
Worth noting that the "Windows 2012 limitation" only applies to Windows 2012 ReFS volumes, not to the standard NTFS volumes that are still default on Windows 2012 (and used in Windows 2008 etc)
@mrdavidlaing, here's another build for you to try: logstash-1.1.13-monolithic-pr19b2.jar. This should include the more Windows-native handling referenced in the gist.
When I tested it, `b1` would error when I tried to rotate files:

```
$ mv test.log test.log-$$ && echo `date` hello >> test.log
Device or resource is busy
```
Whereas this proposed `b2` allowed the file rotation, and logstash successfully picked up the new line in the new file immediately. In my mind, that at least confirms I was able to patch something differently and that it is, potentially, respecting files in a more Windows-native way.
Additional patch notes (you can probably easily make a mockery of my Java understanding)...
```
wget -O - 'https://github.com/edwinf/ruby-filewatch/commit/5e676dc709297602c6fdfe1e5a23acfecaef8646.diff' | patch -Rp1
...
mkdir /tmp/logstash-patch
cp build/logstash-1.1.13-monolithic.jar /tmp/logstash-patch/
cd /tmp/logstash-patch/
cd ./META-INF/jruby.home/lib/ruby/shared/
wget 'https://gist.github.com/edwinf/6329520/raw/10eb5344362ba816245e9492b4840e17f5813b9d/RubyFileExt.java'
javac -classpath ../../../../../logstash-1.1.13-monolithic.jar RubyFileExt.java
jar cvf JRubyFileExtension.jar RubyFileExt.class
cd ../../../../../
jar uf logstash-1.1.13-monolithic.jar META-INF/jruby.home/lib/ruby/shared/JRubyFileExtension.jar
```
I've switched

- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other
- https://github.com/cityindex/logsearch-on-aws/tree/master/live-cluster/shippers/diagnostics

to `logstash-1.1.13-monolithic-pr19b2.jar`.
Both seem to be working correctly.
All shippers are now running `logstash-1.1.13-monolithic-pr19b2.jar`:

- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/log4net
- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other
- https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/diagnostics
- https://github.com/cityindex/logsearch-on-aws/tree/master/live-cluster/shippers/diagnostics
Will run these overnight to test whether log rotation works.
As I'm looking through logstash/Windows/file bugs, I've rediscovered LOGSTASH-394. It has to do with the `Device or resource is busy` error I was seeing when trying to rotate files during testing. The fix is included with `b2`; this is just good reference and background knowledge.
A more recent, open ticket covering the same problem as 394... LOGSTASH-1123
Should've just waited; these issues turn out to be commonly reported...
From LOGSTASH-1123:

> The correct fix would be to open the file in read-only / shared mode, which doesn't seem to be a concept available in ruby.
Is that actually true? I'd be kinda sh... - let's say surprised.
Well, at least Ruby 2.0.0 features `flock(locking_constant)`, which offers `LOCK_SH`:

> Shared lock. Multiple processes may each hold a shared lock for a given file at the same time.
And read-only has definitely been available with Ruby 1.8+ already, as expected - everything else would have been more than weird in fact ...
... and `LOCK_SH` is also Ruby 1.8+ at least, so the issue offers a slightly odd description/interpretation of the encountered problem.
I don't quite know the boundaries of JRuby vs Ruby vs Java, but also note the Java issue referenced by the patch we're using: http://bugs.sun.com/view_bug.do?bug_id=6357433.
> ... opening a file on Windows using a FileInputStream causes other processes to be unable to delete the file. This is intended behavior as the code is currently written, as the sharing flags specified during the opening of the file are FILE_SHARE_READ and FILE_SHARE_WRITE ...
Very good point, I entirely forgot about that aspect - ironically, the JRuby implementation used to be significantly ahead of native Ruby in various areas due to building on the mature, high-performance JVM (though that gap has narrowed in the meantime, I think). However, that bug surely affects JRuby on Windows only indeed - have you encountered the listed workaround for JDK 7 being used in any of the various issues/patches?
> For jdk7, the workaround is to use new file system API. So instead of new FileInputStream(f) and new FileOutputStream(f), use f.toPath().newInputStream() and f.toPath().newOutputStream().
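A minimal sketch of that workaround, assuming the released java.nio.file API (where the equivalent calls are `Files.newInputStream(path)` / `Files.newOutputStream(path)` rather than `path.newInputStream()`); the intent is that the NIO open on Windows uses sharing flags that also let other processes delete or rename the file:

```java
// Sketch of the JDK 7 workaround: read via java.nio.file.Files instead of
// FileInputStream, so holding the file open should not block rotation/deletion
// by other processes on Windows.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SharedRead {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args.length > 0 ? args[0] : "test.log");

        // Files.newInputStream() goes through the NIO channel machinery rather than
        // FileInputStream, so a concurrent "mv test.log test.log-1" should no longer
        // fail with "Device or resource is busy" while we hold the stream open.
        try (InputStream in = Files.newInputStream(log);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```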
I know the gist has fixed the issue as of `b2`, but I believe it uses a slightly different approach than the listed workaround.
With `b2`, https://github.com/cityindex/logsearch-on-aws/tree/master/ppe-cluster/shippers/other -> specifically for IIS logs

It's possible that this fix: https://github.com/cityindex/logsearch-development-flow/issues/174#issuecomment-24123708 might also resolve this issue.
@dpb587 - I will upgrade everything to logstash-1.1.13-monolithic-pr19b3.jar and start testing again
Log file rotation where the new log file has a new name doesn't appear to be working ...
This could be due to a permissions problem which has now been fixed.
All shippers updated to logstash-1.1.13-monolithic-pr19b5.jar.
Initial tests seem positive - will have to run overnight to check that log rotation works.
Considered done by @mrdavidlaing and @dpb587, who are going to test this during the upcoming milestone - upstream integration is going to be handled via #178 if needed.
Shippers running logstash-1.1.13-monolithic-pr19b5.jar seem to handle log rotation correctly.
When restarting the LogStash shipper service we get a larger volume of messages sent than would be expected if we were just catching up from the last log event sent.
It's almost as if the sincedb data is being ignored and all messages from the beginning of the log files are resent.
At this stage this is just speculation as to the cause of the surge, so the first step will be to isolate what the actual surge of messages looks like.