DON1101 opened this issue 8 years ago
Hi, thanks for reporting this. Ideally there shouldn't be any data loss, though data can be duplicated in rare cases.
But you are right: we seem to have a bug when logs are rotated by copytruncate. Some customers have reported that the agent is not able to identify the file to tail after a rotation occurs. If the agent somehow decides to tail the rotated file rather than the new file, then unfortunately both data loss and duplication can happen...
A similar issue was reported here: https://github.com/awslabs/amazon-kinesis-agent/issues/14
Any updates on this? Or suggestions on how to avoid the bug resulting in data loss? We missed a few events in one of our streams and I'm having a hard time figuring out what happened.
@daroczig Did you apply any logrotate rules (in particular copytruncate) to your files? Or did you shut down your instance and never restart it? These are some situations I have off the top of my head. Did you run into other situations that could cause data loss?
Thanks for your prompt reply, @chaochenq, and also for working on this very useful tool.
Regarding your question, there's only one file monitored, and we have the logrotate config below for it:
/path/to/file {
    su www-data www-data
    rotate 7
    dateext
    daily
    missingok
    create 644 www-data www-data
}
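For context on why a create-style rotation can confuse the agent's file tracking (the agent identifies a file by its (device, inode) pair, as the TrackedFile log lines later in this thread show), here is a minimal shell sketch, with illustrative paths not taken from this issue, of what logrotate's create mode does to the tailed file:

```shell
# Sketch of a "create"-mode rotation (illustrative paths).
log=/tmp/kinesis-agent-demo.log
echo "event 1" >> "$log"
old_inode=$(stat -c %i "$log")     # the inode the agent is tracking
mv "$log" "$log-20160910"          # rotation renames the live file...
install -m 644 /dev/null "$log"    # ...then creates an empty replacement
new_inode=$(stat -c %i "$log")
# The path now points at a brand-new inode; a filePattern matching only
# the bare path briefly sees no file (or an empty one) to resume from.
echo "old=$old_inode new=$new_inode"
```

The rename plus re-create means there is a short window in which the pattern matches either nothing or a zero-byte file, which matches the "All files have gone away!" error reported below.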
There's only one instance, and it wasn't even restarted around or since the gap, in which we missed a couple of events within the same second. I cannot reproduce this and haven't seen such a problem in the past few weeks. The aws-kinesis-agent.log doesn't show anything strange in that time period.
Hmm... that's weird. FYI, we have canaries and functional tests continuously running against our backend to detect data loss; normally there shouldn't be any. Did you check your Agent CloudWatch metrics (under the AWSKinesisAgent namespace) to see if the send attempts and successes line up with the number of events you get on the Kinesis/Firehose side?
I'd recommend you also check the CW metrics if you don't see anything strange in the agent log. BTW, do you use the Agent with Kinesis Streams or Kinesis Firehose? The Agent is only responsible for putting your data into a stream; it has no control over anything beyond that. So anything that happens in your Kinesis consumer app or Firehose delivery streams could also cause data loss, although that's very rare...
Thanks, @chaochenq! Checking the CW metrics, but no clue so far (the RecordSendAttempts and IncomingRecords values do not seem to match in general).
We are using Kinesis Firehose, writing to Redshift and S3. I've looked for the missing events in both, without luck. I'll add some extra monitoring and logging before writing to the file processed by the AWS Kinesis Agent and will report back if I see this happening again.
@daroczig No problem. BTW, RecordSendAttempts from the Agent doesn't always match the incoming records. One attempt is just an attempt by the Agent to send to Kinesis/Firehose; it will retry depending on the returned result, so RecordSendAttempts can be higher than IncomingRecords. You probably want to look at BytesSent instead?
We had another missing event over the weekend, but fortunately found something useful in the logs:
2016-09-10 06:51:00.499+0000 ***** (FileTailer[fh:*****:/var/log/*****]) com.amazon.kinesis.streaming.agent.tailing.SourceFileTracker [ERROR] Error while tracking file rotation for /var/log/*****: All files have gone away!
Current file: TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)
Current file snapshot (newest first):
TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)
New file snapshot (newest first):
TrackedFile(id=(dev=ca01,ino=135832), path=/var/log/*****, lastModifiedTime=1473490260000, size=0)
2016-09-10 06:51:01.591+0000 ***** (FileTailer[fh:fis_alerts_noflume:/var/log/*****]) com.amazon.kinesis.streaming.agent.tailing.FileTailer [WARN] FileTailer[fh:*****:/var/log/*****]: Failed to start tailing from previous checkpoint TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)@207156415
Any ideas how to overcome this?
@daroczig how are you rotating the files? It looks like at that time your old files were removed while new files matching the pattern hadn't come up yet, so the agent lost its context and didn't know which file to track.
@chaochenq, I have the logrotate config posted above (5 days ago) in place -- I know of no other rotation method for this file, but I'm double-checking.
Update: checking the syslog suggests that the logrotate script did indeed run at 6:51 on the given day; the related log messages:
Sep 10 06:45:01 host CRON[18338]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:46:01 host CRON[18510]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:47:01 host CRON[19015]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:48:01 host CRON[19185]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:49:01 host CRON[19366]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:51:00 host cracklib: no dictionary update necessary.
Sep 10 06:51:00 host CRON[20193]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 9 07:01:01 host CRON[20374]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:52:01 host CRON[20375]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:53:01 host CRON[20879]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:54:01 host CRON[21036]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
I really don't know why that Sep 9 07:01:01 entry shows up in this series; it's pretty unexpected. Any ideas?
Those look like cron log messages; can I take a look at your syslog messages?
And I really have no idea why that cron job shows as Sep 9 07:01...
@chaochenq that's all I have in the syslog for the relevant time period. This is on Ubuntu 14.04.5 LTS.
I actually never personally tested the Agent on Ubuntu; the Ubuntu bootstrap script was contributed by the community (which I appreciate). Maybe the cron job and log rotation behave slightly differently between Red Hat and Ubuntu. Can I ask: does this definitely happen every time a rotation takes place, or just randomly a few times?
Pretty randomly -- it happened only twice over the following few weeks. Do you have any suggestions on how to debug this?
I'd try the following things:
1) Enable DEBUG or TRACE in the agent config (be careful if you do this, since it will make your agent log spammy).
2) Try another rotation mode, like rename or copy instead of create, to see if the issue persists.
3) Looking at the syslog, I'm still not sure why cracklib was invoked during the logrotate run or what that message means; can you relate it to your log rotation?
4) One more thing to keep in mind: we recommend that you narrow each tailing filePattern down to the one file family you are actively tailing. E.g., if you have /path/to/dir/logA and /path/to/dir/logB, and rotation creates /path/to/dir/logA.1, /path/to/dir/logA.2, /path/to/dir/logA.3, /path/to/dir/logB.1, /path/to/dir/logB.2, /path/to/dir/logB.3, ... we recommend using two file flows, for /path/to/dir/logA* and /path/to/dir/logB*, instead of one file flow for /path/to/dir/log*.
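To make point (4) concrete, a hedged sketch of what the flows section of the agent's JSON config could look like with one flow per log family (the delivery stream names here are hypothetical, not from this issue):

```json
{
  "flows": [
    {
      "filePattern": "/path/to/dir/logA*",
      "deliveryStream": "streamA"
    },
    {
      "filePattern": "/path/to/dir/logB*",
      "deliveryStream": "streamB"
    }
  ]
}
```

Each flow's pattern then keeps matching the rotated files (logA.1, logA.2, ...) as well as the live one, so the agent never loses sight of the file family across a rotation.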
Thanks a lot for the pointers! Based on point (4), I think I know what the problem might be. The flows part is defined as below in our config file:
"flows": [
    {
        "filePattern": "/path/to/file",
        "deliveryStream": "***",
        "maxBufferSizeRecords": 1
    }
]
And now I suspect the filePattern should rather be /path/to/file* -- with a star at the end so it also matches the rotated logs, right? Would this explain the very few and very rare missing events?
Yeah, I think so. If you point only to the tailed file, then since you use create mode, there is a small chance that when the agent checks for files to tail, your file doesn't actually exist yet (it hasn't been created), in which case the agent loses its tailing context.
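For reference, a minimal sketch of the adjusted flows section under that assumption (the delivery stream name stays elided as in the original post):

```json
"flows": [
    {
        "filePattern": "/path/to/file*",
        "deliveryStream": "***",
        "maxBufferSizeRecords": 1
    }
]
```

The trailing star means the pattern still matches the dateext-rotated files (/path/to/file-20160910, etc.) during the window when the bare path has just been renamed and the replacement not yet created.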
Hi there,
We are using Kinesis agent for our log monitoring and it's really helping us!
However, we encountered some situations where: 1) our logs are missing (or a single log line is partially missing); 2) our logs are sent to the Kinesis stream more than once.
Is there any possibility that records will be missing or duplicated in the current Kinesis agent version? I saw these comment lines in the source code:
PS: 1) Our Kinesis agent config file:
2) We are using the logrotate tool for our log rotation.