awslabs / amazon-kinesis-agent

Continuously monitors a set of log files and sends new data to Amazon Kinesis Streams and Amazon Kinesis Firehose in near-real-time.

Is there any possibility that records go missing or get duplicated? #17

Open DON1101 opened 8 years ago

DON1101 commented 8 years ago

Hi there,

We are using Kinesis agent for our log monitoring and it's really helping us!

However, we encountered some situations where:

1) our logs are missing (or partially missing within a single log line)
2) our logs are sent to the Kinesis stream more than once

Is there any possibility that records will go missing or be duplicated in the current Kinesis agent version? I saw these comment lines in the source code:

/*
 * TODO: If a Truncate/Rename rotation happen while this method is running,
 *       behavior is undefined. Should we lock the current file? Would
 *       this prevent truncation and renaming?
 * @param newSnapshot
 * @throws IOException
 */
@VisibleForTesting
boolean updateCurrentFile(TrackedFileList newSnapshot) throws IOException {

PS: 1) Our Kinesis agent config file:

{ "checkpointFile": "/tmp/aws-kinesis-agent-checkpoints", "cloudwatch.emitMetrics": false, "kinesis.endpoint": "https://kinesis.ap-southeast-1.amazonaws.com", "awsAccessKeyId": "xxx", "awsSecretAccessKey": "yyy", "flows": [ { "filePattern": "/mnt/logs/search.*log", "multiLineStartPattern": "ProjectName:ComponentName", "kinesisStream": "stream-name", "partitionKeyOption": "RANDOM", "maxBufferAgeMillis": "1000" } ] }

2) We are using the logrotate tool for our log rotation.

chaochenq commented 8 years ago

Hi, thanks for reporting this. Ideally there shouldn't be any data loss, but there could be duplicated data (in rare cases, though).

But you are right, we seem to have a bug when logs are rotated by copytruncate. Some customers have reported that the agent is not able to identify the file to tail after a rotation occurs. If the agent somehow decides to tail the rotated file rather than the new file, then unfortunately both data loss and duplication can happen...

A similar issue was reported here: https://github.com/awslabs/amazon-kinesis-agent/issues/14

daroczig commented 8 years ago

Any updates on this? Or suggestions on how to avoid the bug resulting in data loss? We missed a few events in one of our streams and I'm having a hard time trying to figure out what happened.

chaochenq commented 8 years ago

@daroczig Did you apply any logrotate rules (in particular copytruncate) to your files? Or did you shut down your instance and never restart it? Those are the situations I can think of off the top of my head. Did you run into any other situation that could cause data loss?

daroczig commented 8 years ago

Thanks for your prompt reply, @chaochenq, and also for working on this very useful tool.

Regarding your question, there's only one file monitored and we have the below logrotate config for that:

/path/to/file {
    su www-data www-data
    rotate 7
    dateext
    daily
    missingok
    create 644 www-data www-data
}

There's only one instance, and it wasn't even restarted around or since the data hole, where we missed a couple of events within the same second. I cannot reproduce this and haven't seen such a problem in the past few weeks. The aws-kinesis-agent.log doesn't show anything strange for that time period.

chaochenq commented 8 years ago

Hmm... that's weird. FYI, we have canaries and functional tests continuously running on our backend to detect data loss, and normally there shouldn't be any. Did you check your agent's CloudWatch metrics (under the AWSKinesisAgent namespace) to see if the send attempts and successes line up with the number of events you get on the Kinesis/Firehose side?

I'd recommend checking the CW metrics as well if you don't see anything strange in the agent log. BTW, do you use the agent with Kinesis Streams or Kinesis Firehose? The agent is only responsible for putting your data into a stream; it has no control over anything beyond that. So anything that happened in your Kinesis consumer app or Firehose delivery stream could also cause data loss, although that's very rare...
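For reference, these metrics only show up under the AWSKinesisAgent namespace when metric emission is enabled in the agent config (the config at the top of this thread has cloudwatch.emitMetrics set to false). A minimal sketch of an agent.json with emission turned on; the file path and delivery stream name below are just placeholders:

{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/path/to/file",
      "deliveryStream": "your-delivery-stream"
    }
  ]
}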

daroczig commented 8 years ago

Thanks, @chaochenq! I'm checking the CW metrics, but no clue so far (the RecordSendAttempts and IncomingRecords values do not seem to match in general).

We are using Kinesis Firehose, writing to Redshift and S3. I've looked for the missing events in both, without luck. I'll add some extra monitoring and logging before writing to the file processed by the AWS Kinesis Agent and will report back if I see this happening again.

chaochenq commented 8 years ago

@daroczig No problem. BTW, RecordSendAttempts from the agent won't always match the incoming records. One attempt is just an attempt by the agent to send to Kinesis/Firehose; it retries based on the returned result, so RecordSendAttempts can be higher than IncomingRecords. You probably want to look at BytesSent instead?

daroczig commented 8 years ago

We had another missing event over the weekend, but fortunately found something useful in the logs:

2016-09-10 06:51:00.499+0000 ***** (FileTailer[fh:*****:/var/log/*****]) com.amazon.kinesis.streaming.agent.tailing.SourceFileTracker [ERROR] Error while tracking file rotation for /var/log/*****: All files have gone away!
        Current file: TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)
        Current file snapshot (newest first):
                TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)
        New file snapshot (newest first):
                TrackedFile(id=(dev=ca01,ino=135832), path=/var/log/*****, lastModifiedTime=1473490260000, size=0)
2016-09-10 06:51:01.591+0000 ***** (FileTailer[fh:fis_alerts_noflume:/var/log/*****]) com.amazon.kinesis.streaming.agent.tailing.FileTailer [WARN] FileTailer[fh:*****:/var/log/*****]: Failed to start tailing from previous checkpoint TrackedFile(id=(dev=ca01,ino=136637), path=/var/log/*****, lastModifiedTime=1473490252000, size=207149253)@207156415

Any ideas how to overcome this?

chaochenq commented 8 years ago

@daroczig How are you rotating the files? It looks like at that time your old file was removed while a new file matching the pattern hadn't shown up yet, so the agent lost its context and didn't know which file to track.

daroczig commented 8 years ago

@chaochenq, I have the above (posted 5 days ago) logrotate config in place -- I know of no other rotation method configured for this file, but I'm double checking.

Update: checking the syslog suggests that the logrotate script did indeed run at 6:51 on the given day; the related log messages:

Sep 10 06:45:01 host CRON[18338]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:46:01 host CRON[18510]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:47:01 host CRON[19015]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:48:01 host CRON[19185]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:49:01 host CRON[19366]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:51:00 host cracklib: no dictionary update necessary.
Sep 10 06:51:00 host CRON[20193]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep  9 07:01:01 host CRON[20374]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:52:01 host CRON[20375]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:53:01 host CRON[20879]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)
Sep 10 06:54:01 host CRON[21036]: (root) CMD (/usr/bin/aws-kinesis-agent-babysit > /dev/null 2>&1)

I really don't know why that Sep 9 07:01:01 event shows up in this series -- it's pretty unexpected. Any ideas?

chaochenq commented 8 years ago

These look like cron log messages; can I take a look at your syslog messages?

And I really have no idea why that cron job shows as 09/09 07:01...

daroczig commented 8 years ago

@chaochenq that's all I have in the syslog for the related time period. This is on Ubuntu 14.04.5 LTS.

chaochenq commented 8 years ago

I've actually never personally tested the agent on Ubuntu; the Ubuntu bootstrap script was contributed by the community (which I appreciated). Maybe cron and logrotate behave slightly differently between Red Hat and Ubuntu. Can I ask: do you see this happen every time a rotation takes place, or just randomly a few times?

daroczig commented 8 years ago

Pretty randomly -- it happened only twice over the following few weeks. Do you have any suggestions on how to debug this?

chaochenq commented 8 years ago

I'd try the following things:

1) Enable DEBUG or TRACE logging in the agent config (be careful if you want to do this, since it will make your agent log spammy).
2) Try some other rotation mode like rename or copy instead of create to see if the issue persists.
3) Looking at the syslog, I'm still not sure why cracklib was invoked during the logrotate run or what that message means -- can you relate it to your log rotation?
4) One more thing to keep in mind: we recommend narrowing down each tailing filePattern so it matches the one file (and its rotations) that you are actively tailing. E.g., if you have /path/to/dir/logA and /path/to/dir/logB, and rotation creates /path/to/dir/logA.1, /path/to/dir/logA.2, /path/to/dir/logA.3, /path/to/dir/logB.1, /path/to/dir/logB.2, /path/to/dir/logB.3 ..., we recommend using two flows, for /path/to/dir/logA* and /path/to/dir/logB*, rather than one flow for /path/to/dir/log* (see the sketch below).
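To illustrate point 4, here's a minimal sketch of a flows section with one flow per log family; the paths and delivery stream names are only placeholders:

{
  "flows": [
    {
      "filePattern": "/path/to/dir/logA*",
      "deliveryStream": "deliveryStreamA"
    },
    {
      "filePattern": "/path/to/dir/logB*",
      "deliveryStream": "deliveryStreamB"
    }
  ]
}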

daroczig commented 8 years ago

Thanks a lot for the pointers! Based on point 4, I think I know what the problem might be. The flows part is defined as follows in our config file:

"flows": [
    {
      "filePattern": "/path/to/file",
      "deliveryStream": "***",
      "maxBufferSizeRecords": 1
    }
]

And now I suspect the filePattern should rather be /path/to/file* -- with a star at the end so it also matches the rotated logs, right? Would that explain the very few and very rare missing events?

chaochenq commented 8 years ago

Yeah, I think so. If you only point to the tailed file itself, then because you use create mode there's a small chance that the agent checks for files to tail at a moment when the file doesn't exist yet (it hasn't been created), and the agent then loses its tailing context.
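In other words, widening the pattern so it also matches the rotated files should keep the tailing context alive. A minimal sketch of the adjusted flow, reusing the placeholder path from earlier in the thread and leaving the delivery stream name elided:

{
  "flows": [
    {
      "filePattern": "/path/to/file*",
      "deliveryStream": "***",
      "maxBufferSizeRecords": 1
    }
  ]
}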