sharms closed this issue 7 years ago
Seems like this story needs an implementation sketch before we could think about moving forward on it. So... how do we propose to actually do this? I see a few possibilities:
This seems to be what we are doing today. We'd just need to list everything else here and commit to keeping it up-to-date.
See Compartmentalizing CloudWatch Logs Agent Configuration Files for an idea of how we could do this. We could have each release / job write out a configuration file declaring which of its logs should be backed up.
I'm envisioning we'd have a small script that would iterate over something like /var/vcap/sys/log/*/*.log
and append each file to the awslogs configuration before start.
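That loop might look something like the following sketch. The function name, the config file location, and the stanza fields are assumptions about the BOSH VM layout and the awslogs agent format, not code from the release:

```shell
#!/bin/bash
# Sketch of the proposed pre-start step. The directory layout and the
# awslogs config path are assumptions, not taken from the release.
append_log_stanzas() {
  local log_dir="$1" conf="$2" f
  for f in "$log_dir"/*/*.log; do
    [ -e "$f" ] || continue   # glob matched nothing
    cat >> "$conf" <<EOF

[$f]
file = $f
log_stream_name = {instance_id}
EOF
  done
}

# On a real VM this might be invoked as:
# append_log_stanzas /var/vcap/sys/log /var/vcap/jobs/awslogs/config/awslogs.conf
```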
Thoughts? Better Ideas?
Testing more comprehensive log capture on a test host.
Committing my initial attempt at the pre-start script: https://github.com/18F/cg-awslogs-boshrelease/pull/5
Needs more work tomorrow, tho --
The above PR seems ready to me. If @cnelson approves and merges, I can test and then push this to the next column.
Changes merged to master -- now verifying we are seeing new log events in AWS console.
Deployed awslogs changes to logsearch --
Didn't work because the additional config dir was being improperly specified.
Okay, in testing it looks as if {instance_id} is not unique enough and is causing zillions of this error:
cwlogs.push.publisher - WARNING - 12345 - Thread-67 - Multiple agents might be sending log events to log stream (nats_to_syslog i-a89101112) with sequence token (LONG_STRING_OF_DIGITS|none). This could cause duplicates and is not recommended.
I might need to amend the log stream identifier, or there might be another problem.
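For context, the stream name is set per stanza in the agent configuration. A sketch (paths, group name, and suffix are illustrative, not from the release) of making the stream unique per file as well as per instance:

```ini
[/var/vcap/sys/log/nats/nats_to_syslog.log]
file = /var/vcap/sys/log/nats/nats_to_syslog.log
log_group_name = logsearch
# If two stanzas (e.g. a log and its rotated copy) resolve to the same
# stream, their readers fight over the sequence token. Combining the
# instance id with the file name keeps each reader on its own stream.
log_stream_name = {instance_id}-nats_to_syslog
```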
Okay, after testing different log_stream_name configs (didn't have much hope for that), I conclude that the aggressive harvesting of all files under /var/vcap/sys/log
is causing sequence token errors. I'm going to be smarter about which files to ingest in the pre-start script, because gathering all valid logs for compliance is the most important goal here.
Okay, I have a unique log stream name & a sane group name and am excluding very obviously non-ingestable logs from the config (files that are not text or empty, and files that have a rotation timestamp) to avoid uselessly spawning log readers. No sequence errors nor multiple-agent errors any more.
Just waiting on a thumbs-up and merge so we can redeploy and test.
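The exclusion checks described above might be sketched as follows; the specific heuristics (non-empty, no rotation timestamp in the name, `file` reports text) are assumptions about the approach, not the merged code:

```shell
#!/bin/bash
# Sketch (heuristics assumed): decide whether a file is worth handing to
# the awslogs agent, to avoid uselessly spawning log readers.
should_ingest() {
  local f="$1"
  [ -s "$f" ] || return 1                  # skip empty or missing files
  case "$(basename "$f")" in
    *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*) return 1 ;;  # rotated copy with a timestamp
    *.gz|*.xz) return 1 ;;                 # compressed rotations are not text
  esac
  file "$f" | grep -q text || return 1     # skip binary files
}
```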
Waiting on merge -- https://github.com/18F/cg-awslogs-boshrelease/pull/7
I'm seeing logs with the new log stream name pattern everywhere I look in CloudWatch. I'm just not sure if there's any deploy I need to manually kick off to be sure we're collecting everything everywhere, but this looks good to me.
@sharms will accept later today
The current implementation triggers on pre-start,
which means it executes before any jobs start. The current code will not add any logs during this window, because it searches for log files that have already been written to (1), and those do not yet exist after a stemcell upgrade.
This worked on staging, as those systems are frequently rebooted without cause; however, we rarely reboot production.
Recommendations:
References:
Good catch! My bad for suggesting pre-start.
@datn Did you address Steve's feedback about this running in pre-start?
I'm not sure what the best approach is here, but I can see a few options:
Post-deploy might be a better place to move this, since we'd want all jobs to have started before we set up awslogs: https://bosh.io/docs/post-deploy.html. Or a cron job.
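If this moved to a cron job, one hedged shape for it (generator, paths, and the crontab entry are all hypothetical) is to regenerate the config on every run but only install it and restart the agent when it actually changed:

```shell
#!/bin/bash
# Sketch of a cron-driven refresh (paths hypothetical). Returns 0 when
# the config changed and the caller should restart awslogs, 1 otherwise.
refresh_config() {
  local new="$1" conf="$2"
  if cmp -s "$new" "$conf"; then
    rm -f "$new"
    return 1                   # unchanged; leave the agent alone
  fi
  mv "$new" "$conf"
  return 0                     # changed; restart awslogs (e.g. via monit)
}

# Hypothetical crontab entry driving it every five minutes:
# */5 * * * * root /var/vcap/jobs/awslogs/bin/regenerate-and-refresh
```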
I missed the notification that this went back into ready, will look at it and everyone's helpful suggestions today.
I'm not understanding the problem @sharms points out, so I'll ask for clarification in channel.
Tested on admin-ui in staging, and while new logs are being added to the awslogs config file, they do not appear to be making it to CloudWatch.
I think the problem is that monit is not in the root user's path by default, so this call to restart isn't executing.
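One way around that is to call monit by absolute path instead of relying on root's PATH. The location below is an assumption based on standard BOSH stemcells, and the helper is hypothetical:

```shell
#!/bin/bash
# Sketch: restart a monit-managed job by absolute path. MONIT_BIN is
# overridable so the default (assumed stemcell location) can be tested.
monit_restart() {
  "${MONIT_BIN:-/var/vcap/bosh/bin/monit}" restart "$1"
}
```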
https://github.com/18F/cg-awslogs-boshrelease/pull/12 uses a new approach to ensure logs are captured even on short-lived / ephemeral VMs
@datn can you accept this one?
Accepting this one.
In order to have all platform logs available for auditing, as Cloud Operations I want to send BOSH job logs to CloudWatch.
Acceptance Criteria
Implementation Sketch
/var/vcap/sys/log/*/*