cloud-gov / cg-atlas

Repository hosting issues and artifacts related to operations of the cloud.gov platform
Creative Commons Zero v1.0 Universal

Capture all host logs for auditing purposes #156

Closed. sharms closed this issue 7 years ago

sharms commented 7 years ago

In order to have all platform logs available for auditing, as Cloud Operations I want to send BOSH job logs to CloudWatch.

Acceptance Criteria

Implementation Sketch

cnelson commented 7 years ago

Seems like this story needs an implementation sketch before we could think about moving forward on it. So... how do we propose to actually do this? I see a few possibilities:

Maintain an exhaustive list of logs to capture in the release

This seems to be what we are doing today. We'd just need to list everything else here and commit to keeping it up-to-date.

Modify our other releases to declare which logs they generate should be archived

See Compartmentalizing CloudWatch Logs Agent Configuration Files for an idea of how we could do this. We could have each release / job write out a configuration file declaring which of its logs should be backed up.
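
A minimal sketch of that idea, assuming the agent reads an include directory such as /var/awslogs/etc/config (as in the linked AWS post); the job name, log path, and group name below are illustrative, not the release's actual layout:

```bash
# Hypothetical fragment another release's job (gorouter here) could drop into
# the awslogs agent's include directory to declare its own log file.
cat > /var/awslogs/etc/config/gorouter.conf <<'EOF'
[gorouter-access]
file = /var/vcap/sys/log/gorouter/access.log
log_group_name = cloud-gov/gorouter
log_stream_name = {instance_id}
EOF
```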

Modify the awslogs startup scripts to generate a configuration on boot

I'm envisioning we'd have a small script that would iterate over something like /var/vcap/sys/log/*/*.log and append each file to the awslogs configuration before the agent starts.
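
A rough sketch of what that boot-time script might look like; the config path, group name, and stanza layout are assumptions rather than the release's actual code:

```bash
#!/bin/bash
# Sketch only: enumerate BOSH job logs at boot and append one stanza per file
# to the awslogs agent config before the agent starts.
set -eu
CONF=/var/vcap/jobs/awslogs/config/awslogs.conf   # assumed location

for log in /var/vcap/sys/log/*/*.log; do
  [ -e "$log" ] || continue        # glob matched nothing
  name=$(echo "$log" | sed 's|^/var/vcap/sys/log/||; s|/|_|g')
  cat >> "$CONF" <<EOF

[$name]
file = $log
log_group_name = cloud-gov/hosts
log_stream_name = {instance_id}_$name
EOF
done
```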

Thoughts? Better Ideas?

datn commented 7 years ago

Testing more comprehensive log capture on a test host.

datn commented 7 years ago

Committing my initial attempt at the pre-start script: https://github.com/18F/cg-awslogs-boshrelease/pull/5

Needs more work tomorrow, tho --

datn commented 7 years ago

The above PR seems ready to me. If @cnelson approves and merges, I can test and then push this to the next column.

datn commented 7 years ago

Changes merged to master -- now verifying we are seeing new log events in AWS console.

datn commented 7 years ago

Deployed awslogs changes to logsearch --

datn commented 7 years ago

It didn't work because the additional config directory was improperly specified.

datn commented 7 years ago

Okay, in testing it looks as if {instance_id} is not unique enough, which is causing zillions of this warning:

cwlogs.push.publisher - WARNING - 12345 - Thread-67 - Multiple agents might be sending log events to log stream (nats_to_syslog i-a89101112) with sequence token (LONG_STRING_OF_DIGITS|none). This could cause duplicates and is not recommended.

I might need to amend the log stream identifier, or there might be another problem.
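
For reference, the classic CloudWatch Logs agent can interpolate {instance_id}, {hostname}, or {ip_address} into log_stream_name, so one way to make the identifier more unique would be something like the following (the config path, job log path, and group name are assumptions):

```bash
# Illustrative only: a stanza whose stream identifier combines hostname and
# instance id with the job name, rather than {instance_id} alone.
CONF=/var/vcap/jobs/awslogs/config/awslogs.conf
cat >> "$CONF" <<'EOF'
[nats_to_syslog]
file = /var/vcap/sys/log/nats_to_syslog/nats_to_syslog.log
log_group_name = cloud-gov/hosts
log_stream_name = {hostname}_{instance_id}_nats_to_syslog
EOF
```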

datn commented 7 years ago

Okay, after testing different log_stream_name configs (didn't have much hope for that), I conclude that the aggressive harvesting of all files under /var/vcap/sys/log is causing sequence token errors. I'm going to be smarter about which files to ingest in the pre-start script, because gathering all valid logs for compliance is the most important goal here.

datn commented 7 years ago

Okay, I now have a unique log stream name and a sane group name, and I'm excluding obviously non-ingestable logs from the config (files that are empty or not plain text, and files whose names carry a rotation timestamp) to avoid uselessly spawning log readers. No more sequence token or multiple-agent errors.
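
A sketch of the kind of filtering and naming described above; the actual checks in the PR may differ, and the paths are assumptions:

```bash
# Reject files the agent shouldn't tail: empty files, non-text files, and
# rotated copies whose names carry a timestamp.
ingestable() {
  local f="$1"
  [ -s "$f" ] || return 1                                        # empty
  file --brief --mime-type "$f" | grep -q '^text/' || return 1   # not plain text
  [[ "$(basename "$f")" =~ [0-9]{8} ]] && return 1               # rotation timestamp
  return 0
}

for log in /var/vcap/sys/log/*/*.log; do
  ingestable "$log" || continue
  # One stream per file keeps multiple readers from fighting over one sequence
  # token, e.g. stream "{instance_id}_nats_nats_to_syslog.log" in one group.
  stream="{instance_id}_$(echo "$log" | sed 's|^/var/vcap/sys/log/||; s|/|_|g')"
  echo "would ingest $log into stream $stream"
done
```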

datn commented 7 years ago

Just waiting on a thumbs-up and merge so we can redeploy and test.

datn commented 7 years ago

Waiting on merge -- https://github.com/18F/cg-awslogs-boshrelease/pull/7

datn commented 7 years ago

I'm seeing logs with the new log stream name pattern everywhere I look in CloudWatch. I'm just not sure if there's any deploy I need to manually kick off to be sure we're collecting everything everywhere, but this looks good to me.

mogul commented 7 years ago

@sharms will accept later today

sharms commented 7 years ago

The current implementation runs on pre-start, which means it executes before any jobs start. At that point it will not add any logs, because it searches for log files that have already been written to (1), and those files do not yet exist after a stemcell upgrade.

This worked on staging because those systems are frequently rebooted (often without any particular cause), but we rarely reboot production.

Recommendations:

  1. Move log detection into startup script
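
One possible reading of recommendation 1, as a sketch only; the helper name and paths are assumptions, not the release's code:

```bash
#!/bin/bash
# Hypothetical monit start wrapper for the awslogs job: rebuild the config at
# start time rather than at pre-start, so detection re-runs whenever the job
# (re)starts, then launch the agent exactly as the existing start script does.
/var/vcap/jobs/awslogs/bin/generate-config   # assumed helper that writes the stanzas

# ...then exec the agent with the freshly written config, unchanged from today.
```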

References:

  1. https://github.com/18F/cg-awslogs-boshrelease/blob/master/jobs/awslogs/templates/bin/pre-start#L9

cnelson commented 7 years ago

Good catch! My bad for suggesting pre-start.

cnelson commented 7 years ago

@datn Did you address Steve's feedback about this running in pre-start?

I'm not sure what the best approach is here, but I can see a few options:

jmcarp commented 7 years ago

Post-deploy might be a better place to move this, since we'd want all jobs to have started before we set up awslogs: https://bosh.io/docs/post-deploy.html. Or a cron job.
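
For the cron variant, a minimal sketch; the generator script path is hypothetical, and the monit path is the usual stemcell location, so verify both against the release:

```bash
# Re-run log detection periodically so files created after boot get picked up,
# then restart the agent to load the regenerated config.
cat > /etc/cron.d/awslogs-refresh <<'EOF'
*/15 * * * * root /var/vcap/jobs/awslogs/bin/generate-config && /var/vcap/bosh/bin/monit restart awslogs
EOF
```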

datn commented 7 years ago

I missed the notification that this went back into Ready; I'll look at it and everyone's helpful suggestions today.

datn commented 7 years ago

I'm not understanding the problem @sharms points out, so I'll ask for clarification in channel.

datn commented 7 years ago

https://github.com/18F/cg-awslogs-boshrelease/pull/11

cnelson commented 7 years ago

Tested on admin-ui in staging, and while new logs are being added to the awslogs config file, they do not appear to be making it to CloudWatch.

I think the problem is that monit is not on the root user's PATH by default, so this call to restart isn't executing.
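
If that is the cause, the usual fix is to call monit by its absolute path rather than rely on root's PATH; a sketch, where the monit location is the standard stemcell path and the process name is assumed to be awslogs:

```bash
# Instead of a bare `monit restart awslogs`, use the absolute path...
/var/vcap/bosh/bin/monit restart awslogs

# ...or put the BOSH bin directory on PATH at the top of the script.
export PATH=/var/vcap/bosh/bin:$PATH
```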

cnelson commented 7 years ago

https://github.com/18F/cg-awslogs-boshrelease/pull/12 uses a new approach to ensure logs are captured even on short-lived / ephemeral VMs

mogul commented 7 years ago

@datn can you accept this one?

rogeruiz commented 7 years ago

Accepting this one.