Potential solution: run Minio in a Docker container at IceCube. Minio is an open-source, S3-compatible object store. A glidein site would be provided with a key and secret that they could put into their configuration file.
When submit is run on the client, a signed POST URL is generated using the Minio Python bindings. This URL is shipped as input with the job.
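For illustration, here is a minimal sketch of how the client could generate such a URL with the Minio Python bindings. The endpoint, credentials, bucket, and object names are placeholders, and a presigned PUT URL (which the later implementation settled on) is shown rather than a POST policy:

```python
from datetime import timedelta

from minio import Minio

# Placeholder endpoint and per-site credentials for the Minio instance at IceCube.
client = Minio(
    "minio.example.icecube.wisc.edu",
    access_key="SITE_ACCESS_KEY",
    secret_key="SITE_SECRET_KEY",
    secure=True,
)

# Presigned URL that lets the job upload one object without ever seeing the secret.
put_url = client.presigned_put_object(
    "startd-logs",                    # bucket (placeholder)
    "some_site/glidein_1234.tar.gz",  # object name (placeholder)
    expires=timedelta(days=7),
)

# The URL is then shipped with the job, e.g. as an environment variable.
print(put_url)
```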
The glidein bash script would have a trap here that would tar up the startd logs and upload them to the minio server at IceCube.
At IceCube, a second process called `log_importer` would be started on the pyglidein server. This would use the Minio Python bindings to watch for activity on buckets. When a new file arrives, the service would download it and use the ElasticSearch Python bindings to do a bulk import of the data. A good example of this in action is here.
Note that the ElasticSearch Python bindings aren't technically required if they become hard to work with. As an example, here is the inserter code from iceprod.
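A rough sketch of what such a `log_importer` loop could look like, assuming Minio bucket notifications and the ElasticSearch bulk helper; the endpoints, credentials, bucket, and index names below are all placeholders:

```python
import tarfile

from elasticsearch import Elasticsearch, helpers
from minio import Minio

minio_client = Minio("minio.example.icecube.wisc.edu",
                     access_key="IMPORTER_KEY",
                     secret_key="IMPORTER_SECRET",
                     secure=True)
es = Elasticsearch(["http://elasticsearch.example.icecube.wisc.edu:9200"])


def import_tarball(bucket, key):
    """Download an uploaded log tarball and bulk-index its members."""
    minio_client.fget_object(bucket, key, "/tmp/logs.tar.gz")
    actions = []
    with tarfile.open("/tmp/logs.tar.gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode(errors="replace")
            actions.append({
                "_index": "glidein-logs",
                "_source": {"object": key, "file": member.name, "text": text},
            })
    helpers.bulk(es, actions)


# Watch the bucket and import each newly created object as it arrives.
events = minio_client.listen_bucket_notification(
    "startd-logs", events=["s3:ObjectCreated:*"])
for notification in events:
    for record in notification["Records"]:
        import_tarball(record["s3"]["bucket"]["name"],
                       record["s3"]["object"]["key"])
```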
The team talked this afternoon after reviewing the logging code. One idea that came up was to inject the URL of the uploaded log file into a classad that gets shipped home.
Note for running multiple minio instances behind a reverse proxy: https://github.com/krishnasrinivas/cookbook/blob/68b6dab51f557ed437449104970abcf3bacf4b7b/docs/multi-tenancy-in-minio.md
My first attempt at this uses presigned PUT and GET S3 URLs generated by the client process at each grid site. Each site would have to add a `[StartdLogging]` section to their configuration that includes three variables:

- `send_startd_logs`: This can be set to True or False
- `url`: The S3 endpoint URL. This can either be AWS or a Minio instance.
- `bucket`: The name of the bucket that the log files should go to.

I added a new client flag called `--secrets` to the client command. It defaults to `.pyglidein_secrets` if not set by the user. The file is configured the same way as the config file, but should only contain secrets. The reason for pulling secrets out of the configs is to ensure users don't push secrets to the pyglidein repo. When StartdLogging is enabled the secrets file should also contain a `[StartdLogging]` section with these variables:
- `access_key`: S3 Access Key
- `secret_key`: S3 Secret Key

For each job the client submits to a cluster, it generates a presigned PUT and GET URL. These are passed as environment variables to the job. A `log_shipper` script is forked at job start time on the execute node that tars up the log directory and uploads the file to the S3 endpoint every five minutes. The glidein start script now respects SIGTERM and SIGINT. The condor process is killed and one more log shipment is run after receiving a SIGTERM or SIGINT from the scheduler. An example of the config and secrets sections is sketched below.
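As a concrete illustration (endpoint, bucket, and keys are placeholders), the two `[StartdLogging]` sections might look like this, with the first in the normal site config and the second in the `.pyglidein_secrets` file:

```ini
# site config (safe to commit)
[StartdLogging]
send_startd_logs = True
url = https://minio.example.icecube.wisc.edu
bucket = startd-logs

# .pyglidein_secrets (never committed to the repo)
[StartdLogging]
access_key = SITE_ACCESS_KEY
secret_key = SITE_SECRET_KEY
```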
A `PRESIGNED_GET_URL` classad is injected into each glidein startd using the `STARTD_ATTRS` expression. The classad can be accessed in the condor history file for debugging issues after a crash.
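For reference, the underlying HTCondor mechanism is a custom startd attribute published via `STARTD_ATTRS` in the glidein's condor config. A hedged sketch, assuming the URL arrives in an environment variable of the same name:

```
PRESIGNED_GET_URL = "$ENV(PRESIGNED_GET_URL)"
STARTD_ATTRS = $(STARTD_ATTRS) PRESIGNED_GET_URL
```

The attribute then shows up in the startd ClassAd and in the history file, e.g. via `condor_history -af PRESIGNED_GET_URL`.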
To create the IAM user, S3 bucket, and policy in AWS for shipping logs, I created a CloudFormation template that generates these resources. This template could be invoked for each site that wants to send logs, which ensures each site has its own set of credentials and permission to write to a single S3 bucket in AWS: https://github.com/WIPACrepo/pyglidein/blob/logging/cloud_formations/logging_bucket.json The bucket lifecycle is set to delete files older than 90 days so the size of the bucket doesn't get out of control.
In the event of a site going away, the entire CloudFormation stack could be deleted, which would delete all of the resources that were created for it as well.
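For completeness, a minimal sketch of creating and deleting such a per-site stack with boto3; the stack name, region, and the assumption that the template takes no parameters are placeholders:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# One stack per site: the template creates the IAM user, S3 bucket, and policy.
with open("cloud_formations/logging_bucket.json") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="pyglidein-logging-SOME_SITE",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # required because the template creates IAM resources
)

# If the site goes away, deleting the stack removes everything it created
# (the S3 bucket has to be emptied before CloudFormation can delete it).
# cfn.delete_stack(StackName="pyglidein-logging-SOME_SITE")
```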
We'd like to get the glidein logs and upload them to a central server.
I'd like to use HTTP PUT, maybe with basic authentication (could use the pool password if you wanted). That is simple enough that it should always work, without requiring cvmfs or anything installed on the worker node.
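For example, a worker-node upload along those lines could be done with nothing but the Python standard library; the URL, credentials, and file name below are placeholders:

```python
import base64
import urllib.request

# Hypothetical upload of a tarball of glidein logs via HTTP PUT with basic auth,
# using only the standard library so nothing extra is needed on the worker node.
url = "https://glidein-logs.example.icecube.wisc.edu/logs/site_glidein_1234.tar.gz"
credentials = base64.b64encode(b"glidein:pool-password").decode()

with open("logs.tar.gz", "rb") as f:
    req = urllib.request.Request(url, data=f.read(), method="PUT")
    req.add_header("Authorization", "Basic " + credentials)
    req.add_header("Content-Type", "application/gzip")
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```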