bird-house / birdhouse-deploy

Scripts and configurations to deploy the various birds and servers required for a full-fledged production platform
https://birdhouse-deploy.readthedocs.io/en/latest/
Apache License 2.0

Need to add some monitoring and notification to the automated deployment system and PAVICS in general #12

Open tlvu opened 4 years ago

tlvu commented 4 years ago

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/140

Automated deployment was triggered but not performed on boreas because of:

```
++ git status -u --porcelain
+ '[' '!' -z '?? birdhouse/old_docker-compose.override.yml_18062019' ']'
+ echo 'ERROR: unclean repo'
ERROR: unclean repo
+ exit 1
```

We need a system to monitor the logs and send notifications when errors occur. This log-file error monitoring and notification can later be generalized to watch any system, so each system is not forced to reimplement its own monitoring and notification.
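As a sketch of the idea, a cron-driven watcher could scan for new `ERROR` lines and send a notification; every path, the recipient, and the use of `mail` below are assumptions for illustration, not the actual setup:

```sh
#!/bin/sh
# Sketch: notify on new ERROR lines appended to the autodeploy log.
LOG=/var/log/autodeploy.log   # hypothetical log location
POS=/var/tmp/autodeploy.pos   # remembers how far the previous run read

LAST=$(cat "$POS" 2>/dev/null || echo 0)
TOTAL=$(wc -l < "$LOG")
[ "$TOTAL" -lt "$LAST" ] && LAST=0   # log was rotated; start over

# Scan only the lines appended since the previous run.
NEW_ERRORS=$(tail -n "+$((LAST + 1))" "$LOG" | grep 'ERROR')
if [ -n "$NEW_ERRORS" ]; then
    printf '%s\n' "$NEW_ERRORS" | mail -s "autodeploy error on $(hostname)" ops@example.org
fi
echo "$TOTAL" > "$POS"
```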

This problem has triggered this issue https://github.com/Ouranosinc/PAVICS/issues/176

There are basically 4 types of monitoring that I think we need:

huard commented 1 year ago

Solutions to convert logs into Prometheus metrics:

huard commented 1 year ago

Managed to get Fluentd to parse Nginx logs.

1. Build a Docker image with the Prometheus plugin:

   ```Dockerfile
   FROM fluent/fluentd:edge

   # Use the root account to run apk
   USER root

   # The RUN below includes plugins as examples; elasticsearch is not required.
   # Customize the included plugins as you wish.
   RUN apk add --no-cache --update --virtual .build-deps \
           sudo build-base ruby-dev \
       && sudo gem install fluent-plugin-prometheus \
       && sudo gem sources --clear-all \
       && apk del .build-deps \
       && rm -rf /tmp/* /var/tmp/* /usr/lib/ruby/gems/*/cache/*.gem

   COPY fluent.conf /fluentd/etc/
   COPY entrypoint.sh /bin/

   USER fluent
   ```

2. Configure `fluent.conf`:

   ```
   <source>
     @type tail
     path /var/log/nginx/access_file.log.*
     # pos_file /var/log/td-agent/nginx-access-file.pos  # Store tail position across restarts
     # follow_inodes true  # Without this parameter, file rotation causes log duplication.
     refresh_interval 2
     <parse>
       @type regexp
       expression /^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] \"(?<method>\w+)(?:\s+(?<path>[^\"]*?)(?:\s+\S*)?)?\" (?<status_code>[^ ]*) (?<size>[^ ]*)(?:\s"(?<referer>[^\"]*)") "(?<agent>[^\"]*)" (?<urt>[^ ]*)$/
       time_format %Y-%m-%dT%H:%M:%S%:z
       keep_time_key true
       types size:integer,urt:float
     </parse>
     tag nginx
   </source>

   # Keep only THREDDS download requests.
   <filter nginx>
     @type grep
     <regexp>
       key path
       pattern /\/twitcher\/ows\/proxy\/thredds\/(dodsC|fileserver)\//
     </regexp>
   </filter>

   # Extract the THREDDS service and dataset from the request path.
   # NOTE: "query" and "value" are placeholder group names.
   <filter nginx>
     @type parser
     key_name path
     reserve_data true
     <parse>
       @type regexp
       expression /.*?\/thredds\/(?<tds_service>[^\/]+)(?:\/(?<dataset>[^\?]*))(?:\?(?<query>[^\=]+))?(?:=(?<value>.*))?/
     </parse>
   </filter>

   # Turn the parsed records into Prometheus counters.
   <filter nginx>
     @type prometheus
     <metric>
       name nginx_size_bytes_total
       type counter
       desc nginx bytes sent
       key size
     </metric>
     <metric>
       name nginx_thredds_transfer_size_kb
       type counter
       desc THREDDS data transferred [kb]
       key size
       <labels>
         remote ${remote}
         tds_service ${tds_service}
         dataset ${dataset}
       </labels>
     </metric>
   </filter>

   # Expose metrics in Prometheus format.
   <source>
     @type prometheus
     bind 0.0.0.0
     port 24231
     metrics_path /metrics
   </source>

   <source>
     @type prometheus_output_monitor
     interval 2
     <labels>
       hostname ${hostname}
     </labels>
   </source>
   ```
3. Launch the container. It will monitor the logs and parse new entries as they are appended. The `pos_file` stores the position of the last read, so restarts do not re-ingest lines already processed.
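For example, a minimal sketch of building and launching it (the image name and mount path are assumptions):

```sh
# Build the image from the Dockerfile above.
docker build -t fluentd-prometheus .

# Run it with the Nginx logs mounted read-only and the metrics port exposed.
docker run -d --name fluentd-prometheus \
    -v /var/log/nginx:/var/log/nginx:ro \
    -p 24231:24231 \
    fluentd-prometheus

# Prometheus can then scrape http://<host>:24231/metrics
```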
huard commented 1 year ago

For the record, the installation of the service was straightforward, but its configuration was painful. The documentation is confusing (the same names mean different things in different contexts), and the regexps use Ruby syntax, which differs slightly from Python's. The error messages are, however, clear enough to fix issues as they arise.
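One concrete instance of that syntax gap is named capture groups:

```
# Ruby (Fluentd):  (?<status_code>[^ ]*)
# Python:          (?P<status_code>[^ ]*)
```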

tlvu commented 1 year ago

@fmigneault I have a vague memory that canarie-api parses the nginx/access_file.log. Any problem if the format of that file changes to JSON? JSON is easier to parse when we need to ingest the logs to extract metrics.
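For instance, once the access log is JSON, extracting a metric needs no hand-written regex. A sketch with `jq`, assuming the JSON records carry `path` and `size` fields (the file name and field names are hypothetical):

```sh
# Sum the bytes sent by THREDDS requests (file and field names are hypothetical).
jq -r 'select(.path | test("/thredds/")) | .size' /var/log/nginx/access_json.log \
    | awk '{ total += $1 } END { print total }'
```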

fmigneault commented 1 year ago

@tlvu Yes. It parses the logs using this regex: https://github.com/Ouranosinc/CanarieAPI/blob/master/canarieapi/logparser.py#L53 We would need a new option and a small code edit to handle the JSON format instead.

tlvu commented 1 year ago

@fmigneault Finally, I think Nginx allows writing the same log to several files, so I'll keep the existing log file intact and write the JSON format to a different log file; we can then parse that other file. This is a more flexible and backward-compatible solution. It will probably result in a new optional-component so the solution can be reused by other organizations deploying PAVICS.
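A sketch of what that could look like in the Nginx config; the `json_combined` format name and file paths are illustrative, the existing format is assumed to be called `main`, and `escape=json` requires Nginx 1.11.8 or later. The JSON keys mirror the field names used in the Fluentd regexp above:

```nginx
http {
    # Keep the existing log untouched for canarie-api...
    access_log /var/log/nginx/access_file.log main;

    # ...and write the same requests as JSON to a second file for metrics ingestion.
    log_format json_combined escape=json
        '{"remote":"$remote_addr","host":"$host","user":"$remote_user",'
        '"time":"$time_iso8601","method":"$request_method","path":"$request_uri",'
        '"status_code":"$status","size":"$body_bytes_sent",'
        '"referer":"$http_referer","agent":"$http_user_agent",'
        '"urt":"$request_time"}';
    access_log /var/log/nginx/access_json.log json_combined;
}
```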

huard commented 5 months ago

This looks promising: https://github.com/martin-helmich/prometheus-nginxlog-exporter
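From its README, the exporter derives metrics directly from the Nginx `log_format` string instead of a custom regexp. A rough, untested sketch of a YAML config (names and paths are illustrative, and keys may differ between versions):

```yaml
listen:
  port: 4040

namespaces:
  - name: nginx
    # Must match the log_format used by Nginx.
    format: "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\""
    source:
      files:
        - /var/log/nginx/access_file.log
```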