cirocosta / slirunner

Concourse SLI probes runner

slirunner metrics are not available in prometheus #3

Open gowrisankar22 opened 4 years ago

gowrisankar22 commented 4 years ago

Hello @cirocosta

I have tried running slirunner in my Concourse setup, but I am not able to scrape any metrics.

Steps:

  1. Changed the credentials in https://github.com/cirocosta/slirunner/blob/master/examples/kubernetes.yaml
  2. Created a ServiceMonitor for Prometheus to scrape the metrics, but no luck:
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      annotations:
      labels:
        app.kubernetes.io/name: slirunner
        helm.sh/chart: slirunner-0.1.0
        app.kubernetes.io/instance: release-name
        app.kubernetes.io/version: "0.0.1"
        app.kubernetes.io/managed-by: Tiller
      name: slirunner
      namespace: default
    spec:
      endpoints:
      - interval: 30s
        port: prometheus
      namespaceSelector:
        matchNames:
        - default
      selector:
        matchLabels:
          app.kubernetes.io/name: slirunner
          app.kubernetes.io/instance: release-name
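One thing worth double-checking (an assumption about the setup, not something visible in the snippet above): a ServiceMonitor's `endpoints.port` matches a Service port by *name*, so for this spec to select anything, the slirunner Service needs a port literally named `prometheus`. A hypothetical Service that would satisfy it:

```yaml
# Hypothetical Service for slirunner; the port *name* must match the
# `port: prometheus` field in the ServiceMonitor's endpoints list.
apiVersion: v1
kind: Service
metadata:
  name: slirunner
  labels:
    app.kubernetes.io/name: slirunner
    app.kubernetes.io/instance: release-name
spec:
  selector:
    app.kubernetes.io/name: slirunner
  ports:
  - name: prometheus   # referenced by the ServiceMonitor
    port: 9001
    targetPort: 9001
```

If the Service exposes the port unnamed or under a different name, the ServiceMonitor silently matches no endpoints, which looks exactly like an empty target with no up/down status.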

logs:

COMMAND FAILURE---
+ fly -t concourse-ci set-pipeline -n -p slirunner-hijack-failing-build -c /dev/fd/63
++ echo '
resources:
- name: time-trigger
  type: time
  source: {interval: 24h}

jobs:
- name: simple-job
  build_logs_to_retain: 20
  public: true
  plan:
  - &say-hello
    task: say-hello
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: echo
        args: ["Hello, world!"]

- name: failing
  build_logs_to_retain: 20
  public: true
{"timestamp":"1585470723.331840277","source":"slirunner","message":"slirunner.run.hijack-failing-build.finish","log_level":2,"data":{"error":"command execution failed: command didn't finish on time: context deadline exceeded","session":"1.275"}}
  plan:
  - task: fail
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: false

- name: auto-triggering
  build_logs_to_retain: 20
  public: true
  plan:
  - get: time-trigger
    trigger: true
  - *say-hello
'
no changes to apply
+ fly -t concourse-ci unpause-pipeline -p slirunner-hijack-failing-build
unpaused 'slirunner-hijack-failing-build'
+ job_name=slirunner-hijack-failing-build/failing
+ fly -t concourse-ci trigger-job -j slirunner-hijack-failing-build/failing -w
started slirunner-hijack-failing-build/failing #55

initializing
running false
failed
+ true
++ fly -t concourse-ci builds -j slirunner-hijack-failing-build/failing
++ head -1
++ awk '{print $3}'
+ build=55
+ fly -t concourse-ci hijack -j slirunner-hijack-failing-build/failing -b 55 echo Hello World
Hello World

COMMAND FAILURE---
+ fly -t concourse-ci destroy-pipeline -n -p slirunner-create-and-run-new-pipeline
!!! this will remove all data for pipeline `slirunner-create-and-run-new-pipeline`

`slirunner-create-and-run-new-pipeline` does not exist
+ fly -t concourse-ci set-pipeline -n -p slirunner-create-and-run-new-pipeline -c /dev/fd/63
++ echo '
resources:
- name: time-trigger
  type: time
  source: {interval: 24h}

jobs:
- name: simple-job
  build_logs_to_retain: 20
  public: true
  plan:
  - &say-hello
    task: say-hello
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: echo
        args: ["Hello, world!"]

- name: failing
  build_logs_to_retain: 20
  public: true
  plan:
  - task: fail
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: false

- name: auto-triggering
{"timestamp":"1585470738.196414471","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.finish","log_level":2,"data":{"error":"command execution failed: command didn't finish on time: context deadline exceeded","session":"1.274"}}
{"timestamp":"1585470738.196796417","source":"slirunner","message":"slirunner.run.login.start","log_level":1,"data":{"session":"1.276"}}
  build_logs_to_retain: 20
  public: true
  plan:
  - get: time-trigger
    trigger: true
  - *say-hello
'
resources:
  resource time-trigger has been added:
+ name: time-trigger
+ source:
+   interval: 24h
+ type: time

jobs:
  job simple-job has been added:
+ build_logs_to_retain: 20
+ name: simple-job
+ plan:
+ - config:
+     container_limits: {}
+     image_resource:
+       source:
+         repository: busybox
+       type: registry-image
+     platform: linux
+     run:
+       args:
+       - Hello, world!
+       path: echo
+   task: say-hello
+ public: true

  job failing has been added:
+ build_logs_to_retain: 20
+ name: failing
+ plan:
+ - config:
+     container_limits: {}
+     image_resource:
+       source:
+         repository: busybox
+       type: registry-image
+     platform: linux
+     run:
+       path: "false"
+   task: fail
+ public: true

  job auto-triggering has been added:
+ build_logs_to_retain: 20
+ name: auto-triggering
+ plan:
+ - get: time-trigger
+   trigger: true
+ - config:
+     container_limits: {}
+     image_resource:
+       source:
+         repository: busybox
+       type: registry-image
+     platform: linux
+     run:
+       args:
+       - Hello, world!
+       path: echo
+   task: say-hello
+ public: true

pipeline created!
you can view your pipeline here: https://central-concourse-test.cfopstest.eu1.sapxdc.io/teams/main/pipelines/slirunner-create-and-run-new-pipeline

the pipeline is currently paused. to unpause, either:
  - run the unpause-pipeline command:
    fly -t concourse-ci unpause-pipeline -p slirunner-create-and-run-new-pipeline
  - click play next to the pipeline in the web ui
+ fly -t concourse-ci unpause-pipeline -p slirunner-create-and-run-new-pipeline
unpaused 'slirunner-create-and-run-new-pipeline'
++ wait_for_build
++ fly -t concourse-ci builds -j slirunner-create-and-run-new-pipeline/auto-triggering
++ grep -v pending
++ wc -l
+ '[' 0 -gt 0 ']'
+ echo 'waiting for job to automatically trigger...'
waiting for job to automatically trigger...
+ sleep 1
[the wait loop above repeats once per second until a build appears; identical iterations trimmed]
++ wait_for_build
++ fly -t concourse-ci builds -j slirunner-create-and-run-new-pipeline/auto-triggering
++ grep -v pending
++ wc -l
+ '[' 1 -gt 0 ']'
+ fly -t concourse-ci watch -j slirunner-create-and-run-new-pipeline/auto-triggering
initializing
running echo Hello, world!
Hello, world!
succeeded

{"timestamp":"1585470739.497804403","source":"slirunner","message":"slirunner.run.login.finish","log_level":1,"data":{"session":"1.276"}}
{"timestamp":"1585470739.497954607","source":"slirunner","message":"slirunner.run.sync.start","log_level":1,"data":{"session":"1.277"}}
{"timestamp":"1585470740.499159813","source":"slirunner","message":"slirunner.run.sync.finish","log_level":1,"data":{"session":"1.277"}}
{"timestamp":"1585470740.499376059","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.start","log_level":1,"data":{"session":"1.279"}}
{"timestamp":"1585470740.499399185","source":"slirunner","message":"slirunner.run.run-existing-pipeline.start","log_level":1,"data":{"session":"1.278"}}
{"timestamp":"1585470740.499406576","source":"slirunner","message":"slirunner.run.hijack-failing-build.start","log_level":1,"data":{"session":"1.280"}}
{"timestamp":"1585470777.197618961","source":"slirunner","message":"slirunner.run.run-existing-pipeline.finish","log_level":1,"data":{"session":"1.278"}}
{"timestamp":"1585470779.298599243","source":"slirunner","message":"slirunner.run.hijack-failing-build.finish","log_level":1,"data":{"session":"1.280"}}
{"timestamp":"1585470779.597861052","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.finish","log_level":1,"data":{"session":"1.279"}}
{"timestamp":"1585470781.057423592","source":"slirunner","message":"slirunner.run.login.start","log_level":1,"data":{"session":"1.281"}}
{"timestamp":"1585470782.365731478","source":"slirunner","message":"slirunner.run.login.finish","log_level":1,"data":{"session":"1.281"}}
{"timestamp":"1585470782.365795612","source":"slirunner","message":"slirunner.run.sync.start","log_level":1,"data":{"session":"1.282"}}
{"timestamp":"1585470783.397989273","source":"slirunner","message":"slirunner.run.sync.finish","log_level":1,"data":{"session":"1.282"}}
{"timestamp":"1585470783.398109436","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.start","log_level":1,"data":{"session":"1.284"}}
{"timestamp":"1585470783.398141146","source":"slirunner","message":"slirunner.run.run-existing-pipeline.start","log_level":1,"data":{"session":"1.283"}}
{"timestamp":"1585470783.398136616","source":"slirunner","message":"slirunner.run.hijack-failing-build.start","log_level":1,"data":{"session":"1.285"}}
{"timestamp":"1585470817.097322941","source":"slirunner","message":"slirunner.run.run-existing-pipeline.finish","log_level":1,"data":{"session":"1.283"}}
{"timestamp":"1585470818.894902945","source":"slirunner","message":"slirunner.run.hijack-failing-build.finish","log_level":1,"data":{"session":"1.285"}}
{"timestamp":"1585470838.499372721","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.finish","log_level":1,"data":{"session":"1.284"}}
{"timestamp":"1585470841.057343245","source":"slirunner","message":"slirunner.run.login.start","log_level":1,"data":{"session":"1.286"}}
{"timestamp":"1585470842.584072590","source":"slirunner","message":"slirunner.run.login.finish","log_level":1,"data":{"session":"1.286"}}
{"timestamp":"1585470842.584154606","source":"slirunner","message":"slirunner.run.sync.start","log_level":1,"data":{"session":"1.287"}}
{"timestamp":"1585470843.503933668","source":"slirunner","message":"slirunner.run.sync.finish","log_level":1,"data":{"session":"1.287"}}
{"timestamp":"1585470843.504080057","source":"slirunner","message":"slirunner.run.run-existing-pipeline.start","log_level":1,"data":{"session":"1.288"}}
{"timestamp":"1585470843.504645824","source":"slirunner","message":"slirunner.run.hijack-failing-build.start","log_level":1,"data":{"session":"1.290"}}
{"timestamp":"1585470843.504053593","source":"slirunner","message":"slirunner.run.create-and-run-new-pipeline.start","log_level":1,"data":{"session":"1.289"}}
cirocosta commented 4 years ago

aaah that's weird - did prometheus find it as a target? did you get any response from the metrics endpoint? you can try kubectl exec'in into the container and then hitting the prometheus port - if that gives you something, it's probably something to do with how prometheus is scraping it

gowrisankar22 commented 4 years ago

@cirocosta I can find the target, but it is empty; there is no status in it (up or down). The container port also seems to be fine. It looks like a Prometheus scraping issue. How exactly is your setup working?

Can you share your ServiceMonitor spec and verify that examples/kubernetes.yaml is up to date?

cirocosta commented 4 years ago

hmmmm interesting

we actually don't use the Prometheus Operator; instead, we annotate every service whose endpoints we want discovered, e.g.:

service:
  type: ClusterIP
  port: 9001
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9001"

https://github.com/concourse/hush-house/blob/72dc4f62df12fd0240b1af3c39b70f63d2c89527/deployments/with-creds/slirunner/values.yaml#L10-L15

which then gets discovered by this service discovery config:

https://github.com/concourse/hush-house/blob/72dc4f62df12fd0240b1af3c39b70f63d2c89527/deployments/with-creds/metrics/values.yaml#L47-L52

https://github.com/concourse/hush-house/blob/72dc4f62df12fd0240b1af3c39b70f63d2c89527/deployments/with-creds/metrics/values.yaml#L47-L74
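For reference, the linked values.yaml is the source of truth; an annotation-driven service discovery block typically looks something like this (a sketch of the common pattern, not a copy of that file):

```yaml
# Sketch of a Prometheus scrape_config that discovers annotated services:
# endpoints whose Service carries prometheus.io/scrape: "true" are kept,
# and prometheus.io/port overrides the scraped port.
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
```

With this model there is no ServiceMonitor at all: Prometheus itself watches the Kubernetes API and builds targets straight from the service annotations.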

gowrisankar22 commented 4 years ago

@cirocosta Thanks a lot. The scraper configuration you shared did the trick. Now everything works like a charm.

A few more questions:

The kubernetes.yaml file has a secret for the Concourse username and password, and you said the secrets can be consumed via env, but that is not working:

https://github.com/cirocosta/slirunner/blob/master/examples/kubernetes.yaml#L3-L16

COMMAND FAILURE---
error: expected argument for flag `-u, --username', but got option `-p'

{"timestamp":"1585632833.303583860","source":"slirunner","message":"slirunner.run.login.finish","log_level":2,"data":{"error":"command execution failed: exit status 1","session":"1.1"}}

https://github.com/cirocosta/slirunner/blob/9441b6efc153596ccd7eaa361488e1a0f3f7cfd3/probes/all.go#L10
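One plausible cause (an assumption, since how the env vars resolve inside the pod isn't shown): if the username variable expands to empty and is interpolated unquoted into the command line, its token vanishes under word splitting, so `--password` ends up positioned as `--username`'s argument, which the flag parser rejects with exactly this error. A minimal shell illustration:

```shell
USERNAME=""
PASSWORD="secret"
# Unquoted on purpose: the empty $USERNAME token vanishes under word
# splitting, so --password becomes the token right after --username.
set -- --username $USERNAME --password $PASSWORD
echo "argc=$# argv=$*"   # prints: argc=3 argv=--username --password secret
```

A parser seeing those three tokens has no argument for `--username` other than the option-looking `--password`, hence the failure.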

It only works if you pass the user credentials directly, like below:

    command:
      - start
      - --target=test
      - --concourse-url=http://web:8080
      - --password=test
      - --username=test

I have created a PR to read them from env: #4
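The env-based approach could, for instance, be approximated with a small wrapper that assembles the flags from environment variables (the variable names here are illustrative, not necessarily what the PR uses):

```shell
# Illustrative wrapper: build the slirunner flags from env vars so that
# credentials come from a Secret-backed environment rather than being
# hard-coded in the pod spec. Defaults below are placeholders.
: "${CONCOURSE_URL:=http://web:8080}"
: "${CONCOURSE_USERNAME:=test}"
: "${CONCOURSE_PASSWORD:=test}"

args="start --target=test --concourse-url=$CONCOURSE_URL"
args="$args --username=$CONCOURSE_USERNAME --password=$CONCOURSE_PASSWORD"
echo "$args"
```

In a Deployment, the variables would come from `envFrom`/`secretKeyRef` entries, keeping credentials out of the visible `command:` list.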

Also, can you let me know how often these probes run, and share some more details? It is a really cool thing for monitoring Concourse.