Skatteetaten / vagrant-hashistack-template

Starter template for development using the vagrantbox Skatteetaten/hashistack
14 stars 8 forks source link

Healthcheck countdash #63

Closed Neha-Sinha2305 closed 4 years ago

Neha-Sinha2305 commented 4 years ago

Closes issue?

closes #56

What was added?

Added ansible tasks to perform healthcheck for countdash nomad job

Checklist

pdmthorsrud commented 4 years ago
url: http://localhost:8500/v1/health/checks/count-dashboard

Earlier, I proposed not filtering on "passing" and just fetching all healthchecks, then failing if the resulting json contains failing or warning. I cannot quite remember who were in that discussion (I'll go looking for it now), but are you all still for explicitly checking for a number of tasks? I like not checking for a number because it needs less maintenance (changing number of checks won't break any ansible-healthchecks), but I remember at least Nikita wanted to use that method of checking because it is explicit. Any thoughts?

Neha-Sinha2305 commented 4 years ago
url: http://localhost:8500/v1/health/checks/count-dashboard

Earlier, I proposed not filtering on "passing" and just fetching all healthchecks, then failing if the resulting json contains failing or warning. I cannot quite remember who were in that discussion (I'll go looking for it now), but are you all still for explicitly checking for a number of tasks? I like not checking for a number because it needs less maintenance (changing number of checks won't break any ansible-healthchecks), but I remember at least Nikita wanted to use that method of checking because it is explicit. Any thoughts?

Do you mean changing the code to :

- name: countdash healthcheck test without consul token (when consul acl default policy is allow)
  uri:
    url: http://localhost:8500/v1/health/checks/count-dashboard
    method: GET
    return_content: yes
    status_code: 200
    body_format: json
  when: not lookup('env', 'consul_acl') | bool
  register: result_countdashboard
  retries: 15
  delay: 15
  until: '"critical" not in result_countdashboard | string'
  tags: test

If yes, then I would better go with the current implementation where we explicitly filter for "passing" status. (I can explain/prove it with code commits that we can revert later)

pdmthorsrud commented 4 years ago

If yes, then I would better go with the current implementation where we explicitly filter for "passing" status. (I can explain/prove it with code commits that we can revert later)

Yup, that is what I was thinking of, except checking for warning as well. Yes, I would very much like an explanation as to why! 😀

Neha-Sinha2305 commented 4 years ago

If yes, then I would better go with the current implementation where we explicitly filter for "passing" status. (I can explain/prove it with code commits that we can revert later)

Yup, that is what I was thinking of, except checking for warning as well. Yes, I would very much like an explanation as to why! 😀

Here you go. The latest commit has the following two ansible tasks that does the healthcheck when consul_acl_default_policy is set to deny. Inorder to show the difference in behavior i have taken away the consul token and so expect both of these to fail, however, you will notice that the first one does not fail and behaves as a false positive. Reason being that the api returns empty list when called without the token and so the first test passes as the verification point is not so explicit.Note: It would behave the same even if we check for "warning" or any other string. Please have a look at the latest github actions check for more details.

And if it is unclear then surely we should discuss it further(may be over a call). :)

https://github.com/fredrikhgrelland/vagrant-hashistack-template/blob/0fc5a078272df6e115392562396a4006da4c5422/template_example/dev/ansible/03_healthcheck_test.yml#L20-L50

pdmthorsrud commented 4 years ago

Here you go. The latest commit has the following two ansible tasks that does the healthcheck when consul_acl_default_policy is set to deny. Inorder to show the difference in behavior i have taken away the consul token and so expect both of these to fail, however, you will notice that the first one does not fail and behaves as a false positive. Reason being that the api returns empty list when called without the token and so the first test passes as the verification point is not so explicit.Note: It would behave the same even if we check for "warning" or any other string. Please have a look at the latest github actions check for more details.

Thank you so much! :) If I understand correctly it's a false positive because the list is empty, and therefore does not contain either critical or warning, but we could solve that by adding and length != 0? Then we have a type of check that could be universally used by all services in the future (which is valuable IMO). I realise I'm adding more and more complexity to my "simple" solution though. 😄

I'm open to the fact that I might be missing even more problems with my approach, but to me it seems easier, quicker, and more universal. It's not a huge deal, and I am OK with keeping it as is if you (and/or others) think this is a better solution! I realise this is the way we've done it until now as well.

Neha-Sinha2305 commented 4 years ago

Here you go. The latest commit has the following two ansible tasks that does the healthcheck when consul_acl_default_policy is set to deny. Inorder to show the difference in behavior i have taken away the consul token and so expect both of these to fail, however, you will notice that the first one does not fail and behaves as a false positive. Reason being that the api returns empty list when called without the token and so the first test passes as the verification point is not so explicit.Note: It would behave the same even if we check for "warning" or any other string. Please have a look at the latest github actions check for more details.

Thank you so much! :) If I understand correctly it's a false positive because the list is empty, and therefore does not contain either critical or warning, but we could solve that by adding and length != 0? Then we have a type of check that could be universally used by all services in the future (which is valuable IMO). I realise I'm adding more and more complexity to my "simple" solution though. 😄

I'm open to the fact that I might be missing even more problems with my approach, but to me it seems easier, quicker, and more universal. It's not a huge deal, and I am OK with keeping it as is if you (and/or others) think this is a better solution! I realise this is the way we've done it until now as well.

Since this test is added under template_example and is going to be an example code for users to get inspiration/clues from, so I have made the check/test generic as you suggested. The code now looks like this:

- name: countdash healthcheck test with consul token (when consul acl default policy is deny)
  uri:
    url: http://localhost:8500/v1/health/checks/count-dashboard
    method: GET
    headers:
      X-Consul-Token: "{{ consul_token }}"
    return_content: yes
    status_code: 200
    body_format: json
  when:
    - lookup('env', 'consul_acl') | bool
    - lookup('env', 'consul_acl_default_policy') == 'deny'
  register: result_countdashboard
  retries: 1
  delay: 1
  until:
    - result_countdashboard.json | length == 1
    - '"warning" not in result_countdashboard | string'
    - '"critical" not in result_countdashboard | string'
  tags: test