ManageIQ / manageiq

ManageIQ Open-Source Management Platform
https://manageiq.org
Apache License 2.0

ManageIQ silently throws out Azure events if appliance time isn't synchronized #14838

Open izapolsk opened 7 years ago

izapolsk commented 7 years ago

I noticed that our sprout appliances with the default chrony config weren't synchronized with the time servers for some reason. That caused ManageIQ to throw out all of the Azure events it obtained, and the appliance logs didn't contain any warnings about it. So it looked like ManageIQ was receiving events, but the event_streams table didn't contain any of them. I believe the logs should contain a warning when events are rejected because of a wrong time/date.
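
A quick way to confirm that symptom on an appliance is to check the event_streams table directly from the Rails console. This is only an illustrative snippet; the EventStream model exists in ManageIQ, but the column name used for filtering here is an assumption.

# Rails console sketch: count events persisted recently.
# The :timestamp column name is assumed for illustration.
recent = EventStream.where("timestamp > ?", 1.hour.ago)
puts "event_streams rows in the last hour: #{recent.count}"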

djberg96 commented 7 years ago

@carbonin Is this something we can check on the backend and log?

carbonin commented 7 years ago

Check how @djberg96 ?

We can't really know that the time is not synced properly without knowing what the "right" time is, and if we knew that we would just set the time to that 😉

We restart chronyd when MiqServer starts so that should do a quick sync.

@izapolsk are these appliances that hang around for a long time? Maybe we need to force a time sync on a schedule or tune chrony to just work better in this type of environment.
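
For reference, a forced sync on a schedule could be as small as shelling out to chronyc. This is a sketch only; it assumes chronyd is installed on the appliance and that the caller has the privileges chronyc needs.

# Sketch: ask chronyd to step the clock immediately instead of slewing.
def force_time_sync
  system("chronyc", "makestep") || warn("chronyc makestep failed")
end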

In general I think this behavior of throwing out events should be logged in some way though. I'm not sure what the logic around this is, but if the time is off in the way I would expect in a virtual environment, the events would look like they were from the future, which for sure requires some special attention 😄

djberg96 commented 7 years ago

@carbonin You mentioned forcing a time sync from time to time or tuning chrony. Both seem reasonable.

izapolsk commented 7 years ago

@carbonin, when I figured out the reason, I made sprout (our tool for preparing appliances) configure chrony in the appliances it prepares, so the issue is already fixed on our side. However, it took a lot of time to find out what caused it. I wouldn't have spent so much time if the logs had contained a statement that events were being rejected because of a time mismatch.

So, it would be great to add some logging for cases when events are dropped for some reason.

djberg96 commented 7 years ago

@izapolsk Since you've fixed this for Sprout, and this isn't an actual production issue, can you please close?

izapolsk commented 7 years ago

@djberg96, actually, the issue was mainly about logging. CFME still silently throws out events, even at debug log level, so I wouldn't want to close it.

djberg96 commented 7 years ago

@izapolsk Then I leave this up to @carbonin to accept/reject. :)

carbonin commented 7 years ago

@djberg96 It seems like this is out of my area now. I'll let @bronaghs take a look at whether we need to log more verbosely when we throw events out due to timing issues.

durandom commented 6 years ago

Btw, the same problem exists with Amazon, but AWS doesn't fail silently; it complains about unauthenticated requests, IIRC. They use the time to sign the requests.

djberg96 commented 6 years ago

Through some trial and error on a Linux box where I disabled ntp and set the date/time manually, it seems that Amazon has a maximum offset of 300 seconds (5 minutes) before it complains. I'm guessing we'll need to be more granular for metrics.

We can handle this programmatically for all providers I think, with this small bit of code:

require 'net/ntp'

# The offset can be negative if the local clock is behind, so compare the
# absolute value against the threshold.
if Net::NTP.get.offset.abs >= 60 # or whatever threshold we want to set
  warn "Your clock is off. Fix it and try again"
  # add real handling here
end

Whether this should be in the provider code, or implemented in a more generic manner within the core code somehow, I'm not sure. I feel like this is one of those things that should be checked periodically by the app. Maybe on launch and then once per 24 hours after that or something.
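
A slightly fuller sketch of that periodic check, building on the snippet above: the helper below wraps Net::NTP.get (the only API taken from the snippet) with error handling, and the NTP host and 60-second threshold are arbitrary placeholder choices, not project settings.

require 'net/ntp'

# Sketch of a check that could run at launch and then again on a schedule.
def clock_skew_seconds(host = "pool.ntp.org")
  Net::NTP.get(host).offset.abs
rescue StandardError => e
  warn "NTP check failed: #{e.message}"
  nil
end

skew = clock_skew_seconds
if skew && skew >= 60 # placeholder threshold
  warn "Appliance clock is off by ~#{skew.round} seconds; provider events may be dropped"
end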

blomquisg commented 6 years ago

Hrm, interesting @djberg96 .... I'd like to see this propagated as a Notification (one of the things that shows up in the UI under the "bell" icon).

I think the only real handling we can do is to shut down the impacted workers and alert the admin (Notification, logs, etc.). I doubt there's much we can do otherwise.

As for whether this belongs in the Provider or not ... I think it should be in two places:

  1. The main MiqServer (or Orchestrator, in an H-release world) should check whether the chronyd service is running (see the sketch after this list). I'm not actually sure if it can do this (in either the current Virt Appliance or in the future container world), but it would be fantastic if it could. Then it could indicate systemic problems instead of letting all the workers figure it out for themselves.

  2. Even if the MiqServer can do this, the individual workers should still include the type of error handling you're talking about. The workers should be able to recognize the fact that they're in a bad situation and be able to abort if necessary, and let MiqServer/Orchestrator know why they aborted.
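
As a rough illustration of the check described in item 1, something like the following could tell MiqServer whether the chronyd unit is active. It assumes a systemd-based appliance (it would not apply in a container) and is a sketch rather than existing code.

# Sketch: true if the chronyd systemd unit reports "active".
def chronyd_active?
  system("systemctl", "is-active", "--quiet", "chronyd")
end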

djberg96 commented 6 years ago

Well, I think if the skew is >= 60 seconds, it's probably a safe assumption that either chronyd isn't running or it's disabled, so I don't know that we need to scan the process table for chronyd, though we could if we have to via sys-proctable.
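
For completeness, a process-table scan along the lines mentioned here might look like the sketch below. The comm field is Linux-specific, so treat the field name as an assumption.

require 'sys/proctable'

# Sketch: look for a running chronyd in the process table.
def chronyd_running?
  Sys::ProcTable.ps.any? { |process| process.comm == "chronyd" }
end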

blomquisg commented 6 years ago

@carbonin / @djberg96, resurrecting this thread

I see two ways of handling this.

  1. Have something more fundamental to the appliance monitor the time displacement. Could be a Rails initializer, or even a Schedule Worker job.

  2. Have all providers be responsible for reacting when their API requests fail because of a time displacement.

I think that having one place to monitor this is probably better. I'm not sure I want to keep the appliance from starting (as I think an initializer would do). So maybe a Schedule Worker job? But we probably also need to make sure that if/when a Provider API call fails because of a time displacement problem, MIQ can notify the end user why something they expected to work didn't.

/cc @Fryguy
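
To make the Schedule Worker idea concrete, a periodic job body could reuse a skew check like the one sketched earlier and surface the result. The clock_skew_seconds helper and the commented-out Notification call are illustrative assumptions, not existing ManageIQ code.

# Sketch of a periodic job body, assuming the clock_skew_seconds helper
# sketched earlier in this thread.
def check_appliance_clock(threshold = 60)
  skew = clock_skew_seconds
  return if skew.nil? || skew < threshold

  Rails.logger.warn("Appliance clock skew is ~#{skew.round}s; provider events may be rejected")
  # Notification.create(...) # hypothetical: also surface this under the UI "bell" icon
end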

miq-bot commented 6 years ago

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

JPrause commented 5 years ago

@izapolsk is this still a valid issue. If not can you close. If there's no update by next week, I'll be closing this issue.

JPrause commented 5 years ago

Closing issue. If you feel the issue needs to remain open, please let me know and it will be reopened. @miq-bot close_issue

chessbyte commented 3 years ago

Don't understand why this was ever closed.

bdunne commented 3 years ago

FYI... we're getting the application out of the business of managing NTP on the platform that we run on. It's not possible in a podified environment and some things weren't working on the appliance, so rather than re-writing it, we decided to tell people to use Cockpit for that on the appliance. https://github.com/ManageIQ/manageiq-appliance_console/pull/99 https://github.com/ManageIQ/manageiq/pull/20623

chessbyte commented 3 years ago

@agrare and @Fryguy suggested an approach that, during credential validation, checks the time reported by the provider against the current time on ManageIQ. If they differ by more than some threshold, at minimum warn about it, and at most disable the provider until the time disparity is addressed.
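
As a rough sketch of that approach: most HTTPS endpoints return a Date header, so a credential-validation step could compare it against the local clock. The endpoint URL, threshold, and method name below are illustrative assumptions rather than existing ManageIQ code.

require 'net/http'
require 'time'

# Sketch: compare the provider's HTTP Date header with the local clock.
def provider_clock_skew(endpoint = URI("https://management.azure.com/"), threshold = 300)
  response = Net::HTTP.get_response(endpoint)
  date_header = response["Date"]
  return nil unless date_header

  skew = (Time.now.utc - Time.httpdate(date_header)).abs
  if skew > threshold
    warn "Clock differs from the provider by ~#{skew.round}s; events and signed requests may fail"
    # here the provider could be flagged or disabled until the time is fixed
  end
  skew
end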