dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Expire messages based on arrival time instead of send time #1524

Closed · sergeybykov closed this 7 years ago

sergeybykov commented 8 years ago

To avoid wasteful execution of requests that have already timed out on the caller side, we expire and drop messages whose send timestamp is more than the call timeout (30 seconds by default) in the past. This logic assumes that all silos and clients have clocks synchronized within a reasonable threshold of a few seconds.

We saw an interesting outage recently when the clocks of a number of machines, for whatever reason, got set ~50 seconds forward and were reset back hours later. This caused lots of messages to expire upon arrival and their calls to time out.

The proposal is to

  1. Timestamp messages upon arrival to silo.
  2. Expire messages based on their arrival timestamp instead of their send timestamp.
  3. Log a warning every time a message would have expired based on the send timestamp, to help detect significant clock drifts and to ease troubleshooting of such incidents.
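
A minimal sketch of what this could look like (hypothetical names; this is not the actual Orleans `Message` API):

```csharp
using System;

// Hypothetical sketch of the proposal; field names do not match the real Orleans Message type.
class Message
{
    public DateTime SentTimeUtc;     // stamped by the sender, using its own clock
    public DateTime ArrivedTimeUtc;  // (1) stamped by the receiving silo on arrival
}

static class MessageExpiration
{
    public static bool IsExpired(Message msg, TimeSpan timeout, DateTime nowUtc)
    {
        bool expiredByArrival = nowUtc - msg.ArrivedTimeUtc > timeout;  // (2) new rule
        bool expiredBySend = nowUtc - msg.SentTimeUtc > timeout;        // old rule

        // (3) The old rule fires but the new one doesn't => likely clock skew; warn.
        if (expiredBySend && !expiredByArrival)
            Console.WriteLine("WARNING: message would have expired based on send time; " +
                              "check clock synchronization between silos and clients.");

        return expiredByArrival;
    }
}
```
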
gabikliot commented 8 years ago

Ohh, really? You run into that? Eventually? Again? :-)

Expiring only on arrival time will be safe, but you will expire fewer messages, and in the stress case with multi-hop grain calls you might be expiring much less than you otherwise could, running a higher risk of overload.

I remember I had a full design fleshed out for how to detect unsynced clocks. It was basically doing a handshake between any pair of silos (and client to gateway) upon first connection and validating that their clocks are synced. If not synced, it could alert, or fail to start one of them, or not use expiration. This handshake could later be extended to do more validation, for example to validate that all silos and clients are configured the same way (for settings that must match). We already have a place for such a handshake in the code. UPDATE: Look for this design in the old OneNote doc.
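
A rough sketch of what such a handshake check might look like (all names here are hypothetical, not existing Orleans code):

```csharp
using System;

// Hypothetical sketch of the handshake idea. Each side sends its DateTime.UtcNow
// during connection setup; the receiver compares it against its own clock and
// rejects the connection (or alerts, or disables expiration) on large skew.
static class ClockSkewHandshake
{
    public static void ValidateRemoteClock(
        DateTime remoteUtc, DateTime localUtcAtReceive, TimeSpan maxAllowedSkew)
    {
        // The apparent skew includes one-way network latency, so this is a coarse
        // sanity check rather than a precise measurement.
        TimeSpan apparentSkew = localUtcAtReceive - remoteUtc;
        if (apparentSkew.Duration() > maxAllowedSkew)
            throw new InvalidOperationException(
                $"Clock skew of {apparentSkew} exceeds allowed {maxAllowedSkew}; refusing to connect.");
    }
}
```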

Or alternatively, you could move to use a cloud provider that really knows how to sync its clocks. I can recommend one :-)

sergeybykov commented 8 years ago

Yep, a bunch of timeouts out of the blue at 4:30 in the morning. :-)

What's unclear to me is how likely it is for a message to be delayed before it is sent vs. in the queues on the destination silo. Sending queues are in theory independent of grain performance (other than sharing the CPUs with them). The receiving end seems much more sensitive to grains, especially non-reentrant ones, being slow for external reasons, e.g. their IO dependencies. Without any data to back it up, I suspect that the risk of performing wasteful work because of expiring based on arrival time is relatively small.

I have no idea what cloud provider you are talking about. I only know two of them. :-)

gabikliot commented 8 years ago

The real problem is multi-hop calls. For those you need the original first send time.

Yes, only 2 providers: GCP and AWS.

sergeybykov commented 8 years ago

By multi-hop you mean message forwarding, right? That is not subject to the activation queue, so messages should be forwarded relatively quickly compared to the case of a message getting into the queue of a slow activation with thousands of messages already ahead of it.

What's GCP? :-)

gabikliot commented 8 years ago

https://cloud.google.com/

sergeybykov commented 8 years ago

Oh, I forgot about the advertising company. :-)

rrector commented 8 years ago

Re: the design to detect unsynced clocks - note that in the recent incident where we encountered this problem, the clock was changed while the system was already up and running, so checking only on first connect would probably not handle this scenario.

gabikliot commented 8 years ago

Wow, that is a very different case! Can you please provide more details? Did the clock just jump forward? Backward? By how much?

This goes way beyond synchronizing clocks between different machines. This is about a clock on a single machine changing out of the blue. That would usually be considered faulty hardware, at least in my mind. Of course, we can deal with that as well, but how far should we go? We already deal in Orleans with misconfigured firewalls. What if tomorrow we find out the infrastructure changes the IP address under us out of the blue? Or just changes memory values? Would we start cross-validating all our memory? Do we really think it is the application framework's responsibility to work around hardware/infrastructure failures? More broadly, an arbitrary application simply cannot work around all cases of jumping clocks. The clock is a basic API.

I think demanding that the clock not jump on a server is a very fundamental and reasonable requirement of the cloud provider, and an application framework should be able to rely on it safely.

jesse-trana commented 8 years ago

Just out of curiosity, what does Orleans currently use as the time source? If it's using "wall clock" time... well, that's always bound to have problems.

From the embedded Linux world, I'm (unfortunately) used to clocks changing/being inaccurate, and the go-to there is CLOCK_MONOTONIC (as opposed to CLOCK_REALTIME) to handle this type of problem: use a theoretical timebase rather than real wall-clock time. It also helps avoid daylight saving time and leap-second adjustments, etc. Now that I'm back in the wonderful world of .NET, I haven't really looked into what the current standard for this sort of thing is; maybe there's something nice there.
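
For what it's worth, .NET's `Stopwatch` is backed by a monotonic high-resolution counter, so a minimal illustration of the wall-clock vs. monotonic distinction might look like this (just a sketch, not a claim about where Orleans should plug it in):

```csharp
using System;
using System.Diagnostics;

class MonotonicVsWallClock
{
    static void Main()
    {
        DateTime wallStart = DateTime.UtcNow;       // wall clock: jumps if NTP/admin adjusts it
        long monoStart = Stopwatch.GetTimestamp();  // monotonic: never jumps

        // ... do work; even if the system clock is set back 50 seconds here,
        // the monotonic measurement below is unaffected ...

        TimeSpan wallElapsed = DateTime.UtcNow - wallStart; // could even come out negative
        TimeSpan monoElapsed = TimeSpan.FromSeconds(
            (Stopwatch.GetTimestamp() - monoStart) / (double)Stopwatch.Frequency);

        Console.WriteLine($"wall: {wallElapsed}, monotonic: {monoElapsed}");
    }
}
```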

All that to say: sadly, I'm used to this being squarely an application framework responsibility, not some detail shuffled off to the OS. Just another perspective; I'm not sure of the best approach in .NET land.

gabikliot commented 8 years ago

We use DateTime.UtcNow. https://github.com/dotnet/orleans/blob/master/src/Orleans/Messaging/Message.cs#L290

Maybe we should use GetTickCount()?

jesse-trana commented 8 years ago

Huh, that's cool - I had not used that before. I think that's probably good, but use the 64-bit version so we don't hit the ~25-day wraparound. It's too bad Environment.TickCount isn't 64-bit, or we'd have a nice cross-platform way to do this. It'd be interesting to see how this fits into your clock-validation scheme - you'd need something to establish the initial tick timebase so that an "Orleans time" could be established.

gabikliot commented 8 years ago

I don't know how to use GetTickCount in our setting, as GetTickCount is relative and not absolute.

My suggestion was to keep using the absolute DateTime.UtcNow, validate it once upon connection, and hope/demand/rely on the server not to change the clock abruptly (without restarting the server).

jesse-trana commented 8 years ago

Right - you would have to establish a notion of shared time in order to make this work. Ultimately, each server has to be able to know something like "Well, X ticks on machine Y corresponds to Z ticks on my machine". To do that, I think there are multiple options:

  1. Establish a better shared notion of time such that, after some initialization, silos can convert ticks to "Orleans uptime" or something like that. If an absolute clock is not already built in, this option is way too much for the specific problem.
  2. Establish local time conversions per silo in a more local area of the code. In this particular case, as long as the source of the message is known, an easy way to get this would be to add the silo's tick count to the message in addition to the UTC time. That way, when another server sees it, it can establish an initial frame of reference relative to the UTC time of the first message it sees and then just rely on ticks the rest of the time. You still depend on dates the first go-around, but are then tolerant to future time changes. System reboots would somehow need to be handled too, though... (a sketch follows below).
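
A hypothetical sketch of option 2 (everything here is invented for illustration; `Stopwatch` stands in for the local monotonic tick source):

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch of option 2: on the first message from a given silo, pair the
// sender's UTC timestamp with the local monotonic tick count; after that, message
// age is computed purely from monotonic ticks, immune to later wall-clock jumps.
class RemoteClockFrame
{
    private readonly DateTime remoteUtcAtFirstMessage;
    private readonly long localTicksAtFirstMessage;

    public RemoteClockFrame(DateTime remoteSendUtcOfFirstMessage)
    {
        remoteUtcAtFirstMessage = remoteSendUtcOfFirstMessage;
        localTicksAtFirstMessage = Stopwatch.GetTimestamp();
    }

    // Age of a message stamped remoteSendUtc by the same sender. Still trusts the
    // clocks at the moment of first contact, but not afterwards.
    public TimeSpan AgeOf(DateTime remoteSendUtc)
    {
        TimeSpan elapsedSinceFirst = TimeSpan.FromSeconds(
            (Stopwatch.GetTimestamp() - localTicksAtFirstMessage) / (double)Stopwatch.Frequency);
        DateTime nowInRemoteFrame = remoteUtcAtFirstMessage + elapsedSinceFirst;
        return nowInRemoteFrame - remoteSendUtc;
    }
}
```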

...and maybe the hope/demand/rely option is a reasonable one. I've just written more than one "time-travelling" bug report before, and it's nice when they can be avoided.

gabikliot commented 8 years ago

Yep, that's a good improvement idea: start by syncing/validating global time and then use monotonic delta time. That would handle both cases. Are you aware of any distributed application framework (as opposed to a general-purpose database or lock/lease server) that does that?

jesse-trana commented 8 years ago

Good point: if I understand it, Orleans does its best to leverage underlying data stores for transactions, etc., where synchronized time (or some other notion of synchronization altogether, e.g. vector clocks) is critical. Perhaps many distributed application frameworks don't tackle this. And if we were down at that level, the time synchronization methods I suggested above would still be quite naive. I was just a little surprised to hear that there isn't something like Silo.GlobalTicks. So to put it a different way: is a synchronized global timebase just outside the scope of Orleans?

gabikliot commented 8 years ago

is a synchronized global timebase just outside the scope of Orleans?

I would not put it so categorically. No. Everything is open in theory. We did not need a notion of Silo.GlobalTicks anywhere until now. None of our protocols relies on global time, except for this "early-expiration optimization" that @sergeybykov described above.

My personal opinion (which might differ from others') is that we should not put effort into core Orleans to actively synchronize clocks. We can rely on global time being roughly synchronized for liveness properties (not for safety/correctness), BUT we should validate that this assumption indeed holds. The question now is how to validate: upon first connect only, or periodically all the time.

rrector commented 8 years ago

Wow, that is a very different case! Can you please provide more details? Did the clock just jump forward? Backward? By how much?

At 4:30 am, some percentage of all our instances were told by the central NTP server to adjust their clocks forward by 50 seconds. About 5 hours later, all of those same instances were instructed to set their clocks back by 50 seconds. So this was driven by some unknown, external issue with the NTP server.

This goes way beyond synchronizing clocks between different machines. This is about a clock on a single machine changing out of the blue. This would usually be considered a faulty hardware, at least in my mind.

The real-time clocks in PCs are not designed to be super accurate, and they do drift. We have our instances set to re-synchronize with NTP every 15 hours, and we frequently see adjustments happening in the 50-300 ms range. NTP is generally enabled on all Windows servers by default, so there should be an expectation that time can shift unexpectedly under normal operating conditions; this shift was just a big one.

Of course, we can deal with that as well, but the how far should we go? We already deal in Orleans with misconfigured firewalls. And what if tomorrow we find out the infrastructure changes the IP address under us out of the blue? Or maybe just changes memory values? Would we start cross validating all our memory? Do we really think it is application framework responsibility to work around hardware/infrastructure failures? More broadly, an arbitrary application simply cannot work around all cases of jumping clocks. Clock is a basic API.

Hyperbole much? Clock is a basic API, but it can jump around as indicated above, and that should be expected to some degree.

And I can concede that for operations that are ongoing when the adjustment happens, it can be very difficult to ensure that everything works as expected. However, since Orleans requires cross-machine times to be synchronized, you quickly run into a situation where no recovery is possible once times deviate beyond a certain point.

I think demanding that clock does not jump on a server is a very fundamental and reasonable requirement from the Cloud Provider and application framework should safely rely on that.

You're now putting forth contradictory requirements. Orleans requires that all silos and clients have "synchronized" time, but never defines what the allowed deviation is for optimal operation, so I'll just come up with a straw-man value: there should be no more than a 5-second difference between the slowest and fastest clocks involved.

With your "fundamental and reasonable requirement" that the clock never be adjusted on a running system, and the actual limits on clock quality, assuming even a very high quality crystal (which the Azure hardware likely doesn't have) with a frequency error of 4PPM, we could expect to see individual clocks drift by aprox 1 second every three days. The delta between the fastest and slowest clocks can therefore be expected to increase by up to two seconds every three days. In less than 10 days without adjusting (or "jumpping" as you put it) the clocks on the individual machines using something like NTP, you would be outside my straw man time. Using the standard 30 second timeout value, after 45 days you would be in a situation where all Orleans messages from the slowest clock computer to the fastest clock computer would be immediately dropped when received because it would appear to already have existed for 30 seconds.

I really don't think that you want "a clock that doesn't jump" at all. We all want something like NTP that keeps clocks synchronized to some level of precision.

I would also agree that Orleans should not be in the business of directly synchronizing computer clocks, nor of providing some API that gives a synchronized "silo time".

However, I do think that Orleans has "solved" the relatively simple problem of "how long has a message been active?" by passing along to the users of Orleans the actually fairly difficult requirement that all clocks be synchronized across all machines.

If each message had a msAliveBeforeCurrentMachine field, along with a tickArrivedOnCurrentMachine field, then the active time of the message is msAliveBeforeCurrentMachine + (currentTick - tickArrivedOnCurrentMachine) / ticksPerMs. You just need to be able to update msAliveBeforeCurrentMachine right before the message is forwarded to a different computer, and tickArrivedOnCurrentMachine when a message is received. Granted, you lose any time spent on the wire - I don't know how critical that is to the calculation.
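
In code, the mechanism might look roughly like this (a hypothetical sketch; the names mirror the proposal above, and `Stopwatch` stands in for the per-machine monotonic tick counter):

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch of the scheme above; these are not real Orleans fields.
class TimedMessage
{
    public double MsAliveBeforeCurrentMachine; // age accumulated on previous machines
    public long TickArrivedOnCurrentMachine;   // local monotonic ticks at receipt

    public void OnArrived() =>
        TickArrivedOnCurrentMachine = Stopwatch.GetTimestamp();

    // Called right before forwarding: fold the time spent here into the total,
    // so the next machine only needs its own tick counter. Wire time is lost.
    public void OnForwarding() =>
        MsAliveBeforeCurrentMachine += ElapsedHereMs();

    // activeTime = msAliveBeforeCurrentMachine + (currentTick - tickArrived) / ticksPerMs
    public double ActiveTimeMs() =>
        MsAliveBeforeCurrentMachine + ElapsedHereMs();

    private double ElapsedHereMs() =>
        (Stopwatch.GetTimestamp() - TickArrivedOnCurrentMachine) * 1000.0 / Stopwatch.Frequency;
}
```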

Again, the biggest issue with the current mechanism of message expiration is that all messages between certain computers will be dropped once the clock skew between them reaches a certain level. There is no mechanism for them to recover from this situation, and the only indication that you are in this state is the generic "message timed out" exception, which can occur for a wide number of reasons. Even though this may happen infrequently, it is debilitating to applications running on top of Orleans, and it can take significant periods of time to figure out what is really going on.

gabikliot commented 8 years ago

Thank you for detailed response @rrector!

Of course I did not mean that the clock can never be allowed to change at all. Of course the server needs to re-adjust its clock periodically with NTP servers, smoothly. What I meant is that those adjustments should not be big. As you wrote, 10s to 100s of milliseconds periodically is OK. 50 seconds is definitely too much and is clearly abnormal (in my view). That is "a jump".

In case it was not clear: I was NOT suggesting relying on clocks between machines being synced. I suggested validating, upon first connection between each pair of machines, that the skew is within the limit we can deal with, and only proceeding if that holds.

The question now is what we do with big jumps while the server is running. The answer depends on your tolerance for this abnormal 50-second jump. You either accept it as valid infrastructure behavior and deal with it in your system, or you consider it abnormal behavior, in which case your system is allowed to stall/stop responding/violate liveness (it should never violate safety, which holds in our case). So the answer clearly depends on how much you trust your infrastructure. I understand that in your case you don't trust it much not to "jump" the clock 50 seconds up and down at will. Therefore, we need to work around it, just like we worked around faulty firewalls.

I did not understand your proposed algorithm with msAliveBeforeCurrentMachine and tickArrivedOnCurrentMachine. Could you please explain with more details?

veikkoeeva commented 8 years ago

@gabikliot, @jesse-trana, @rrector Now that I remember this again, since Yevhen tweeted a recent video, this might give nice ideas for the discussion: How to Have Your Causality and Wall Clocks Too. Maybe a bit unrelated, but the idea of taking a median between servers using population protocols looks like a smart one to me.

GitHub repository: https://github.com/Comcast/dmc-sim.

veikkoeeva commented 8 years ago

Some more thoughts for casual readers: robustness with regard to this feature might be surprisingly important for a large, long-running deployment. I tried to find out how the clocks are synchronized and found only rather anecdotal blog posts.

This goes slightly off-topic, but as security is an ever-present topic these days, here is more context on deliberately caused time drift. It might be possible to make an Orleans deployment malfunction badly enough that it becomes virtually unusable, via specifically crafted time synchronization messages. Also relevant now that Orleans is heading towards cross-platform: there was a recent NTPD vulnerability which, from what I understand, has some very dire consequences.

mjyeaney commented 8 years ago

Been thinking about this for a bit, and it is indeed a troublesome issue on many fronts. Clocks drift as a physical fact, and NTP implementations can lie (faulty upstream servers, correction jumps, etc.). Azure time synchronization is (to date) prone to curious drifts/jumps, and as @veikkoeeva has mentioned, there are significant security implications of clock synchronization failures (either accidental or malicious).

Offset detection methods would catch some of these issues, but unless you completely decouple from physical time and adopt logical time (as mentioned above), various forms of this issue will persist. However, this would be difficult from a usability perspective on the API - specifying timeouts in terms of minutes/seconds is very natural, and changing that could be disruptive.

Since callers/clients are part of any distributed system, an alternative idea/discussion point is to make them active participants in determining timeouts. While this has obvious shortcomings if a client vanishes/crashes during this time, it may solve some more common cases.

NOTE: I'm inventing terms that likely don't exist, as I'm not super-familiar with Orleans internals. Instead, I'm just thinking about this problem from the "outside looking in", and therefore don't intend this idea to be prescriptive. Rather, just a conversation point.

The basic workflow may look like this (note clients/callers may themselves be Grains):

  1. Client/caller sends message to a target Grain (as normal).
  2. If the client (or configuration) has enabled message timeouts, the client/caller schedules a background Task to send another message after the timeout interval passes. This type of message should use a dedicated communication channel/inbox to the target Grain (essentially a priority queue/channel).
  3. Grains should always check for these high-priority Timeout messages first, before reading the next message from the inbound channel, and of course track which messages are to be dropped, etc. (This assumes the above-mentioned dedicated channel for these notifications; it may be useful for other system events.)
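
A very rough sketch of the dual-channel check in step 3 (all of these types are invented for illustration; nothing here is Orleans code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical sketch: cancellation/timeout notices arrive on a small priority
// channel that the grain drains before taking the next regular message.
class DualChannelInbox
{
    private readonly ConcurrentQueue<Guid> timeoutNotices = new ConcurrentQueue<Guid>();
    private readonly ConcurrentQueue<(Guid Id, string Body)> messages =
        new ConcurrentQueue<(Guid Id, string Body)>();
    private readonly HashSet<Guid> cancelled = new HashSet<Guid>();

    public void EnqueueMessage(Guid id, string body) => messages.Enqueue((id, body));
    public void EnqueueTimeoutNotice(Guid id) => timeoutNotices.Enqueue(id);

    public bool TryDequeue(out (Guid Id, string Body) msg)
    {
        // Step 3: drain high-priority timeout notices first...
        while (timeoutNotices.TryDequeue(out var id))
            cancelled.Add(id);

        // ...then skip any regular message whose caller has already given up.
        while (messages.TryDequeue(out msg))
            if (!cancelled.Remove(msg.Id))
                return true;

        return false;
    }
}
```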

As mentioned above, this doesn't work for all cases and has inherent complexity in managing dual channels per actor (although still single-threaded). However, it does decouple Grains from each other's clocks, while still enabling old work to be dropped if the client has waited too long (so long as the client was able to send a 'Cancel' message), preserving client retry operations, etc.

There are also complications with removing the scheduled background task (#2 above) upon completion, handling race scenarios, etc. - no small task.

Finally, this potentially has increased server-to-server communication overhead; it's not clear to me whether the messaging involved would be better, worse, or equivalent compared to a polling-type solution between hosts.

Feel free to discuss / dismiss / modify - again, I'm viewing this as a distributed system of actors that needs to enable message timeout/expiration cases, so this may be impossible within the Orleans internals.

sergeybykov commented 7 years ago

Resolved via #2922.