Improved handling of unexpected disconnects

gizmotronic commented 11 years ago

This change builds on @DubbyTT's previous work, adding an event trigger for the case where Turntable is down (responding with an 302 error instead of JSON), for a name lookup or other connection issue in whichServer, and for the case where a firewall/proxy issue causes the WebSocket to remain up, but we are nevertheless not seeing valid server responses.

This change also renames the "wserror" event to "error" for two reasons:

Is it a WebSocket error, or a whichServer error?
It's consistent with the Events class documentation.

(I'm deliberately not including a compiled version in this request. Please either use a CoffeeScript test harness or compile the source using the project Makefile.)

technobly commented 11 years ago

Interestingly enough, a "WebSocket error, and a whichServer error" are both 'wserror'. If they both cause the bot to disconnect from the room, why change it? Event documentation doesn't say anything about how you should name your messages, that's up to you.

However, once you add the proxy thing into the mix, I kind of think 'error' and 'wserror' are not descriptive enough anymore. Now to make it intuitive you should probably emit a 'disconnected' message instead for all three cases. But if you do that, does it make sense to emit that from the .onError() function anymore? I don't think so.

If you know certain errors cause a disconnect, then we should probably just log the error, and call .onDisconnected() which emits a 'disconnected' message.

gizmotronic commented 11 years ago

No, seriously:

When an EventEmitter instance experiences an error, the typical action is to emit an 'error' event. Error events are treated as a special case in node. If there is no listener for it, then the default action is to print a stack trace and exit the program.

I see zero value in propagating the number of events we emit that all mean the same thing: "you're not talking to Turntable, pal." If you truly care about the precise reason for this you simply inspect the Error object and behave appropriately.

Thanks for pulling out the onError() straw man. Why didn't you call it onWsError()? Does it make sense to emit a wserror event as a result of handling an onerror callback, which itself had also seperately emitted a wserror event? (This is in fact all nonsense, and I have no idea why we're using callbacks instead of handling the events that the WebSocket object emits. It's certainly not a performance penalty, no matter what your intuition might be telling you.)

I'd love to hear what Alain thinks about this change, in any event.

technobly commented 11 years ago

"If there is no listener for it, then the default action is to print a stack trace and exit the program."

Bad idea... as soon as you start emitting 'error' you are going to crash everyone's bots that are not "handling" this message, which they won't be because it's not obvious to do. You'd have to update ALL of the examples to add this bot.on('error', function() { } ); handler.

I named it onError() to mimic the way the websocket handlers were done, just like Alain did... but then I felt it was more appropriate to call it exactly what it was at the bot level... a websocket error 'wserror'.

gizmotronic commented 11 years ago

Of course it'll crash. That's the point. Do you somehow not understand that's already the lazy coder's auto-reconnect? I don't know anybody that writes perfect code that never crashes. I know a lot of people using a variety of methods to keep their bots online, regardless. This is hardly as bad as you make it out to be, and in fact is more than likely to be a boon to many.

technobly commented 11 years ago

console.log("At this point you're acting counter-productively, pal...");
process.exit(1); //ragequit

gizmotronic commented 11 years ago

First, I'm sorry. My frustration is no excuse for implying that you're stupid. I was wrong.

I would appreciate it if you would assume that this is a very carefully considered and tested proposal, because that's exactly what it is. I've been working on it on and off since a couple of weeks before you submitted your pull request. I get that you don't like this change - and that's okay, really - but this should be an opportunity for a thoughtful discussion of the merits.

Izzmo commented 11 years ago

So, this brings up an interesting point. @DubbyTT feels like we should be catching errors and then handling them appropriately within the framework, but so does @gizmotronic, in a different sense, from multiple angles.

I don't know if this is really what we should be doing, honestly. If the script disconnects in any way, and throws an error, this is obviously the right thing to do. If the person using the framework is not handling it, let it error out. In most cases, people's scripts are useless when not connected anyways.

Now, if we are arguing that we should auto-reconnect the bot on this error, then that's a discussion that can have many outcomes. Should we really be handling this on our end? Different bot creators may want their bots to do different things when their bot is disconnected. What if the bot is permanently banned from turntable.fm? It will be continuous loop of connecting and disconnecting. We don't want this either, and possibly some logic could be implemented to prevent this.

Either way, the best part about this framework is it's simplicity in just redirecting WebSocket events through an easy-to-follow API so people can do with the calls what they want. The more we try and add to this the less versatile it becomes. The only real thing I would suggest doing is implementing functions that bot creators CAN use if they feel like it, but not make it mandatory.

Discuss.

gizmotronic commented 11 years ago

I think you have it right when you say that the best part is that the framework just redirects (or maybe "reflects") WebSocket events through the API. That's quite explicitly my goal in emitting events to expose the error conditions in whichServer. The fail-safe timeout is likewise designed to detect an unusual, but fatal, condition and emit an event when that happens.

Connection problems are a special case, in general. It's one thing for most of the API to fail and return that result via the callback. It's another thing entirely when the connection drops, since the bot won't do anything meaningful after that. This is why I advocate throwing an 'error' event so that either the bot stops running - it's dead anyway - or the bot creator handles (and preferably responds to) the event.

I actually did spend a bunch of time doing reconnects right in the library. For all of the reasons you mention, as well as the added complexity of a mechanism to ensure that there's only one connection attempt at a time (this turns out to be a bit trickier than it seems), I think you've got it exactly right. The bot creator should be given the control.

Izzmo commented 11 years ago

So I just took a look myself at your changes. Unfortunately, you don't have any backwards compatibility in place. Anyone who might be using the wserror exception will no longer get this and code will have to be updated. When dealing with API's, it's always good to have backward compatibility, at least for a decent amount of time, instead of an abrupt change.

I think your changes would be OK if you included schemes for both wserror and error alike. @DubbyTT said basically the same thing.

gizmotronic commented 11 years ago

I was hoping to avoid this given that it's a very recent change, but point taken. Thanks for looking at it.

Izzmo commented 11 years ago

Understandable, but it is a pretty big deal :smile:

gizmotronic commented 11 years ago

Because this does change the API in a way that's not backwards compatible, and because there is no way to bridge the gap in the interim, this change should not be merged as part of a point release. If you agree that emitting an 'error' event is more correct (which I readily admit is a primary point of contention), then there will be a flag day situation if and when that happens. It would be better to have it sooner than later in that case. This minimizes the number of people who would be affected.

Izzmo commented 11 years ago

Seems fine to me. Unless people pull this copy from here and copy it to there npm directory, it won't get in there until Alain publishes to npm. So, I'm fine with it: the throwing of error instead of wserror, that is.

samuri51 commented 11 years ago

so the only real difference is that error will now be thrown instead of wserror? does this remove the 'alive' event as well? I personally am against the bot trying to do anything other than give the developer events and functions to work with. like in my bot personally i try to pair having the bot detect whether its physically in the correct room or not with the wserror event (if the bot gets kicked to the lobby or something its still technically connected to turntable). i need to know when that event is thrown because without an internet connection my other method of detection does nothing. and i don't wanna have clashes going on between them etc.

samuri51 commented 11 years ago

i'm also not a fan of it stopping trying to reconnect after a while. that should be optional if implemented. what if someone's bot get's kicked from the room twice and is banned for 15 minutes or something but they themselves are not the runner of their bot? if its allowed to try to reconnect the bot will be back in 15 minutes... otherwise you'll have to wait till you can get in contact with the bot runner etc.

gizmotronic commented 11 years ago

The error event would be emitted instead of the wserror event, but for connection errors in addition to just the WebSocket errors. For example:

DNS lookup failures (usually a local network outage)
302 responses from the server (usually Turntable maintenance outages)
A two minute period where we haven't seen a valid response to the presence update pings we send every 10 seconds

I did not remove the alive event. I don't think it's strictly necessary, but there's also nothing wrong with it, and I don't see any reason to remove it.

Reconnect attempts aren't done within the framework. It's up to the bot creator to handle disconnections and take action.

technobly commented 11 years ago

@Izzmo @samuri51 @alaingilbert @gizmotronic I'm going to outline this again so everyone knows what's good and bad about these changes:

changing the 'wserror' emit to 'error' should not be done. 'error' is a special case handled by NodeJS, if the 'error' message is not handled in your bot with something like bot.on('error', function(data) { console.log(data); } ); your bot script will crash when any error message is received. Right now the 'wserror' is specifically used for disconnects. 'error' does not imply it's disconnected, and a generic error handler should not be the reason you try to make your bot reconnect. You should try to reconnect when you get a message such as 'disconnected'. Naming the error message 'disconnected' will not cause your bot to crash if you don't handle the message, it just won't reconnect to turntable.
all of the errors that cause the bot to disconnect should end up emitting a 'disconnected' event. Currently there are three known causes: 1) Websocket error 2) whichServer error 3) Inactivity (transparent proxy issue)
when the 'disconnected' event is emitted, a warning should be logged to the console that states something like "DISCONNECTED FROM TURNTABLE ROOM" regardless of the state of bot.DEBUG. This will help users troubleshoot what is going on if they haven't added any autoreconnect code to their bot script.
the proposed changes just make the TTAPI emit messages ... nothing is done in the TTAPI automatically to reconnect the bot. You still have to write some code to detect the message, and call bot.roomRegister(ROOMID);
the 'alive' message is still in the bot.

gizmotronic commented 11 years ago

So, at its core, this is the disagreement (I think):

@DubbyTT says it's bad for the bot to crash when it's disconnected from Turntable.
I say it's bad for the bot to keep running when it's disconnected from Turntable (unless the bot creator handles the error event).

My reasoning is simple. Many people are already using external mechanisms to restart their bot when it crashes unexpectedly. Throwing an error means that these bots would reconnect without any additional work.

technobly commented 11 years ago

I would argue that most people that run turntable bots do not use process monitors. It is not an intuitive thing to do. When I got one up and running, the process monitor itself would crash.

If we have a mechanism for reconnecting the bot (which we do), let's promote that! Not let people's bots crash and have them wonder what went wrong. Will they even come here to ask how to fix it? Do we want them coming here and asking "why is my bot crashing?"

When you restart the process in that way to reconnect the bot, the state of it's memory is reset. There is no telling what affect that has on the bot. For example, say you have a queue loaded... but no way to reload it on boot. If you just reconnect, everything is a-ok still running.

gizmotronic commented 11 years ago

I wonder how people handle unexpected crashes currently. At least some hosting providers give you a process monitor whether you want it or not. But, even if you run it yourself it's simple. When I ran my bot at home I used a batch file that looped unless it got an exit code that meant "no rly, just shut down". Am I weird for thinking that this is not a hard problem to solve?

Izzmo commented 11 years ago

I will just say this: NPM packages should never crash. (npm says it right on their website when publishing and making packages.) With that in mind, then we should not ever throw an unhandled exception, and therefor should not stray from using the wserror, because it does provide some insight into what is happening: an error with the websocket, most commonly the disconnect.

MikeWills commented 11 years ago

I'll be honest I haven't read the whole issue. When I was an inexperienced bot dev, if my bot crashed because of something in the API, I would have submitted that as a bug. I agree with @Izzmo, NPM packages should not cause a script to crash. They should be sent back to the script for the script to handle if the developer feels it's needed. Otherwise it should be able to be safely ignored and not affect the script as a whole.

gizmotronic commented 11 years ago

I'm not sure where that rule came from. Packages like redis, which is fundamental to a lot of applications that use a key-value store, emit error events that will cause a crash if left unhandled.

[edited]

If redis doesn't mean anything to you, how about the mysql package? or mongodb? These do the same thing.

Izzmo commented 11 years ago

Well, let's be clear, there is a difference between throwing an exception and gracefully handling it and then throwing an exception because the application cannot move forward and it must crash. I assume this is what you are talking about when speaking on REDIS and other large frameworks.

gizmotronic commented 11 years ago

The redis client emits an error event when there is a connection error. It does not throw exceptions. You are able to intercept the error event and recover in any way you see fit.

Similarly, this pull request includes code to emit an error event when there is a connection error of any kind. It does not throw exceptions. You are able to intercept the error event and recover in any way you see fit.

I have probably been guilty of saying "throws an error" when I meant "emits an error event" a few times. They are not the same thing and I have been trying to be careful to say exactly what I mean. I'll say it again: an error will be thrown only if the bot creator doesn't include an error event handler.

In case you're wondering, by the way, the appropriate response to a truly unrecoverable error is to either throw an exception or to simply exit. There's no point in emitting an error event if you can't recover from the condition.

technobly commented 11 years ago

Whether some other package do it that way or NPM says you shouldn't, think about the impact of doing it here...

It's not intuitive to most coders that use this API ... even amongst us that contribute to the TTAPI, let alone the likes of many inexperienced javascript coders.

And we have a way to handle the disconnects... let's call it what it is and not confuse anyone further. It's not an error, it's a disconnection. When we call it what it is, it's easy to talk about and explain and most importantly it doesn't automatically throw an exception and crash your app if you do not handle it.

technobly commented 11 years ago

Please also look at the first reply of this thread, we keep saying the same thing over and over...

Let's bring it to a vote:

OPTION 1: Rename 'wserror' to 'error', thereby emitting 'error' messages when we are disconnected from turntable for various reasons. If you don't handle this 'error' message in your bot it will crash.
OPTION 2: Rename 'wserror' to 'disconnected', thereby emitting 'disconnected' messages when we are disconnected from turntable for various reasons. If you don't handle this 'disconnected' message in your bot, it won't crash.

I vote OPTION 2.

gizmotronic commented 11 years ago

What you say my code does is not what my code does.

I'm sorry if this offends you but I'm not concerned with how inexperienced coders might view this. Do it the right way and give experience to inexperienced coders. We could easily update the docs to say "handle this event". That's already the case for the ready event, though it wasn't even documented until yesterday.

Redis is hardly the only example of how to do it the right way. Spend 30 minutes looking for examples and you'll come up with dozens of them. It is unwise to create our own way to tell the user about errors when other well-known frameworks have already established one.

Ask yourself what happens with a TTAPI-based bot when there's any sort of interruption right now. If it doesn't handle the wserror event, what does it do? Nothing. At all. It just sits there until you kill it. Please don't complain about having to handle an event. You currently have to handle an event to do anything meaningful, anyway. At least in the case of the error event something happens that you can throw into Google to search for.

samuri51 commented 11 years ago

the majority of bot owners are users not creators. and yes you have to take them into consideration. i vote for @DubbyTT 's option 2, emit an error message but do not crash. javascript is one of the only languages i've seen where extremely general catch all error message's like this would even be considered. in java you have a vast hierarchy of error messages that tell you exactly what will happen as your programming the thing. personally i think just calling it "error" is unprofessional and tells you absolutely nothing.

gizmotronic commented 11 years ago

The error event name is not the error event message. The error event message is passed to the error event handler (in the Error object).

I've said this in one way or another repeatedly but keep hearing the same objection. Please stop insisting that my code does something that it does not do.

technobly commented 11 years ago

Well I've tried to break it down very simply... that IS what your proposed change will do, you've even said it yourself.

Beyond that, I've already said we need to force a log when emitting 'disconnected' to the effect of something like "DISCONNECTED FROM TURNTABLE".. to give them hints as to what they should do to handle it, if they are not already. If they are... no harm to your logs really.

I really hate when people shove something down your throat because that's the INDUSTRY STANDARD WAY of doing it... or THAT'S THE WAY IT'S ALWAYS BEEN DONE... well I have news for you, STANDARDS CHANGE EVERY SECOND and if you don't use your brain you will always be FOLLOWING and not LEADING.

Please cast your vote so we can settle this once and for all.

technobly commented 11 years ago

And I think you mean to say it is an event named "error".

I've been calling it a message, because that's how I think of it and being transmitted from the TTAPI to your bot code. You are correct that it is an event

It still crashes the bot though if you don't provide a listener for that event named "error".

alaingilbert commented 11 years ago

I'm looking at the whole topic right now... But I don't see why it would crash the bot if you name the event error instead of wserror !(?)

gizmotronic commented 11 years ago

@DubbyTT, I'd rather you just say "I don't like this change" instead of making appeals to emotion.

@alaingilbert, it's a special case of the events framework. If you don't handle the error event, an exception is thrown. I deliberately chose this because

the current situation means you'll never know if you were disconnected, unless you write an event handler, and
I see lots of bots that restart automatically whenever they crash (out of memory, connection error, coding error, etc.)

technobly commented 11 years ago

@alaingilbert http://nodejs.org/api/events.html#events_class_events_eventemitter

When an EventEmitter instance experiences an error, the typical action is to emit an 'error' event. Error events are treated as a special case in node. If there is no listener for it, then the default action is to print a stack trace and exit the program.

@gizmotronic not understanding you again. I don't like this change because of the way it functions.

alaingilbert commented 11 years ago

Honestly, I think I prefer when the bot crash. For example, at the moment, if the whichServer function fail, we only receive this message Failed to determine which server to use: then the bot continue running. According to me it should have crashed. (If the debug flag is set to false, we see nothing at all...)

gizmotronic commented 11 years ago

FYI, to get the error message:

bot.on('error', function (err) {
  console.log(err.message);
});

The err parameter is always a JavaScript standard Error object.

technobly commented 11 years ago

@alaingilbert the reason I don't like the bot to simply crash is because we have a way to prevent that, the autoreconnect.js example shows this. If we call the error "error", then you have to drill into the messages and try to do some matching crap and if it matches one of three different error messages then you would try to reconnect.

If we call it 'disconnected' and log the message regardless of the debug flag, it would be the best case... you get a nice message telling you the bot disconnected. If you come to the TTAPI docs and look for that, you'll see that you should be handling the .on('disconnected', function(data) { } ); event, with a nice example to go with it.

If some time in the future you start collecting things that really are errors we can't do anything about, then go ahead and use the 'error' event for that ... but right now we are talking about three things that are all "disconnected" events that we know how to handle... so why would you want to crash the bot for that?

gizmotronic commented 11 years ago

@DubbyTT, I am proposing emitting an error event only in the case where there's a connection error. This is a common convention. You do not need to drill into it unless you want to. The intent is that the appropriate response to an error event would be to reconnect.

It's misleading to call it "disconnected" because at least in some cases where you'd want to retry the connection, you may have never been connected in the first place (specifically errors on the HTTP request in whichServer). But, as I've said before, the end result is the same: it's a reason to retry the connection.

MikeWills commented 11 years ago

I would rather have option 2. The reason? I can handle disconnects separately from other errors. Unless, the error event tells me it is a disconnect/not connected/banned error. Then I don't give a fuck.

gizmotronic commented 11 years ago

You will get a booted_user event when you're removed/booted/banned/whatever from the room. It has nothing to do with this discussion.

technobly commented 11 years ago

Wether or not you were ever connected, if you are not connected... 'disconnected' is a state... and it implies NOT CONNECTED :)

It seems almost wreckless to bottle up all connection style errors into the 'error' event. How exactly is it implied that you have a connection error?

alaingilbert commented 11 years ago

I think (I'm really unsure) that the following code would be best:

@emit 'error', new TurntableError 'Error parsing server response'
@emit 'error', new HttpError "whichServer error: #{e}"
@emit 'error', new TTApiError 'No response from server'

So it would be possible to make a distinction between each errors. ex:

bot.on('error', function(err) {
  if (err instanceof TTApiError) {
    ...
  }
});

We could also make a distinct error for each ttapi errors. (TTApiError100, TTApiError101...)

I am pretty sure that it is a good thing for the bot to crash when it lost the connection for whatever reason. And then we should be able to make something to fix the situation (catch the error).

@DubbyTT The example you made to reconnect the bot is cool, but we shouldn't force everyone to implement it. And at the very moment, if they don't implement it, and the bot crash (without crashing) the bot will stay idle forever ! (as I understand).

Your thoughts ?

technobly commented 11 years ago

All of those errors are going to result in disconnections... so to get the bot reconnected you need to do something like this:

bot.on('error', function(err) {
  if (err instanceof TurntableError || err instanceof HttpError || err instanceof TTApiError) {
    // run your reconnect code
  }
});

as the TTAPI gets updated and new reasons for disconnects are found, more errors will be added each time requiring the bot creators to figure out what the new errors are.

My proposal is still to @emit 'disconnected' for disconnect issues.

bot.on('disconnected', function(data) {
  // run your reconnect code
});

if the TTAPI adds new reasons to emit 'disconnected' events, everyone's bot code is futureproof.

Process sitting idle is not actually a bad thing in my opinion... unless you have run your process with a way to pause it on crashing, the window will close (on windows) and you won't even know why. And if you log that the bot was 'DISCONNECTED FROM TURNTABLE' it will be pretty apparent what happened.

We don't have to force anyone to implement my autoreconnect code, only suggest it somewhere in the documentation.

Emitting 'error' forces everyone to implement the .on('error', function(data) { } ); handler though... if they don't want their bot crashing unexpectedly.

alaingilbert commented 11 years ago

«Emitting 'error' forces everyone to implement the .on('error', function(data) { } ); handler though... if they don't want their bot crashing unexpectedly.» I understand this.

But not doing it, means everyone's bot are running for nothing. (idle forever)

gizmotronic commented 11 years ago

@alaingilbert yes, custom error objects would be a huge further improvement. I had originally done something like this but rejected it because the only way I see to do it involves adding and exporting new classes. (The TTApiError object would have to exist in the user's namespace, too.)

I can resurrect my changes if you like, and push them into this request.

technobly commented 11 years ago

Assuming you have forgotten that you have a bot running, up in the cloud let's say, most cloud based nodejs websites will terminate your process if it is idle for a period of time.

If you are actually USING your bot, it's going to be apparent pretty quickly that it's not running anymore, and you'll go check and see a log message that it was "DISCONNECTED FROM TURNTABLE".

Even if you implement the 'error' listenter in your bot code, the bot can hang idle if you don't stay up to date with what all of the errors are that you should be trying to reconnect on.

One way to do the 'disconnected' event AND make the bot crash ... is add another test case to that inactivity timer that says if the inactivity is greater than 6 hours (give your internet, power or turntable some time to come back online), then you can process.exit(1); right there.

Edit: That's all I have to say about this... have fun with whatever you come up with...

alaingilbert commented 11 years ago

@DubbyTT Let say you run 20-30 bots at the same time (ttdashboard / ttstats...) You can't know for sure that they are working or not.

MikeWills commented 11 years ago

Okay... let's all take a step back for a minute. Get your heads out of the details of HOW it works and look at HOW a beginner might come into this.

bot.on('disconnected', function(data) {
  // run your reconnect code
});

Is self-documenting. When looking through the API documentation and I see that, I will make a note to myself to look into that more. It would appear like I would need that if my bot disconnections for some reason.

Now let's look at the error event:

bot.on('error', function(err) {
  if (err instanceof TTApiError) {
    ...
  }
});

When I glance at that, I think that is handling various exceptions that might occur out of the API's control. It might be disconnects, but more so I would NOT be think about disconnects. In fact, I would assume that is handled automatically and not add it in my bot... if I wasn't reading the API docs in detail.

I know all about digging into the details of how it works, but sometimes you need to step back and look at it from another perspective and think about usability.

I don't have a bone in this other than being able to detect when a bot of mine is offline and bring it back online easily. That is the ultimate goal right?

alaingilbert commented 11 years ago

I completely agree with you @MikeWills .

Except that every error received make the reconnection needed.

bot.on('disconnected', function(data) {
  // run your reconnect code
});

vs

bot.on('error', function(data) {
  // run your reconnect code
});

alaingilbert / Turntable-API

Improved handling of unexpected disconnects #228