crash when sending multiple commands in rapid succession

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
telnet to server
paste 2 lines:
  login wiznname password
  @set name on void to newname

What is the expected output? What do you see instead?
normal login, plus message about name of room being changed.

Instead get:
I am having trouble with that command.
Try "@help set"
Debug
'NoneType' object has no attribute 'name'
You have been teleported.
void
You look at the void.

And in the log:
player wiznname logged in
Traceback (most recent call last):
  File ".../src/tzprotocol.py", line 412, in dispatch
    func(self, rest)
  File ".../src/wizard.py", line 112, in cmd_set
    obj = find(r, s.room, s.player, s.room)
  File ".../src/share.py", line 1120, in find
    if objname == room.name:
AttributeError: 'NoneType' object has no attribute 'name'

Original issue reported on code.google.com by miss...@hotmail.com on 30 Dec 2010 at 4:03

GoogleCodeExporter commented 9 years ago

r446 prevents the specific race condition mentioned here.

When a new line comes in before the login process has completed, the new line 
is held for a bit, then sent back through again.

I'm not going to close this issue yet, as I suspect other race conditions may 
be lurking. I want to see if I can find an example.

I'm not sure it will be possible to trigger the problem by simply pasting long 
strings of commands. I think this fix will stop that from being a problem. 
However, I think that with certain delays between commands there may be 
problems.

I may need to create a test client that can send strings of commands at 
specified intervals to trigger the problem.

(A test client that can send scripts of commands and check for expected output 
would be good anyhow, so maybe testing for race conditions could be part of a 
more general test setup).

Original comment by miss...@hotmail.com on 31 Dec 2010 at 5:39

Changed state: Started

GoogleCodeExporter commented 9 years ago

Thanks, the revision seems to stop all the issues I was having.

Here's an example where you see something that should be atomic but isn't: even 
if everything works quite right, the output doesn't appear in order...

------ paste this, quickly: ------
@dig a to templo
.a
@destroy door
..................................

Expected output:
------
You dig a to templo.
templo
You speak a word of power and ...
------

What we get:
------
You dig a to templo.
templo
You speak a word of power and ...            
A new exit door appears.
------

The message "A new exit door appears" is being sent to those who are in room 
"templo" which is correct, but it should appear when the @dig is done. Since 
I've already used that path to go to templo (.a), when I arrive there the door 
should be there already, so I shouldn't arrive in that room in time to see the 
message...

Original comment by marcos.m...@optimus.pt on 31 Dec 2010 at 10:52

GoogleCodeExporter commented 9 years ago

OK, I looked at the code and found out who's the culprit. This is here "by 
design", but I don't understand why, and since this code exists since r2, I 
don't have a way to get a clue about your intentions... Hoping you might help 
me here ;-)

At rooms.py, the action function waits (asyncronously) 0.1 before actually 
doing its work. This is on purpose, and documented:

"
        Uses a callLater to make sure that reactions to the action occur
            after the action has been reported.
"

What I don't understand is how can you achieve this by having a callLater, on 
the contrary: if you want to have reactions to the action occur after the 
action has been reported, then the operation you're doing should react first, 
report after, right? In this case, the action is "report an action" (say to 
everyone near this room that a new exit door appeared), but if you only say it 
sometime *after* the door creation, instead of right away, then you aren't 
making sure that reactions only occur after the the action was reported, in 
fact you're opening window to let reactions actually occuring before the action 
being done...

Anyway, my temp fix is taking that async timer out of the way, but I'm sure 
that you have a good reason to have that timer there, so maybe the correct fix 
is to have the possibility of flagging room.action to tell if we want it sync 
or async...

Original comment by mindboos...@gmail.com on 8 Nov 2011 at 7:53

GoogleCodeExporter commented 9 years ago

Another action that isn't as atomic as it should: purge.

1) Create a user "a" with password "a";
2) login a a
3) in another terminal, connect to the MUD and paste this in the terminal:

purge a a
login a a

When "purge a a" ends executing the user "a" should be completely purged, but 
it isn't, so when "login a a" happens, we still have an "a"... the result:

"
Connection purged.
Log in with "login <name> <password>"
You feel yourself being ripped away from where you are...
You have been teleported.
void
You do not see that here.
"

After this, you won't be able to type anything (since in the meantime the purge 
really ends its effects).

From the log:
2011-11-09 19:53:36+0000 [TZ,1,127.0.0.1] player a logged in
2011-11-09 19:53:36+0000 [TZ,0,127.0.0.1] Lost a client!

As you can see, the "Lost a client!" only happens *after* the "player a logged 
in"...

Original comment by mindboos...@gmail.com on 9 Nov 2011 at 7:57

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Here's a patch fixing this:
https://github.com/marado/tzmud/commit/4825c6620675ffecdc3dba599384ae3d4152de23

Twisted won't let us have a sync _purge, so I'm not sure if there any other 
things that depend on the purge actually being complete other than login, but 
if there are...

A safer solution would be to actually have a isLogged() function that doesn't 
only check the value of the is_logged boolean... But, well, this works.

Original comment by mindboos...@gmail.com on 9 Nov 2011 at 9:17

GoogleCodeExporter commented 9 years ago

Probably I was just coding defensively. Other than follow, I can't find any 
actions that require a delay, and follow has its own separate delay built in.

It looks like an action can specify a delay other than the default 0.1 seconds 
if it wants to. It's possible that dig should specify no delay.

Maybe what we should do is have the default be no delay, but still keep the 
code there that will allow an action to specify a delay if needed.

Original comment by miss...@hotmail.com on 11 Nov 2011 at 6:17

GoogleCodeExporter commented 9 years ago

Well... If we don't have any delay there (see my patch here: 
https://github.com/marado/tzmud/commit/5bfe1644da16303fd053826a6c32ad01be77d7cc 
), we can just delay the action call itself in the cases we need to, right?

Original comment by mindboos...@gmail.com on 11 Nov 2011 at 8:05

GoogleCodeExporter commented 9 years ago

Hmm...

Looking closer, I see that teleport is using the delay values:

                room.action(dict(act='teleport_character_away',
                                    delay=0.2,
                                    actor=None,
                                    character=self))
            reactor.callLater(0.4, self.move, destination)
            destination.action(dict(act='teleport_character_in',
                                        delay=0.6,
                                        actor=None,
                                        character=self))

So, it is trying to order those events in to the proper sequence.

How is that holding up under the no delay patch?

Original comment by miss...@hotmail.com on 11 Nov 2011 at 8:53

GoogleCodeExporter commented 9 years ago

You're right, yes. But we can just reactor.callLater(0.4, room.action, dict) 
and reactor.callLater(0.6, destination.action, dict) instead of calling those 
actions now and telling them to only execute in 0.x ...

Original comment by mindboos...@gmail.com on 12 Nov 2011 at 2:42

GoogleCodeExporter commented 9 years ago

Right. I think the code path works out exactly the same.

I guess it is not as clear that that is an asynchronous call, but to my eye, it 
looks cleaner with the delay= value in the dict.

Either way, I don't think the behavior in the face of a bunch of rapid calls 
involving teleport is what would (or should) be expected ...

@teleport to house
@teleport
@teleport to house
@teleport

You feel yourself being ripped away from where you are...
You feel yourself being ripped away from where you are...
You feel yourself being ripped away from where you are...
You feel yourself being ripped away from where you are...
You have been teleported.
void
You have been teleported.
void

While someone watching sees:
lee waves his hands around mysteriously.
lee waves his hands around mysteriously.
lee waves his hands around mysteriously.
lee waves his hands around mysteriously.
lee disappears.
lee disappears.
lee disappears.
lee disappears.
lee appears.
lee appears.

Or even just:
lee appears.
lee appears.

In your work with other MUDS and talkers, have you seen any kind of testing 
framework for this kind of thing?

I think before any major changes begin, it would be good to set up tests that 
specify exactly what should happen given a particular sequence (and timing) of 
commands.

Maybe like:

=w1  # logs in character wizard 1
=p1  # logs in character player 1
=p2
+p2 house  # input from player 2
-w1 p2 leaves to the light.  # output seen by w1
-p1 p2 leaves to the light.
-p2 house
1.5  # wait here for 1.5 seconds before sending next line
+w1 @teleport house
-p1 waves his hands around mysteriously.
-w1 You feel yourself being ripped away from where you are...
-p1 disappears.
-w1 house
-p2 w1 appears.

Original comment by miss...@hotmail.com on 12 Nov 2011 at 2:19

GoogleCodeExporter commented 9 years ago

> I guess it is not as clear that that is an asynchronous call, but to my eye, 
it 
> looks cleaner with the delay= value in the dict.

Either way works for me.

> Either way, I don't think the behavior in the face of a bunch of rapid calls 
> involving teleport is what would (or should) be expected ...

You're right, it isn't. The problem is that in lots of stuff, including 
@teleport, we shouldn't have anything asyncronous, at all. @teleport should:

1) on origin place, say that player "waved his hands"
2) remove character from origin place
3) tell player that he feels being ripped away
4) tell people on origin place that player has disappeared
5) put player on new place (I assume that this function triggers the player 
seeing the place he now is)
6) say in new place "player x appears"

Each of this items must be done in this order, and must be syncronous - meaning 
that each function will only return when all it is meant to do is done.

> In your work with other MUDS and talkers, have you seen any kind of testing 
> framework for this kind of thing?

Nothing, sorry, most of the MUDs and talkers I've seen only work 100% 
syncronously, some being only one non-threaded process...

Original comment by mindboos...@gmail.com on 7 Dec 2011 at 7:11

GoogleCodeExporter commented 9 years ago

OK, I restarted looking at this, and decided that I should investigate the 
teleport issue in particular. Threading through the code, I realized this:

==========
@teleport house
You feel yourself being ripped away from where you are...
@teleport
You feel yourself being ripped away from where you are...
@teleport
You feel yourself being ripped away from where you are...
You have been teleported.
void
==========

Which means that:
1) When you're teleported to somewhere else, you have say "You feel yourself 
being ripped away from where you are...";
2) When you're teleported to the same place you already were, you also see 
(afterwards), "You have been teleported." and the place where you were 
teleported to (which is always where you were already).

I have a couple of considerations here already:
a) do we really want that it is possible to teleport something to where it 
already is? I'm OK with either choice, but it makes more sense to me if you 
simply can't teleport to where you already are...;
b) We obviously shouldn't have a different behaviour, so the question is: 
should we display "You have been teleported." and where to... always or never? 
Once again, I'm OK with either choice, but it makes more sense to me if we 
always display that.

Now, I went to investigate *why* we were showing that info sometimes, and some 
other times we weren't. Seems that the difference is in the function 
"near_teleport_character_in", which is meant to be a function to warn those who 
are in some place that they suddenly started seeing something because it was 
teleported next to the player. IMHO someone who is teleported to where it was 
an instant ago doesn't see anything, even they feel something. So, I think we 
should say "You feel yourself being ripped away from where you are..." but it 
doesn't make sense saying "You have been teleported" (at least not here in the 
code, where the sentence actually means "you see yourself appearing here 
because you have been teleported", which makes no sense) - and since you're in 
the same place it also doesn't make sense that you decide to look around. 

Here's a patch fixing this, please push it to your own code if you agree with 
my reasoning:
https://github.com/marado/tzmud/commit/a952f0da1318fd22750b4fd3f5cdab067909daba

Original comment by mindboos...@gmail.com on 9 Mar 2012 at 3:00

GoogleCodeExporter commented 9 years ago

Back - sorry for the delay in giving you a complete reply. To answer your 
question:

"How is that holding up under the no delay patch?"

The answer is: applying the no delay patch *and* the one in c13, the no delay 
patch still wasn't enough. A quick look was enough to find out why: between 
those two actions with delay (that the "no delay" patch makes synchronous) 
there's this call:

reactor.callLater(0.4, self.move, destination)

which is obviously an async call. 

Replacing it with "self.move(destination)" fixes the @teleport issues. 

So - as far as I am concerned - the issues on *this* bug report are fixed with 
the two aforementioned patches *and* this replacement. 

Now, if we do that, we should create another issue named "validate all 
callLater calls to see if we really want them async". A quick grep shows things 
that surely are meant to be async (database doing pack_regularly, !shutdown 
with delay), but others are probably like teleport, so if we go this way 
there's that extra work needed. 

If you're OK with this solution, I am offering to do the remaining work (create 
that issue, validate each callLater, produce needed patches to fix this once 
for all).

Original comment by mindboos...@gmail.com on 11 Apr 2012 at 7:02

cryptixman / tzmud

crash when sending multiple commands in rapid succession #44