docker-archive / go-p9p

A modern, performant 9P library for Go.
Apache License 2.0

Unknown tag creates panic #2

Open FrenchBen opened 8 years ago

FrenchBen commented 8 years ago

From our Docker for Mac Bugsnag error: "panic: unknown tag received"

Stack trace

panic unknown tag received 
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/panic.go:481 panic
    .../vendor/github.com/stevvooe/p9p/transport.go:173 (*transport).handle
    .../vendor/github.com/stevvooe/p9p/transport.go:41 newTransport

A couple of places where the p9p library is used:

https://github.com/.../blob/master/v1/abstractions/watch.go
https://github.com/.../blob/master/v1/abstractions/client.go
https://github.com/.../blob/master/v1/docker_proxy/volumes.go

stevvooe commented 8 years ago

Looks like we are hitting this condition: https://github.com/docker/go-p9p/blob/master/transport.go#L173. Throwing a panic here is incorrect, but it would be good to better understand the protocol state that hits this condition.

We'll probably need to build a tracing Channel type to trace the protocol. Application logs before the panic may help to identify the error, as well.
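
Something along these lines might do for tracing, assuming the Channel interface is essentially ReadFcall/WriteFcall (plus the msize accessors picked up by embedding). This is a sketch only, not tested code:

```go
package p9ptrace

import (
	"context"
	"log"

	p9p "github.com/docker/go-p9p"
)

// channel wraps an existing p9p.Channel and logs every Fcall crossing it.
// Any other interface methods (MSize/SetMSize) are satisfied by embedding.
type channel struct {
	p9p.Channel
	prefix string
}

// NewChannel returns a tracing wrapper around ch.
func NewChannel(prefix string, ch p9p.Channel) p9p.Channel {
	return &channel{Channel: ch, prefix: prefix}
}

func (c *channel) ReadFcall(ctx context.Context, fcall *p9p.Fcall) error {
	err := c.Channel.ReadFcall(ctx, fcall)
	log.Printf("%s <- %v (err=%v)", c.prefix, fcall, err)
	return err
}

func (c *channel) WriteFcall(ctx context.Context, fcall *p9p.Fcall) error {
	log.Printf("%s -> %v", c.prefix, fcall)
	return c.Channel.WriteFcall(ctx, fcall)
}
```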

Small nit: please update the import path to github.com/docker/go-p9p.

FrenchBen commented 8 years ago

CC @djs55 @dsheets

kennylevinsen commented 8 years ago

Your tag handling is definitely weird. It would seem that every 65536th message you send is invalid, due to not skipping NOTAG (0xFFFF), a tag reserved for Tversion. Furthermore, your tag handling disregards existing tags, meaning that you will invalidate any pending requests if enough messages fire.

The relevant code is at https://github.com/docker/go-p9p/blob/master/transport.go#L158. You can see how I dealt with the issue at https://github.com/joushou/qptools/blob/master/client/raw.go#L126.

And indeed, a proper error should be returned. Even if the client is correct, the server may violate the protocol. The right thing to do is probably to close up shop and error out nicely for this connection.

stevvooe commented 8 years ago

@joushou Most of this is described here, but I am not sure I understand how the tag allocation is related to this panic. There is no evidence suggesting that this is happening due to the receipt of NOTAG, so your assertion is a bit of a leap. It is more likely that the client is receiving a response to a request that it has long since abandoned, since we know #4 dealt with this issue, to some degree.

We actually had a similar tag allocation pool, but found the locking at that point to be unnecessary. This can probably be fixed by simply incrementing, then checking outstanding. On a collision, continue incrementing. I've filed #5 to address this.
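
As a rough sketch of that approach (assuming the package's Tag type and that NOTAG is 0xFFFF; the outstanding set stands in for whatever the transport uses to track in-flight requests):

```go
package sketch

import p9p "github.com/docker/go-p9p"

// noTag (0xFFFF) is reserved for Tversion, assuming Tag is a uint16-like type.
const noTag = ^p9p.Tag(0)

// nextTag sketches increment-and-skip allocation: bump the counter, skip
// NOTAG, and skip any tag that still has a request outstanding. Assumes
// fewer than 65535 requests are ever in flight at once.
func nextTag(last p9p.Tag, outstanding map[p9p.Tag]struct{}) p9p.Tag {
	tag := last
	for {
		tag++ // wraps naturally from 0xFFFF back to 0
		if tag == noTag {
			continue
		}
		if _, inflight := outstanding[tag]; inflight {
			continue // collision with a pending request: keep incrementing
		}
		return tag
	}
}
```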

For this particular bug, the question is what is the correct error? The client is receiving a message that it is no longer listening for. This doesn't necessarily mean that there is a protocol error, so shutting down the session is incorrect. It might be sufficient to log and ignore, but research is required.
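
As a sketch only, log-and-ignore at the point where the panic currently fires might look roughly like this; the outstanding map and channel shape here are assumptions for illustration, not the actual transport internals:

```go
package sketch

import (
	"log"

	p9p "github.com/docker/go-p9p"
)

// dispatch hands a received Fcall to the waiter registered for its tag, or
// logs and drops it if nothing is waiting, rather than panicking or tearing
// down the whole session.
func dispatch(outstanding map[p9p.Tag]chan *p9p.Fcall, fcall *p9p.Fcall) {
	waiter, ok := outstanding[fcall.Tag]
	if !ok {
		// Response for a tag we no longer track, e.g. an abandoned request.
		log.Printf("p9p: dropping response with unknown tag %v: %v", fcall.Tag, fcall)
		return
	}
	waiter <- fcall
}
```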

kennylevinsen commented 8 years ago

  1. If a tag for a pending request is overwritten, two requests on one tag will be fired. This either involves violating protocol if the server hasn't already sent the response for the first request, in which case it may do whatever, including terminating the connection or sending two responses in a row on the same tag, which will hit the panic.

It can also happen without violating protocol, as your reader can be blocked on the channel trying to push the response while a request allocates the tag again - the old response then goes the wrong way, and the new response will be on an unknown tag, which will hit the panic.

In my experience, a full tag loop, even for simple uses, can easily happen in a second, and trampling on existing connections does sound plausible.

  2. A request using NOTAG violates protocol, and a server may do whatever. A proper server will probably just Rerror, but who knows - the server might assume a sane client.

The locking is not relevant for you, as you only have one point of access (the servicing goroutine). There is also an aspect of tag handling related to its allocated state that doesn't matter for you; this is because my transport equivalent only provides a tag pool and doesn't perform the allocation itself.

In my implementation, tags are part of the message, as the user needs to know the tag in order to make a Tflush message, so I did not want the transport to mess with it. You're probably going to have fun implementing Tflush in your current design.

If it's neither 1 nor 2, and your message decoder didn't produce the invalid message by decoding incorrectly, then it might be the server.

As for the handling, assuming the client is correct and the server sends a response on an unknown tag, then it is not safe to make assumptions about what state the connection is in. The server sent you a response you didn't ask for, or sent a response the wrong way. This might mean that you're waiting for an Rwrite to keep writing, leaving a file in an inconsistent state because the write didn't finish. It could also mean that the server parsed a message wrong or otherwise decided to do something entirely unexpected.

However, I have never had this happen in real life, and I doubt that it's something you should fear. I broke message decoding once, causing tag errors, but a server doesn't respond to a nonexistent tag on its own initiative unless something is very broken. Limping away from errors can be quite a dangerous practice.

kennylevinsen commented 8 years ago

But indeed, if the timeout is "delete tag from pool", then that will fail too the moment the timeout is met. I just looked at the tag pool and concluded that tag collisions will happen.

stevvooe commented 8 years ago

@joushou Thanks for your commentary here! It will be helpful in ensuring that this implementation can be robust.

> In my experience, a full tag loop, even for simple uses, can easily happen in a second, and trampling on existing connections does sound plausible.

This is good to know. During implementation I wasn't sure how long session lifetimes would be.

> You're probably going to have fun implementing Tflush in your current design.

We are actually handling Tflush within the transport and server, automatically. Clients can cancel an operation through the context and the flush is handled automatically. There are some details that are nasty to get right for flush handling and we thought it would be better to hide these in the implementation.
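
A rough usage sketch of what that looks like from the caller's side (assuming p9p.NewSession over a net.Conn and the context-taking Read; attach/walk/open are omitted, and the address is made up):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	p9p "github.com/docker/go-p9p"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:5640") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	session, err := p9p.NewSession(ctx, conn)
	if err != nil {
		log.Fatal(err)
	}

	// Attach/walk/open omitted for brevity. The point: if this read is still
	// pending when the context is cancelled, the caller simply abandons it;
	// the transport is responsible for issuing the Tflush.
	buf := make([]byte, 1024)
	if _, err := session.Read(ctx, 1, buf, 0); err != nil {
		log.Printf("read aborted: %v", err)
	}
}
```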

Since most of this is more related to #5, do you mind if we take this conversation there?

stevvooe commented 8 years ago

@joushou Have you seen this since changing the tag allocation code?

kennylevinsen commented 8 years ago

Asking OP might be better. I couldn't trigger any problems during my tests, but I don't use the library daily. I just fixed the tag allocation code to ensure that the client didn't break protocol in that area. Other protocol botches (including server side) could also cause the error reported by OP.

stevvooe commented 8 years ago

@FrenchBen Same question to you ^.

FrenchBen commented 8 years ago

@stevvooe still getting it:

panic unknown tag received: Rread(20250) 
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/panic.go:481 panic
    .../vendor/github.com/docker/go-p9p/transport.go:178 (*transport).handle
    .../vendor/github.com/docker/go-p9p/transport.go:41 newTransport

kennylevinsen commented 8 years ago

Are you sure the patched lib is in use? It might be server-side, then.

FrenchBen commented 8 years ago

@joushou just grabbed the latest to be safe and will report back