Open liorn opened 9 years ago
@awestendorf, I'll be happy to hear your opinion about this.
gevent
change will break haighaEventletTransport
, but can't because of this issue.Thanks!
This should only be a problem when you're using the same channel across greenlets/eventlets/threads though right?
True. It's a common use case, no?
It is, I just wanted to make sure I understood the problem correctly. The simplest case would be to create a channel, or pull one from a pool, for each publish on each thread. Although there's a limit to the number of channel ids, they're relatively light on both sides of the connection. There's haigha.channel_pool
, but that's old code for a transactional use case that I don't use any more. I see 3 possible ways to go with this.
1.
, improve the ChannelPool
so that it supports the non-transactional use case by implementing a lock around the publish.BasicClass
, which is an instance per channel and so can maintain its own lock relatively easily.In either case where we handle a lock, I think we shouldn't assume that the user has monkey-patched the thread library, and so there will have to be some abstractions that allow BasicClass
and/or ChannelPool
to use a lock constructor that would be defined within a Transport
implementation.
Actually, with regard to option 3
, that only solves the case where someone tries to publish on multiple threads. If they do any other operation, that too could result in a frame error. So the only way to really make that work well is to grab a lock inside every method that wants to call send_frame
, which I can see quickly leading to a lot of bugs (e.g. synchronous transport, try/finally
debugging, etc) and likely seriously slow things down, which in turn means I think most users would want the option to control that themselves, leading to even more complications.
I'm leaning in the direction of a helper such as ChannelPool
and documentation. An important but subtle detail such as this could easily be lost in documentation though.
Perhaps a way around all those complications and to fix the bug is to change send_frame
to send_frames
, so that a single logical "transaction" is carried all the way down to the transport layer, where the lock can be held while iterating over the whole list of frames to be sent.
That sounds simpler. I'll take a look at the code this weekend and try and submit a patch for this. Thanks!
Short update: the refactoring is more significant that I expected. We currently bypassed the issue with application-level semaphores, hope to get back to this soon.
When publishing a large message,
BasicClass.publish
creates multiple frames and sends them one by one. When using async transports such asgevent
, or the currently unsupportedeventlet
, there is no guarantee that we won't switch to another greenlet between frames. When multiple greenlets use the same connection to publish large messages, frames might get interleaved. Today, this doesn't occur withgevent
by chance - but future changes ingevent
s greenlet switching mechanism might break this.I've come across this problem with an
eventlet
Transport (which I will later contribute tohaigha
), in which the problem occurs. It's easy to reproduce - I spawn multiple publishers which continuously publish large messages, and very soon after I get disconnected from RabbitMQ with this message (or a similar one): _Connection closed: 505 : UNEXPECTEDFRAME - expected content body, got non content body frame instead Tracing the operations done shows the exact problem - multiple greenlets are waiting on the singleEventletTransport._write_lock
(similarly toGeventTransport._write_lock
), andsend_frame
calls are interleaved.A possible solution would be to acquire and release a semaphore for each message, not each frame... but the locking mechanism, today, is part of
BasicClass
and notTransport
thus can't be library specific.