Opened by rpcope1, 10 years ago
@rpcope1, I think you are mixing two different problems while proposing a workaround for the second. Closing the socket, opening a new one, moving existing pending operations from the old to the new, and retrying the outstanding ones seems a bit too much for an operation that timed out.
Adding `retry=False, retry_count=3` options to the operations is possible, but it will only retry on EXPIRED or SERVICE_BUSY responses. Everything else implies a terminal failure and makes no sense to retry.
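A minimal sketch of what that caller-side behavior could look like. The `OperationFailure` exception, its `status` attribute, and the `client.put()` call are assumptions for illustration, not the existing kinetic-py API; only the EXPIRED/SERVICE_BUSY distinction comes from the comment above.

```python
class OperationFailure(Exception):
    """Hypothetical stand-in for a per-operation failure carrying a status code."""
    def __init__(self, status, message=""):
        super().__init__(message)
        self.status = status

# Only these statuses are transient; everything else is a terminal failure.
RETRYABLE = {"EXPIRED", "SERVICE_BUSY"}

def put_with_retry(client, key, value, retry_count=3):
    """Retry only on transient statuses; re-raise terminal failures immediately."""
    last_error = None
    for _ in range(retry_count + 1):
        try:
            return client.put(key, value)       # assumed blocking put() call
        except OperationFailure as exc:
            if exc.status not in RETRYABLE:     # terminal failure: retrying makes no sense
                raise
            last_error = exc
    raise last_error
```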
Now, the second issue is probably related to faulted sockets due to the drive closing the connection or the network getting disconnected.
Doing `close()` then `connect()` should do almost what you are asking for. Adding an explicit `reconnect()` method, which can only be called on closed/faulted (not new) connections and continues with the operations left in the internal queue, is possible.
Important: queued operations pending processing and operations waiting for a reply from the drive are NOT the same. Operations that already left for the drive will not be requeued on `reconnect()`.
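A minimal sketch of those `reconnect()` semantics, assuming a hypothetical client with an internal pending queue; none of these class or attribute names are taken from the real kinetic-py code.

```python
from collections import deque

class ConnectionStateError(Exception):
    """Raised when reconnect() is called on a connection that is not closed/faulted."""

class SketchClient:
    def __init__(self):
        self.state = "new"
        self.pending = deque()    # queued operations not yet sent to the drive
        self.in_flight = {}       # seq -> operation already sent, awaiting a reply

    def connect(self):
        self.state = "connected"  # open the socket (details omitted)

    def close(self):
        self.state = "closed"     # close the socket (details omitted)

    def reconnect(self):
        # Only allowed on closed/faulted connections, never on a brand-new one.
        if self.state not in ("closed", "faulted"):
            raise ConnectionStateError("reconnect() requires a closed or faulted connection")
        self.connect()
        # Re-submit only the operations still waiting in the internal queue.
        # Anything in self.in_flight already left for the drive and is NOT requeued.
        while self.pending:
            self._send(self.pending.popleft())

    def _send(self, op):
        pass  # placeholder: serialize and write the operation to the socket
```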
Note: forget about internal operation sequence numbers. Sending the same operation twice with the same sequence number DOES NOT equal a retry; it is a replay attack and will be detected and discarded by the drive. There are NO semantics in the protocol right now to tell the device, "hey dude, by the way, this is the n'th attempt on this operation".
One thing that would help all of the engineering groups here at LCO, especially reliability, and would likely be a useful feature for external customers, would be the ability to cache the data for outstanding commands (much like you cache the handlers in a queue with the corresponding sequence number), plus a threadedClient/asyncClient method called retry that dumps the existing socket, opens a new one, and resends all the commands in the queue, with each one still attached to the (possibly same) sequence number and callback handlers. This allows us to keep moving forward in a relatively painless way if one command in a sequence times out, or if we have a major fault somewhere in the stream of commands. I would expect the ability to save data for retries to be optional and explicitly specified, as it increases the memory demands of the client linearly with queue size, and could push many systems that aren't chock full of RAM over the edge. Does this make sense?
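A rough sketch of that opt-in command cache. Every name here (the class, `cache_commands`, the `retry()` method) is hypothetical rather than the existing threadedClient/asyncClient API, and reusing the same sequence numbers on resend runs into exactly the replay concern raised above.

```python
class CachingClientSketch:
    def __init__(self, cache_commands=False):
        # Opt-in: caching raw command data grows memory linearly with queue size.
        self.cache_commands = cache_commands
        self.handlers = {}   # seq -> callback handlers (already kept by the client)
        self.commands = {}   # seq -> cached command payload (only if opted in)

    def _submit(self, seq, payload, on_success, on_error):
        self.handlers[seq] = (on_success, on_error)
        if self.cache_commands:
            self.commands[seq] = payload
        self._write(payload)

    def retry(self):
        """Drop the existing socket, open a new one, and resend cached commands."""
        if not self.cache_commands:
            raise RuntimeError("retry() requires cache_commands=True")
        self._reopen_socket()
        for seq, payload in sorted(self.commands.items()):
            # Note: reusing the same sequence number is what the drive would
            # treat as a replay, per the comment above.
            self._write(payload)

    def _write(self, payload):
        pass  # placeholder: write bytes to the socket

    def _reopen_socket(self):
        pass  # placeholder: close the old socket and connect a new one
```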