fix timout by using our own input buffer

khera commented 8 years ago

The timeout was improperly implemented using select() on a buffered file handle. This cannot work because there will be data to use in the buffer even if the handle is not ready to read, thus we deadlock.

We must use our own buffering so we know to call select() only when we are certain we need to fetch more data from the handle.

carwynmoore commented 8 years ago

Am seeing some errors when publishing, either truncated data, or too much data. Have included the log output from gnats which includes the last successful publish, and then the publish that causes a parsing error.

[95952] 2016/07/08 17:57:00.539477 [TRC] 127.0.0.1:35663 - cid:1 - ->> [PUB foo 13]
[95952] 2016/07/08 17:57:00.539493 [TRC] 127.0.0.1:35663 - cid:1 - ->> MSG_PAYLOAD: [Hello, World!]
PUB foo 13]6/07/08 17:57:00.539536 [TRC] 127.0.0.1:35663 - cid:1 - ->> [PUB foo 13
PUB foo 13'6/07/08 17:57:00.539594 [ERR] 127.0.0.1:35663 - cid:1 - Error reading from client: processPub Parse Error: 'foo 13

[95952] 2016/07/08 17:58:07.003507 [TRC] 127.0.0.1:35749 - cid:5 - ->> [PUB foo 13]
[95952] 2016/07/08 17:58:07.003513 [TRC] 127.0.0.1:35749 - cid:5 - ->> MSG_PAYLOAD: [Hello, World!]
[95952] 2016/07/08 17:58:07.003519 [TRC] 127.0.0.1:35749 - cid:5 - ->> [PUB foo 13]
[95952] 2016/07/08 17:58:07.004000 [ERR] 127.0.0.1:35749 - cid:5 - Error reading from client: Client Parser ERROR, state=30, i=23199: proto='"\nHello, World!\r\nPUB foo 13\r\nHell"...'

khera commented 8 years ago

I did not alter the publish (output) side at all. All I updated was the reading (input) routines. I just replaced the read_line() and read() methods from the client, and implemented them using primitives I added to the connection object.

Could you share which test script you were running that caused this? I'll try to reproduce.

khera commented 8 years ago

I found a problem with the writes... because the socket is marked as non-blocking, syswrite will occasionally return "EAGAIN" when it cannot complete the write. I'm working up a solution to this. I'm trying to do it without dispatching to another method as that seems to incur an almost 21% penalty in the speed of publish().

khera commented 8 years ago

I fixed the bug with send. I moved around a little bit of code so all I/O is done inside the Connection object. I'm not totally sure how you like to structure the code, but this made sense to me.

Overall, the change to the publish() method has made it significantly slower according to the benchmark-pub.pl script you include. With the blocking write, I can get almost 700k publish/s but with this "safe" code it drops to around 490k/s.

The code as written now is correct and very solid based on my testing. I cannot make it lose messages outside the guarantees of NATS itself.

carwynmoore commented 8 years ago

@khera Change is looking solid as, nice 👍

The performance change can be looked at later, having your timeout change is more important at the moment.

carwynmoore / perl-nats

fix timout by using our own input buffer #2