Closed khera closed 8 years ago
Am seeing some errors when publishing, either truncated data, or too much data. Have included the log output from gnats which includes the last successful publish, and then the publish that causes a parsing error.
[95952] 2016/07/08 17:57:00.539477 [TRC] 127.0.0.1:35663 - cid:1 - ->> [PUB foo 13]
[95952] 2016/07/08 17:57:00.539493 [TRC] 127.0.0.1:35663 - cid:1 - ->> MSG_PAYLOAD: [Hello, World!]
PUB foo 13]6/07/08 17:57:00.539536 [TRC] 127.0.0.1:35663 - cid:1 - ->> [PUB foo 13
PUB foo 13'6/07/08 17:57:00.539594 [ERR] 127.0.0.1:35663 - cid:1 - Error reading from client: processPub Parse Error: 'foo 13
[95952] 2016/07/08 17:58:07.003507 [TRC] 127.0.0.1:35749 - cid:5 - ->> [PUB foo 13]
[95952] 2016/07/08 17:58:07.003513 [TRC] 127.0.0.1:35749 - cid:5 - ->> MSG_PAYLOAD: [Hello, World!]
[95952] 2016/07/08 17:58:07.003519 [TRC] 127.0.0.1:35749 - cid:5 - ->> [PUB foo 13]
[95952] 2016/07/08 17:58:07.004000 [ERR] 127.0.0.1:35749 - cid:5 - Error reading from client: Client Parser ERROR, state=30, i=23199: proto='"\nHello, World!\r\nPUB foo 13\r\nHell"...'
I did not alter the publish (output) side at all. All I updated was the
reading (input) routines. I just replaced the read_line()
and read()
methods from the client, and implemented them using primitives I added to
the connection object.
Could you share which test script you were running that caused this? I'll try to reproduce.
I found a problem with the writes... because the socket is marked as non-blocking, syswrite will occasionally return "EAGAIN" when it cannot complete the write. I'm working up a solution to this. I'm trying to do it without dispatching to another method as that seems to incur an almost 21% penalty in the speed of publish().
I fixed the bug with send. I moved around a little bit of code so all I/O is done inside the Connection object. I'm not totally sure how you like to structure the code, but this made sense to me.
Overall, the change to the publish() method has made it significantly slower according to the benchmark-pub.pl script you include. With the blocking write, I can get almost 700k publish/s but with this "safe" code it drops to around 490k/s.
The code as written now is correct and very solid based on my testing. I cannot make it lose messages outside the guarantees of NATS itself.
@khera Change is looking solid as, nice 👍
The performance change can be looked at later, having your timeout change is more important at the moment.
The timeout was improperly implemented using select() on a buffered file handle. This cannot work because there will be data to use in the buffer even if the handle is not ready to read, thus we deadlock.
We must use our own buffering so we know to call select() only when we are certain we need to fetch more data from the handle.