Negative number of lost ZMQ messages

unsystemizer commented 6 years ago

I noticed this negative number of lost messages. Maybe the formula isn't completely correct.

root@indexd:/indexd# forever index.js
warn:    --minUptime not set. Defaulting to: 1000ms
warn:    --spinSleepTime not set. Your script will exit if it does not stay up for at least 1000ms
2017-10-28T12:08:15.260Z index Initializing blockchain connection (for testnet)
2017-10-28T12:08:15.350Z index starting API server
2017-10-28T12:08:15.357Z index App listening on port 18432
2017-10-28T12:23:43.160Z zmq -494 messages lost
2017-10-28T12:23:43.451Z zmq 493 messages lost

deweller commented 6 years ago

For some context here, this is from running: https://github.com/CounterpartyXCP/indexd-server

But the code is almost a direct copy-and-paste from https://github.com/bitcoinjs/indexd/blob/master/example/service.js.

dcousens commented 6 years ago

To quote @laanwj from https://github.com/bitcoin/bitcoin/pull/7762#issuecomment-212895281

sequence number goes to 0 (and wasn't at 0xffffffff) -> bitcoind restarted sequence number skips a beat -> send buffer was full

If the number displayed is negative, from ${sequence - lastSequence - 1} messages lost, then it means sequence was probably 0 [at some point]. This might mean that bitcoind restarted unexpectedly.

unsystemizer commented 6 years ago

I'll have to repeat this exercise, but I think I get more than just one pair of those. Could it be that RPC request queue full errors trigger these? Please don't close this just yet. Although it's a cosmetic issue it'd be good to know why it happens.

Edit: I found one edge case, but there may be others - if you run indexd in foreground and bitcoind in background (daemonized), then if you lose SSH connection to the system, that will cause lost messages.

dcousens commented 6 years ago

@unsystemizer use tmux or screen? I have no idea what state the programs will be in if you are dropping connections sporadically, but lost messages seems like the least of your issues.

unsystemizer commented 6 years ago

I use just Putty to SSH to the server and then SSH to the container (all on local LAN). I did notice a lot of these yesterday as well when I didn't have any disconnects (I ran a full reindex of testnet to compare time vs. addrindexd), but lets keep this closed. I suspect it's a matter of overloading the system.

dcousens commented 6 years ago

@unsystemizer I read somewhere (can't find it now) that you were hitting the RPC work depth, so I patched yajrpc to handle how bitcoind handles that error to ensure the error is thrown properly and not interpreted as a null result.

Let me know if that error isn't clear in future.

unsystemizer commented 6 years ago

@dcousens at first yes, because in indexd-server's default .env it's set to 32, and Core defaults to 16. I changed that 12 on both sides, hoping to put less pressure on the system. Maybe I should also lower the batch size or try to make some other adjustments, but seeing that Core still has 2 active RPC related bugs (crash/freeze), I gave up on this until they fix those bugs.

dcousens commented 6 years ago

2 active RPC related bugs (crash/freeze)

Link? I haven't experienced this in production.

unsystemizer commented 6 years ago

I haven't either, but they seem reproducible & real. There's one for valid and one for invalid (spam) calls: https://github.com/bitcoin/bitcoin/issues/11368 https://github.com/bitcoin/bitcoin/issues/11322 Edit: https://github.com/bitcoin/bitcoin/pull/11593 (PR)

I just deleted all indexd data, updated to 0.8.2, lowered RPCBATCHSIZE in indexd app from 500 to 200, and increased rpcthreads to 8 (my number of cores). With dbcache 8G and rpcworkqueue depth of 12 (both indexd and bitcoin), that makes this particular system lightly loaded and I'm saw lost messages less frequently. I should run it for longer, but in all likelihood "less aggressive settings = fewer timeouts" seems reasonable. I'll do one more run with rpcworkqueue depth of 8 (I'll leave bitcoind at 12) and RPC batch size of 100 and edit this comment later.

Did 2 more "from scratch" tests; the first with 32 RPC batch size and 4 concurrency and the second with 100 and 8, respectively. I got fewer instances of lost messages in the first case.

root@indexd:~# cat indexd-batch32-r4.txt | grep lost | wc -l
13
root@indexd:~# cat indexd-batch100-r8.txt | grep lost | wc -l
26

root@indexd:~# cat indexd-batch32-r4.txt | grep lost
Wed, 01 Nov 2017 13:24:31 GMT zmq -6416 messages lost
Wed, 01 Nov 2017 13:24:37 GMT zmq 6415 messages lost
Wed, 01 Nov 2017 13:44:34 GMT zmq -6605 messages lost
Wed, 01 Nov 2017 13:44:38 GMT zmq 6604 messages lost
Wed, 01 Nov 2017 14:04:41 GMT zmq -6768 messages lost
Wed, 01 Nov 2017 14:04:43 GMT zmq 6767 messages lost
Wed, 01 Nov 2017 14:24:43 GMT zmq -6965 messages lost
Wed, 01 Nov 2017 14:24:46 GMT zmq 6964 messages lost
Wed, 01 Nov 2017 14:45:04 GMT zmq 216 messages lost
Wed, 01 Nov 2017 15:04:52 GMT zmq -8274 messages lost
Wed, 01 Nov 2017 15:04:54 GMT zmq 8273 messages lost
Wed, 01 Nov 2017 15:24:56 GMT zmq -8462 messages lost
Wed, 01 Nov 2017 15:25:06 GMT zmq 8461 messages lost

root@indexd:~# cat indexd-batch100-r8.txt | grep lost
Wed, 01 Nov 2017 09:03:39 GMT zmq -2319 messages lost
Wed, 01 Nov 2017 09:03:41 GMT zmq 2318 messages lost
Wed, 01 Nov 2017 09:23:43 GMT zmq -2737 messages lost
Wed, 01 Nov 2017 09:23:47 GMT zmq 2736 messages lost
Wed, 01 Nov 2017 09:43:46 GMT zmq -3351 messages lost
Wed, 01 Nov 2017 09:43:50 GMT zmq 3350 messages lost
Wed, 01 Nov 2017 10:03:52 GMT zmq -3543 messages lost
Wed, 01 Nov 2017 10:03:57 GMT zmq 3542 messages lost
Wed, 01 Nov 2017 10:23:57 GMT zmq -3820 messages lost
Wed, 01 Nov 2017 10:24:00 GMT zmq 3819 messages lost
Wed, 01 Nov 2017 10:44:03 GMT zmq -4286 messages lost
Wed, 01 Nov 2017 10:44:07 GMT zmq 4285 messages lost
Wed, 01 Nov 2017 11:04:07 GMT zmq -4536 messages lost
Wed, 01 Nov 2017 11:04:12 GMT zmq 4535 messages lost
Wed, 01 Nov 2017 11:24:10 GMT zmq -4691 messages lost
Wed, 01 Nov 2017 11:24:26 GMT zmq 4690 messages lost
Wed, 01 Nov 2017 11:44:13 GMT zmq -4995 messages lost
Wed, 01 Nov 2017 11:44:16 GMT zmq 4994 messages lost
Wed, 01 Nov 2017 12:04:17 GMT zmq -5283 messages lost
Wed, 01 Nov 2017 12:04:23 GMT zmq 5282 messages lost
Wed, 01 Nov 2017 12:24:20 GMT zmq -5563 messages lost
Wed, 01 Nov 2017 12:24:21 GMT zmq 5562 messages lost
Wed, 01 Nov 2017 12:44:22 GMT zmq -5788 messages lost
Wed, 01 Nov 2017 12:44:41 GMT zmq 5787 messages lost
Wed, 01 Nov 2017 13:04:28 GMT zmq -6130 messages lost
Wed, 01 Nov 2017 13:04:32 GMT zmq 6129 messages lost

unsystemizer commented 6 years ago

Mystery solved, it seems. Well, at least on my system. This is with indexd 0.8.2 and bitcoind 0.15.1rc1, but I've been seeing this all along. Luckily, not a big deal.

bitcoinjs / indexd

Negative number of lost ZMQ messages #21