basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Getting error "Unknown message code: 11" during get object from Riak [JIRA: RIAK-2941] #879

Closed AliakseiMat closed 8 years ago

AliakseiMat commented 8 years ago

I got the following error while getting an object from Riak:

com.basho.riak.client.core.netty.RiakResponseException: Unknown message code: 11
    at com.basho.riak.client.core.netty.RiakResponseHandler.channelRead(RiakResponseHandler.java:52) ~[com.basho.riak.riak-client-2.0.5.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:318) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:304) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:276) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:318) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:304) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[io.netty.netty-all-4.0.33.Final.jar:4.0.33.Final]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_101]

What does "Unknown message code: 11" mean? Thanks!

lukebakken commented 8 years ago

Hello -

I have opened this issue for tracking; please follow up here: basho/riak-java-client#681

sobomax commented 2 years ago

We are still seeing this once in a while.

martinsumner commented 2 years ago

Are there any errors in the riak logs that occur at the same time?

I'm not sure what message code 11 is referring to. A PUT request uses message code 11 in the PB API, so perhaps it is just saying that the error was in response to a PUT request? I'm not familiar with the Java client, but perhaps we can make sense of what is happening if we can tie the error to a server-side log.
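
For reference, a minimal sketch of the PB message-code mapping around code 11. Only code 11 (RpbPutReq) is asserted in this thread; the other entries are reproduced from memory as an illustration and should be verified against the riak_pb sources (riak_pb_messages.csv):

# Subset of the Riak PB API message codes (illustrative; verify against riak_pb).
RIAK_PB_MESSAGE_CODES = {
    0: "RpbErrorResp",
    1: "RpbPingReq",
    2: "RpbPingResp",
    9: "RpbGetReq",
    10: "RpbGetResp",
    11: "RpbPutReq",   # the code reported in "Unknown message code: 11"
    12: "RpbPutResp",
    13: "RpbDelReq",
    14: "RpbDelResp",
}

def describe_pb_code(code):
    """Return a readable name for a PB message code, if known."""
    return RIAK_PB_MESSAGE_CODES.get(code, "unknown ({0})".format(code))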

sobomax commented 2 years ago

Hi @martinsumner, thanks for looking into this. Basically we are using the Python client, not Java. The condition under which this error happens is that the Riak daemon has just been freshly started and we try to push some objects in. Here is the full stack trace when the error happens:

2022-01-24 21:54:22.882508 SQLTransactionManager: unhandled exception when processing query results
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ssp/sippy/SQLTransactionManager.py", line 323, in dispatch
    res_cb(ress, eps, *callback_parameters)
  File "/home/ssp/scripts/replication/db_bouncer.py", line 266, in gotres1
    retrieve_cb(None, *args)
  File "/home/ssp/scripts/replication/db_bouncer.py", line 438, in retrieve_slave_done
    self.retrieve_done()
  File "/home/ssp/scripts/replication/db_bouncer.py", line 572, in retrieve_done
    self.elected(update_node, select_node, dbsplit)
  File "/home/ssp/scripts/replication/db_bouncer.py", line 379, in elected
    self.on_elected_cb(self.elected2, update_node, select_node, dbsplit)
  File "/home/ssp/scripts/replication/db_bouncer.py", line 305, in on_elected
    cb(*cb_args)
  File "/home/ssp/scripts/replication/db_bouncer.py", line 424, in elected2
    self.riak_client.store_pyo(F'DBProxyParams_{self.conf.db_name}', self.conf)
  File "/home/ssp/lib/python/ssp/riak/Python2Riak.py", line 38, in <lambda>
    self.store_pyo = lambda k, v: self._store_pyo(bucketo, k, v)
  File "/home/ssp/lib/python/ssp/riak/Python2Riak.py", line 33, in _store_pyo
    obj.store()
  File "/usr/local/lib/python3.6/site-packages/riak/riak_object.py", line 267, in store
    timeout=timeout)
  File "/usr/local/lib/python3.6/site-packages/riak/client/transport.py", line 223, in wrapper
    return self._with_retries(pool, thunk)
  File "/usr/local/lib/python3.6/site-packages/riak/client/transport.py", line 153, in _with_retries
    return fn(transport)
  File "/usr/local/lib/python3.6/site-packages/riak/client/transport.py", line 221, in thunk
    return fn(self, transport, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/riak/client/operations.py", line 537, in put
    timeout=timeout)
  File "/usr/local/lib/python3.6/site-packages/riak/transports/tcp/transport.py", line 154, in put
    resp_code, resp = self._request(msg, codec)
  File "/usr/local/lib/python3.6/site-packages/riak/transports/tcp/transport.py", line 551, in _request
    codec.maybe_riak_error(resp_code, data)
  File "/usr/local/lib/python3.6/site-packages/riak/codecs/__init__.py", line 28, in maybe_riak_error
    raise RiakError(bytes_to_str(err.errmsg))
riak.riak_error.RiakError: 'Unknown message code: 11'

The Riak server logs at that point look like the following (console.log):

2022-01-24 21:54:18.597 [info] <0.7.0> Application runtime_tools started on node 'riak@node-c49e1446.'
2022-01-24 21:54:18.691 [info] <0.7.0> Application erlang_js started on node 'riak@node-c49e1446.'
2022-01-24 21:54:18.691 [info] <0.7.0> Application xmerl started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.033 [info] <0.7.0> Application inets started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.033 [info] <0.7.0> Application mochiweb started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.221 [info] <0.7.0> Application webmachine started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.222 [info] <0.7.0> Application basho_stats started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.250 [info] <0.7.0> Application bitcask started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.304 [info] <0.7.0> Application clique started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.305 [info] <0.7.0> Application eleveldb started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.305 [info] <0.7.0> Application pbkdf2 started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.305 [info] <0.7.0> Application poolboy started on node 'riak@node-c49e1446.'
2022-01-24 21:54:19.464 [info] <0.177.0>@exometer_report:do_start_reporters:613 Starting reporters with []
2022-01-24 21:54:19.464 [info] <0.7.0> Application exometer_core started on node 'riak@node-c49e1446.'
2022-01-24 21:54:20.873 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,vnode_routing} = proxy
2022-01-24 21:54:20.961 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,staged_joins} = true
2022-01-24 21:54:21.142 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,resizable_ring} = true
2022-01-24 21:54:21.226 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,fold_req_version} = v2
2022-01-24 21:54:21.244 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,security} = true
2022-01-24 21:54:21.276 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,bucket_types} = true
2022-01-24 21:54:21.362 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,net_ticktime} = true
2022-01-24 21:54:21.997 [info] <0.7.0> Application riak_core started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.029 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_pipe,trace_format} = ordsets
2022-01-24 21:54:22.124 [info] <0.7.0> Application riak_pipe started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.144 [info] <0.7.0> Application sidejob started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.144 [info] <0.7.0> Application riak_dt started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.144 [info] <0.7.0> Application protobuffs started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.145 [info] <0.7.0> Application riak_pb started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.437 [info] <0.7.0> Application riak_api started on node 'riak@node-c49e1446.'
2022-01-24 21:54:22.703 [info] <0.365.0>@riak_kv_env:doc_env:46 Environment and OS variables:
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: Open file limit: 20000
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: No core size limit
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: Erlang process limit: 262144
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: Erlang ports limit: 262144
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: ETS table count limit: 256000
2022-01-24 21:54:22.718 [info] <0.365.0> riak_kv_env: Thread pool size: 64
2022-01-24 21:54:22.826 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,object_hash_version} = 0
2022-01-24 21:54:22.836 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,anti_entropy} = enabled_v1
2022-01-24 21:54:22.847 [info] <0.389.0>@riak_core_throttle:enable_throttle:119 Enabling throttle for riak_kv/aae_throttle.
2022-01-24 21:54:23.020 [warning] <0.6.0> lager_error_logger_h dropped 1 messages in the last second that exceeded the limit of 100 messages/sec
2022-01-24 21:54:23.031 [info] <0.396.0>@riak_kv_ttaaefs_manager:init:263 Initiated Tictac AAE Full-Sync Mgr with scope=disabled
2022-01-24 21:54:23.073 [info] <0.401.0>@riak_core:wait_for_service:527 Waiting for service riak_kv to start (0 seconds)
2022-01-24 21:54:23.187 [info] <0.420.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.420.0>)
2022-01-24 21:54:23.192 [info] <0.421.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.421.0>)
2022-01-24 21:54:23.197 [info] <0.422.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.422.0>)
2022-01-24 21:54:23.202 [info] <0.423.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.423.0>)
2022-01-24 21:54:23.207 [info] <0.424.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.424.0>)
2022-01-24 21:54:23.212 [info] <0.425.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.425.0>)
2022-01-24 21:54:23.217 [info] <0.426.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.426.0>)
2022-01-24 21:54:23.222 [info] <0.427.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.427.0>)
2022-01-24 21:54:23.227 [info] <0.429.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.429.0>)
2022-01-24 21:54:23.231 [info] <0.430.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.430.0>)
2022-01-24 21:54:23.236 [info] <0.431.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.431.0>)
2022-01-24 21:54:23.240 [info] <0.432.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.432.0>)
2022-01-24 21:54:23.244 [info] <0.433.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.433.0>)
2022-01-24 21:54:23.249 [info] <0.434.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.434.0>)
2022-01-24 21:54:23.254 [info] <0.436.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.436.0>)
2022-01-24 21:54:23.258 [info] <0.437.0>@riak_kv_js_vm:init:73 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.437.0>)
2022-01-24 21:54:23.267 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,vnode_vclocks} = true
2022-01-24 21:54:23.277 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,legacy_keylisting} = false
2022-01-24 21:54:23.285 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,listkeys_backpressure} = true
2022-01-24 21:54:23.292 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,index_backpressure} = true
2022-01-24 21:54:23.299 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,mapred_system} = pipe
2022-01-24 21:54:23.306 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,mapred_2i_pipe} = true
2022-01-24 21:54:23.313 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,handoff_data_encoding} = encode_raw
2022-01-24 21:54:23.320 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,object_format} = v1
2022-01-24 21:54:23.328 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,secondary_index_version} = v3
2022-01-24 21:54:23.335 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,vclock_data_encoding} = encode_zlib
2022-01-24 21:54:23.342 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,crdt} = [pncounter,riak_dt_pncounter,riak_dt_orswot,riak_dt_map,riak_kv_hll,riak_dt_gset]
2022-01-24 21:54:23.350 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,crdt_epoch_versions} = [{riak_dt_map,2},{riak_dt_orswot,2},{riak_dt_pncounter,2},{riak_kv_hll,1},{riak_dt_gset,2}]
2022-01-24 21:54:23.357 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,put_fsm_ack_execute} = enabled
2022-01-24 21:54:23.365 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,get_request_type} = head
2022-01-24 21:54:23.372 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,put_soft_limit} = true

erlang.log:

===== Mon Jan 24 21:49:58 PST 2022

=====
===== LOGGING STARTED Mon Jan 24 21:54:13 PST 2022
=====
config is OK
-config /usr/local/lib/riak/data/generated.configs/app.2022.01.24.21.54.14.config -args_file /usr/local/lib/riak/data/generated.configs/vm.2022.01.24.21.54.14.args -vm_args /usr/local/lib/riak/data/generated.configs/vm.2022.01.24.21.54.14.args
!!!!
!!!! WARNING: ulimit -n is 20000; 65536 is the recommended minimum.
!!!!
Exec:  /usr/local/lib/riak/bin/../erts-5.10.3/bin/erlexec -boot /usr/local/lib/riak/bin/../releases/2.9.9/riak               -config /usr/local/lib/riak/data/generated.configs/app.2022.01.24.21.54.14.config -args_file /usr/local/lib/riak/data/generated.configs/vm.2022.01.24.21.54.14.args -vm_args /usr/local/lib/riak/data/generated.configs/vm.2022.01.24.21.54.14.args              -pa /usr/local/lib/riak/bin/../lib/basho-patches -- console
Root: /usr/local/lib/riak/bin/..
Erlang R16B02_basho10 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true] [frame-pointer]

Eshell V5.10.3  (abort with ^G)
(riak@node-c49e1446.)1>

The condition seems to persist only very briefly; we are hitting this error roughly once in 10-15 runs. The workaround I see on our end would be to catch this error and retry the operation after a small back-off interval (see the sketch below), but perhaps a more elegant solution exists to delay processing requests on the server side until it has fully initialized? Thanks!
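
A minimal sketch of that retry workaround, assuming the Python riak client from the traceback above; the helper name and the back-off parameters are hypothetical:

from time import sleep
from riak.riak_error import RiakError

def store_with_retry(obj, attempts=5, backoff=0.5):
    # Retry obj.store() when the node accepts connections but its PB API is
    # not yet registered, which surfaces as "Unknown message code: 11".
    for attempt in range(attempts):
        try:
            return obj.store()
        except RiakError as ex:
            if 'Unknown message code' not in str(ex) or attempt == attempts - 1:
                raise
            sleep(backoff * (2 ** attempt))  # exponential back-off between retries

With a helper like this, the obj.store() call in _store_pyo would become store_with_retry(obj).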

sobomax commented 2 years ago

P.S. Based on the timestamps, we are hitting Riak somewhere around these times:

2022-01-24 21:54:22.826 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,object_hash_version} = 0
2022-01-24 21:54:22.836 [info] <0.225.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_kv,anti_entropy} = enabled_v1
2022-01-24 21:54:22.847 [info] <0.389.0>@riak_core_throttle:enable_throttle:119 Enabling throttle for riak_kv/aae_throttle.
2022-01-24 21:54:23.020 [warning] <0.6.0> lager_error_logger_h dropped 1 messages in the last second that exceeded the limit of 100 messages/sec
2022-01-24 21:54:23.031 [info] <0.396.0>@riak_kv_ttaaefs_manager:init:263 Initiated Tictac AAE Full-Sync Mgr with scope=disabled
2022-01-24 21:54:23.073 [info] <0.401.0>@riak_core:wait_for_service:527 Waiting for service riak_kv to start (0 seconds)

sobomax commented 2 years ago

P.P.S. I think this message might have something to do with the issue:

/var/log/riak/console.log.0:2022-01-25 07:27:06.296 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 07:39:31.159 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 07:41:20.237 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 10:14:12.027 [warning] <0.6.0> lager_error_logger_h dropped 12 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 10:24:59.042 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 10:39:22.137 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 10:52:36.214 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 12:04:28.288 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 12:28:59.003 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 12:30:37.006 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 12:32:52.825 [warning] <0.6.0> lager_error_logger_h dropped 2 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 13:28:54.008 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 17:01:30.064 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 17:01:40.002 [warning] <0.6.0> lager_error_logger_h dropped 15 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-25 20:22:59.224 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 07:56:49.219 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 07:56:58.004 [warning] <0.6.0> lager_error_logger_h dropped 13 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 08:25:15.117 [warning] <0.6.0> lager_error_logger_h dropped 1 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 08:27:12.757 [warning] <0.6.0> lager_error_logger_h dropped 15 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 11:07:24.181 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 11:27:11.253 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 12:09:53.112 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 12:11:22.262 [warning] <0.6.0> lager_error_logger_h dropped 3 messages in the last second that exceeded the limit of 100 messages/sec
/var/log/riak/console.log.0:2022-01-26 13:10:13.097 [warning] <0.6.0> lager_error_logger_h dropped 1 messages in the last second that exceeded the limit of 100 messages/sec

The timestamps and number of errors correlate pretty well with what we are seeing on the client end:


/var/log/db_bouncer.log.0.bz2-1: 2022-01-25 07:41:20: cdrs: DoubleTCPProxy[-9223372034704545518]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-25 07:41:20: balances: DoubleTCPProxy[2151440290]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-25 07:41:21: sippy: DoubleTCPProxy[-9223372034703331580]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 06:52:36: cdrs: DoubleTCPProxy[2150381302]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 06:52:36: sippy: DoubleTCPProxy[2150388495]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 06:52:36: balances: DoubleTCPProxy[-9223372034704387219]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:25:14: sippy: DoubleTCPProxy[-9223372034703979099]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:25:14: cdrs: DoubleTCPProxy[-9223372034703979134]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:25:14: balances: DoubleTCPProxy[2150803790]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:59:12: cdrs: DoubleTCPProxy[2150330018]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:59:12: balances: DoubleTCPProxy[2150386622]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 08:59:12: sippy: DoubleTCPProxy[-9223372034704384992]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 09:59:22: cdrs: DoubleTCPProxy[2150380208]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 09:59:22: sippy: DoubleTCPProxy[-9223372034704392255]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 09:59:22: balances: DoubleTCPProxy[2150391076]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 11:27:11: sippy: DoubleTCPProxy[2150381765]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 11:27:11: balances: DoubleTCPProxy[2150388011]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 11:27:11: cdrs: DoubleTCPProxy[-9223372034704387703]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 12:11:22: sippy: DoubleTCPProxy[-9223372034704132064]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 12:11:22: balances: DoubleTCPProxy[-9223372034704131966]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-26 12:11:22: cdrs: DoubleTCPProxy[2150651442]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-27 02:31:43: cdrs: DoubleTCPProxy[-9223372034703966541]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-27 02:31:43: sippy: DoubleTCPProxy[-9223372034703966478]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
/var/log/db_bouncer.log.0.bz2-1: 2022-01-27 02:31:43: balances: DoubleTCPProxy[-9223372034703959706]: second proxy started
/var/log/db_bouncer.log.0.bz2:'Unknown message code: 11'
martinsumner commented 2 years ago

To start up riak_kv, a lot of things need to happen in order; the function that starts things up is start/2 in https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_app.erl.

The bulk of the work is in https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_app.erl#L145 - where all the supervised processes are started (via https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_sup.erl#L42). It then starts to register capabilities. The very last thing it does is to register the API so it can handle PB/HTTP requests - https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_app.erl#L256-L261.

So I suspect that although Riak has commenced the start process, and that may include the TCP ports becoming active, if the API has not been registered yet you may get a strange error code.

I don't think there's a neat solution within Riak. While Riak is still completing startup it may already be listening on the TCP port before it is ready to service requests, so if you have a load balancer that is just running a SYN check on the TCP port, Riak may appear available prematurely. A deeper availability check would be required, or some retry logic as you suggest.

sobomax commented 2 years ago

@martinsumner few things I can comment on:

  1. Some nicer exception on the Python/Java/whatever side is probably needed if this turns out to be a long-standing architectural issue.
  2. There might simply be something fishy with the limit checks when Riak has just started and lots of requests come in in a rush (i.e. the system is booting up and various sub-systems are pushing keys in/out). You might consider some kind of grace period where that limit enforcement is softened or disabled altogether.

sobomax commented 2 years ago

Martin, the follow-up question I have is: what kind of error are we going to see on the client side if we exceed this limit during the normal operational phase? Is it going to be "Unknown message code: 11" or something more reasonable? I am on the road right now, but that should be easy to test once I am back online.

sobomax commented 2 years ago

P.S. That all being said, we get "startup ok" (which I believe entails sending a "ping" API call to Riak) on the console before we try to push anything in, so at least in theory I don't see any reason why that could not be delayed until all internal subsystems are fully up and running. Of course, if we still hit Riak before that, all bets could be off as you said, but that is not the condition under which we are seeing this particular error.

martinsumner commented 2 years ago

Looking into when the start script returns startup ok: the script is generated by [relx](https://github.com/erlware/relx) (this applies to Riak >= 3.0; a Basho-specific solution is used in < 3.0), and we configure relx just to wait for riak_core_node_watcher.

riak_core is loaded prior to riak_kv, so at this stage Riak is not yet usable.

We could configure relx to wait for the riak_kv service, instead of just the node watcher. However, there are some cases, particularly when individual vnodes are repairing, where that may take a very long time, and causing the console to hang for that long might not be the correct solution.

Different language clients have ended up with different workarounds for this (calling APIs to confirm the node is ready), but these workarounds have had some issues (I know there are some issues with the latest Python client getting incorrect capabilities at startup when talking to a not-yet-initialised node), and they aren't effective when a load balancer is used, as the client doesn't know it is establishing a connection to a new server.

Both the HTTP and PB APIs support a `ping` request. This is different to `riak ping` at the command line, which uses relx under the hood (and just checks the Erlang node is up). The HTTP/PB ping API won't work until the API services have started, and I assume at least the HTTP API ping could be used as a healthcheck by a load balancer.
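
A minimal sketch of a readiness gate built on that PB-level ping, assuming the Python riak client used above; the helper name, host/port defaults and timeout are hypothetical (the HTTP API's /ping endpoint, on port 8098 by default, could serve the same role as a load-balancer healthcheck):

from time import sleep, time
from riak import RiakClient

def wait_until_ready(host='127.0.0.1', pb_port=8087, timeout=30.0):
    # Poll the PB-level ping until the API services have been registered,
    # rather than trusting that the TCP port is merely accepting connections.
    client = RiakClient(protocol='pbc', host=host, pb_port=pb_port)
    deadline = time() + timeout
    while time() < deadline:
        try:
            if client.ping():
                return client
        except Exception:
            pass  # the port may be listening before the API is registered
        sleep(0.5)
    raise RuntimeError('Riak PB API did not become ready within %.0fs' % timeout)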

The obscure/confusing "Unknown message code: 11" might be hard to improve on the Riak side. I will have a look tomorrow. I suspect it is the underlying protocol buffer code that generates the message, as we haven't registered the message codes (including 11 for a PUT request) in the startup path yet. Registering the services earlier will just lead to a lot of other nasty errors related to dependent parts not being started. There's not an easy switch to flick to give something more meaningful in either the logs or the response.

It is worth noting that most of the clients are no longer actively developed. When Riak went open source we set the bar for any new release at having updated Erlang clients for testing, but not at trying to maintain all the other clients. At the time, most of the big users were in effect rolling their own clients.

Apologies if this all sounds a bit unhelpful and apathetic, but it is difficult to improve the ecosystem at times, especially as we try to maintain backwards compatibility.