basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Atom table limit hit if `riak admin` called regularly #1066

Open Bob-The-Marauder opened 3 years ago

Bob-The-Marauder commented 3 years ago

One of our customers found an issue with KV 3.0.3 where the atom table is eventually exhausted if `riak admin` is called regularly, e.g. when polling `riak admin status` for monitoring purposes. This was traced to a problem with relx in pre-OTP 23 builds. We have filed the following PR: https://github.com/erlware/relx/pull/868

Here is a brief example where the atom count increases.

```
[root@localhost riak]# riak start
[root@localhost riak]# riak attach
Attaching to /tmp/erl_pipes/riak@127.0.0.1/erlang.pipe.1 (^D to exit)

(riak@127.0.0.1)1> erlang:system_info(atom_count).
52654
(riak@127.0.0.1)2> [Quit]
[root@localhost riak]# riak admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) riak@127.0.0.1 |valid |  up   |100.0|  --   |
+--------------------+------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
[root@localhost riak]# riak attach
Attaching to /tmp/erl_pipes/riak@127.0.0.1/erlang.pipe.1 (^D to exit)

(riak@127.0.0.1)2> erlang:system_info(atom_count).
52656
```

Although such a small increment should not cause any issues on its own, when `riak admin status` is polled around the clock it slowly adds up until the count finally hits the 1 million atom mark (the default Erlang atom table limit is 1,048,576) and Riak crashes. The current workaround is to restart Riak before the atom count gets too high.
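To get a rough feel for how long a node can run before a restart is needed, here is a minimal back-of-the-envelope sketch in Python. It assumes the growth seen in the example above (2 atoms per `riak admin` call, starting from 52654) stays constant and that the default limit of 1,048,576 atoms applies. The 60-second polling interval and the function names are hypothetical, chosen only for illustration; real growth per call may differ.

```python
DEFAULT_ATOM_LIMIT = 1_048_576  # Erlang's default atom table size


def calls_until_limit(current_atoms, atoms_per_call, limit=DEFAULT_ATOM_LIMIT):
    """Number of further calls before the atom table is full."""
    return (limit - current_atoms) // atoms_per_call


def days_until_limit(current_atoms, atoms_per_call, poll_seconds,
                     limit=DEFAULT_ATOM_LIMIT):
    """Days of continuous polling before the atom table is full."""
    calls = calls_until_limit(current_atoms, atoms_per_call, limit)
    return calls * poll_seconds / 86_400  # 86,400 seconds per day


# Figures from the example: 52654 atoms, +2 per call.
# The 60 s polling interval is an assumption.
print(calls_until_limit(52654, 2))                # 497961
print(round(days_until_limit(52654, 2, 60), 1))   # 345.8
```

Under these assumptions a once-a-minute poll would take most of a year to exhaust the table. Shorter intervals or several monitoring commands per cycle shorten that proportionally.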

martincox commented 3 years ago

Ahh interesting. I'm sure I've heard this same problem talked about before.

Sounds like it might be something along the lines of using `list_to_atom/1` when creating a random maint shell name, which I think would occur every time `riak admin` is called.
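The mechanism suggested here can be illustrated with a toy model. This is a Python sketch, not Erlang VM code: atoms are never garbage collected, so a table that only grows is a fair stand-in. The `AtomTable` class, the `simulate` helper, and the `maint<N>@127.0.0.1` name pattern are all hypothetical and are not taken from relx or Riak.

```python
import itertools


class AtomTable:
    """Toy model of an append-only atom table: entries are never removed."""

    def __init__(self, limit=1_048_576):
        self.limit = limit
        self._atoms = set()

    def intern(self, name):
        # A name that is already known costs nothing; a new one takes a slot.
        if name not in self._atoms:
            if len(self._atoms) >= self.limit:
                raise RuntimeError("atom table limit reached")
            self._atoms.add(name)
        return name

    def count(self):
        return len(self._atoms)


def simulate(calls, unique_names):
    """Intern one node name per call and return the resulting atom count."""
    table = AtomTable()
    suffixes = itertools.count(1000)  # stand-in for a per-call random/pid suffix
    for _ in range(calls):
        suffix = next(suffixes) if unique_names else 0
        table.intern(f"maint{suffix}@127.0.0.1")  # hypothetical name pattern
    return table.count()


print(simulate(1000, unique_names=True))   # 1000
print(simulate(1000, unique_names=False))  # 1
```

With a fresh name per call the table grows by one entry every time, whereas a fixed name is interned only once. That is why a per-call unique node name would leak atoms on the long-running node.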

martincox commented 3 years ago

Ignore me, I didn't read properly. I see that it's already been dug into and the guilty code found and fixed.

Bob-The-Marauder commented 3 years ago

We made these changes locally and, although there does seem to be some improvement, they do not fix the issue. We're currently trying to find the remaining source of the atom growth.