basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Develop 2.9 #957

Closed martinsumner closed 5 years ago

martinsumner commented 5 years ago

Leveled issue. Under significant handoff load, the iclerk might fail due to a timeout on get_positions, and this would result in frequent vnode crashes.

martinsumner commented 5 years ago

The timeouts seen in the iclerk look like they were related to memory exhaustion. As part of RC2, the way a level 0 SST file fetches its data changed, meaning the data had to be kept on the loop state while in the starting state. On moving to the reader state, this data was removed from the loop state.

L0 files are normally short-lived. However, in handoff they are more likely to be switched than merged to clear L0. In this case, the L0 SST file can become long-lived. The file occupied about 50M of memory, more than 95% of which was uncollected garbage (the data that had since been removed from the loop state).

This could lead to a rapid accumulation of memory in the beam.

To avoid this, on switching a level zero file, the file now calls garbage_collect(self()). This leads to smoother growth in Riak memory, and means that future GC-related pauses (and hence timeouts on function calls) are less likely.
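
As a rough illustration of the effect described above, here is a minimal sketch and not the leveled code. The module `l0_gc_sketch`, its functions, and the `switch`/`switched`/`ready`/`stop` messages are all hypothetical, and a plain process stands in for the SST file's state machine. It assumes only that a process which drops a large term from its loop state keeps the memory until a collection runs, and that `erlang:garbage_collect/1` forces one.

```erlang
-module(l0_gc_sketch).
-export([demo/1]).

%% Hypothetical stand-in for an L0 SST file process. This is NOT the
%% leveled implementation; all names and messages here are invented.

%% demo(true) collects on the switch, demo(false) does not.
%% Returns {BytesBeforeSwitch, BytesAfterSwitch} for the file process.
demo(CollectOnSwitch) ->
    Parent = self(),
    Pid = spawn(fun() ->
        %% "starting" state: the level zero data lives in the loop state
        Data = lists:seq(1, 1000000),
        Parent ! {ready, self()},
        starting(Data, CollectOnSwitch)
    end),
    receive {ready, Pid} -> ok end,
    {memory, Before} = erlang:process_info(Pid, memory),
    Pid ! {switch, Parent},
    receive {switched, Pid} -> ok end,
    {memory, After} = erlang:process_info(Pid, memory),
    Pid ! stop,
    {Before, After}.

starting(Data, CollectOnSwitch) ->
    receive
        {switch, From} ->
            %% Keep only a small summary; the tail call below drops Data
            Count = length(Data),
            enter_reader(From, Count, CollectOnSwitch)
    end.

%% Transition into the "reader" state, optionally forcing a collection
%% so the dropped data is reclaimed now rather than at some later GC.
enter_reader(From, Count, true) ->
    erlang:garbage_collect(self()),
    From ! {switched, self()},
    reader(Count);
enter_reader(From, Count, false) ->
    From ! {switched, self()},
    reader(Count).

reader(Count) ->
    receive
        stop -> ok;
        _Other -> reader(Count)
    end.
```

Comparing `l0_gc_sketch:demo(false)` with `l0_gc_sketch:demo(true)` in a shell should show the second element of the tuple dropping sharply only when the collection is forced; the exact figures depend on the VM and its settings.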