martinsumner closed 5 years ago
The timeouts seen in the iclerk look like they were related to memory exhaustion. As part of RC2, the way a level 0 SST file fetches its data changed, meaning the data had to be kept on the loop state in the starting state. In the reader state, this data was removed from the loop state.
L0 files are normally short-lived. However, in handoff they are more likely to be switched rather than merged to clear L0. In this case, the L0 SST file can become long-lived. The file process occupied about 50M of memory, more than 95% of which was uncollected garbage (the data that had since been removed from the loop state).
This could lead to a rapid accumulation of memory in the beam.
To avoid this, on switching a level zero file, the file process now calls garbage_collect(self()). This leads to smoother growth in riak memory, and means that future GC-related pauses (and hence timeouts on function calls) are less likely.
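
The effect described above can be reproduced with a minimal sketch. This is not leveled's code: the module name l0_gc_sketch and the functions starting/1, switch_to_reader/2, reader/0 and heap_bytes/0 are hypothetical stand-ins for an L0 file process. The process holds a large term on its loop state, drops it on a switch message, and measures its own memory before and after an explicit garbage_collect(self()). It only assumes the standard erlang:process_info/2 and erlang:garbage_collect/1 BIFs.

```erlang
-module(l0_gc_sketch).
-export([demo/0]).

%% Hypothetical stand-in for an L0 SST file process; none of these
%% function names come from leveled itself.
demo() ->
    %% The data is built inside the spawned process, so it lives on
    %% that process's own heap.
    Pid = spawn(fun() -> starting(build_data()) end),
    Pid ! {switch, self()},
    receive
        {switched, Pid, Count, BeforeGC, AfterGC} ->
            Pid ! stop,
            {Count, BeforeGC, AfterGC}
    after 5000 ->
        exit(Pid, kill),
        timeout
    end.

build_data() ->
    [{N, N * 2} || N <- lists:seq(1, 200000)].

%% Starting state: the fetched data is kept on the loop state.
starting(Data) ->
    receive
        {switch, From} ->
            %% Tail call without Data: from here on the data is
            %% unreachable, but still occupies the process heap.
            switch_to_reader(From, length(Data))
    end.

switch_to_reader(From, Count) ->
    BeforeGC = heap_bytes(),
    %% Force a collection so a long-lived process does not keep the
    %% dropped data around as garbage.
    true = erlang:garbage_collect(self()),
    AfterGC = heap_bytes(),
    From ! {switched, self(), Count, BeforeGC, AfterGC},
    reader().

%% Reader state: nothing large is held on the loop state.
reader() ->
    receive
        stop -> ok
    end.

%% Memory used by the calling process, in bytes.
heap_bytes() ->
    {memory, Bytes} = erlang:process_info(self(), memory),
    Bytes.
```

Calling l0_gc_sketch:demo() returns the element count together with the process memory before and after the collection. The second figure should be much smaller than the first, since an otherwise idle process has no reason to run a collection by itself once the data has been dropped.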
Leveled issue. Under significant handoff load, the iclerk might fail due to a timeout on get_positions, and this would result in frequent vnode crashes.