Open keith-turner opened 2 months ago
and probably would be suitable to do in 2.1
Did you mean "would not be"?
@keith-turner - you might be thinking of #4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.
That change could speed up log sorting. The problem in this issue happens after the logs are sorted, when tablets w/ sorted walogs are loaded on a tablet server. Tablet servers only load one tablet w/ walogs at a time, which is what makes things slow.
Is your feature request related to a problem? Please describe.
Write ahead log recovery can take a while because of the following two behaviors.
Those behaviors make log recovery times correlate with the number of tablets per tserver. So as the number of tablets per tserver increases, log recovery time increases.
Describe the solution you'd like
Allow parallel log recovery and faster log recovery. The parallelism is related to #4429, but that change does not completely solve the issue as the lock is still acquired for log recovery.
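To make the requested behavior concrete, here is a minimal sketch in Python. It is not Accumulo code, and every name in it (`recover_tablets`, `recover_one`, `max_concurrent`, `ConcurrencyProbe`) is hypothetical. It assumes per-tablet recovery can be modeled as an independent function call. With `max_concurrent=1` it behaves like the current one-at-a-time loading; a larger value lets several tablets recover at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def recover_tablets(tablets, recover_one, max_concurrent=1):
    """Recover every tablet, running at most max_concurrent recoveries at once.

    Hypothetical helper for illustration only; recover_one stands in for
    whatever per-tablet walog recovery actually does.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map keeps input order; list() waits for all recoveries
        # and re-raises the first failure, if any.
        return list(pool.map(recover_one, tablets))


class ConcurrencyProbe:
    """Stand-in recovery function that records peak concurrent recoveries."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, tablet):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            return f"recovered:{tablet}"
        finally:
            with self._lock:
                self.active -= 1
```

In this model the serialized case corresponds to a pool of size one, so total recovery time grows with the number of tablets on the server. Raising the limit only helps if no shared lock is held across the whole recovery, which is the part #4429 does not address.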
Describe alternatives you've considered
Could potentially produce an F file for log recovery outside of the tablet server somewhere (similar to external compactions). This may have been discussed on an elasticity related issue, but I could not find it. This would be a much larger change and probably would be suitable to do in 2.1. It may require completely refactoring the tablet minor compaction code to make it usable elsewhere.