apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Improve log recovery times #4505

Open keith-turner opened 2 months ago

keith-turner commented 2 months ago

Is your feature request related to a problem? Please describe.

Write ahead log recovery can take a while because of the following two behaviors.

Those behaviors make log recovery times correlate with the number of tablets per tserver. So as the number of tablets per tserver increases, log recovery time increases.

Describe the solution you'd like

Allow parallel log recovery and faster log recovery. The parallelism is related to #4429, but that change does not completely solve the issue as the lock is still acquired for log recovery.

Describe alternatives you've considered

Could potentially produce an F file for log recovery outside of the tablet server somewhere (similar to external compactions). This may have been discussed on an elasticity related issue, but I could not find it. This would be a much larger change and probably would be suitable to do in 2.1. It may require completely refactoring the tablet minor compaction code to make it usable elsewhere.

ctubbsii commented 2 months ago

and probably would be suitable to do in 2.1

Did you mean "would not be"?

dlmarion commented 2 months ago

@keith-turner - you might be thinking of #4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.

keith-turner commented 2 months ago

@keith-turner - you might be thinking of https://github.com/apache/accumulo/pull/4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.

That change could speed up log sorting. The problem in this issue happens after the logs are sorted, when tablets w/ sorted walogs are loaded on a tablet server. Tablet servers only load one tablet w/ walogs at a time, which is what makes things slow.
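
To make the one-at-a-time bottleneck concrete, here is a minimal, self-contained sketch. It is not Accumulo code: `RecoverySketch`, `recoverTablet`, `loadAll`, and the `Thread.sleep` standing in for walog replay are all hypothetical, and the semaphore is only an assumed stand-in for whatever lock the tablet server actually holds during recovery. With `maxConcurrent = 1` it models the serial behavior described above; a larger value models bounded parallel recovery.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation only; none of these names exist in Accumulo.
class RecoverySketch {

  /** Outcome of one simulated run. */
  static final class Result {
    final long elapsedMillis;
    final int peakConcurrent;

    Result(long elapsedMillis, int peakConcurrent) {
      this.elapsedMillis = elapsedMillis;
      this.peakConcurrent = peakConcurrent;
    }
  }

  // Stand-in for replaying one tablet's sorted walogs (assumption: fixed cost).
  static void recoverTablet(String tablet, long millis) throws InterruptedException {
    Thread.sleep(millis);
  }

  // Loads every tablet while allowing at most maxConcurrent recoveries at once.
  // maxConcurrent = 1 models the current one-tablet-at-a-time behavior.
  static Result loadAll(List<String> tablets, int maxConcurrent, long millisPerTablet)
      throws InterruptedException {
    Semaphore permits = new Semaphore(maxConcurrent);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tablets.size()));
    long start = System.nanoTime();
    for (String tablet : tablets) {
      pool.submit(() -> {
        permits.acquire(); // wait for a recovery slot
        try {
          // Record the highest number of recoveries seen running together.
          peak.accumulateAndGet(running.incrementAndGet(), Math::max);
          recoverTablet(tablet, millisPerTablet);
        } finally {
          running.decrementAndGet();
          permits.release();
        }
        return null; // returning a value makes this lambda a Callable
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    return new Result(elapsed, peak.get());
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> tablets = Arrays.asList("t1", "t2", "t3", "t4");
    Result serial = loadAll(tablets, 1, 50);
    Result parallel = loadAll(tablets, 4, 50);
    System.out.println("serial ms: " + serial.elapsedMillis + ", peak: " + serial.peakConcurrent);
    System.out.println("parallel ms: " + parallel.elapsedMillis + ", peak: " + parallel.peakConcurrent);
  }
}
```

In this toy model the serial run takes roughly the number of tablets times the per-tablet cost, which is the correlation with tablets per tserver noted at the top of the issue. Raising the permit count shortens the wall-clock time only to the extent that real recoveries do not contend on shared resources, which the sketch does not model.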