IlyaGrebnov / libsais

libsais is a library for linear time suffix array, longest common prefix array and burrows wheeler transform construction based on induced sorting algorithm.
Apache License 2.0
180 stars 22 forks source link

Why not compute the LCP Array during the Suffix Array's construction ? #24

Closed Epitax closed 2 months ago

Epitax commented 2 months ago

Currently the path to computing the LCP Aray with libsais is: libsais(T,...) ==> libsais_plcp(T, SA,...) ==> libsais_lcp(PLCP, SA,...)

This takes an unnecessary amount of operations since the LCP Array can be computed during the Suffix Array's construction thus avoiding the recomparison of entire adjacent overlapping strings in the Suffix Array:

This can be done because whenever we place two S- or L-suffixes Si−1 and Sj−1 at adjacent places k−1 and k in the final Suffix Array (steps 3 and 4 in the SAIS algorithm), the length of their longest common prefix can be induced from the longest common prefix of the suffixes Si and Sj. Since the latter suffixes are exactly those that caused the inducing of Si−1 and Sj−1, we already know their LCP-value ℓ (by the order in which we fill the SA), and can hence set LCP[k] to ℓ + 1.

IlyaGrebnov commented 2 months ago

Based on my testing, if you only need the Permuted Longest Common Prefix (PLCP) array, Kasai's algorithm with the right optimizations is still faster.

Epitax commented 2 months ago

The subject of this issue is the computation of the LCP (not of the PLCP).

If only the LCP is needed, then computing it from the PLCP is not optimal. LCP can be computed without computing the PLCP first ...and it can be done in one go, during the SA's construction.

Epitax commented 2 months ago

Why did you close this issue without first responding to my observation that the subject of this issue is the computation of the LCP only (not of the PLCP) ?

IlyaGrebnov commented 2 months ago

I added support for PLCP/LCP while working on esa-matchfinder. This project only requires PLCP computation, and based on my testing, Kasai's algorithm with the right optimizations is still faster. Additionally, it is much simpler to parallelize.

I added PLCP to LCP conversion to complete the functionality with optimizations. The one sub-optimal part is that PLCP to LCP conversion requires additional memory. I agree with this limitation, and if you have the time to contribute changes, please feel free to submit a pull request. I would be happy to review it. That being said, I am not convinced that computing LCP during induction is a better approach. Induction is very data-dependent and will likely cause stalls, affecting IPC.