A study of different Longest Common Sub-sequence algorithms with optimizations.
This project accomplishes two different goals:
lcs
module)lcs-algorithm
module)An LCS algorithm is a procedure to recognize the Longest Common Subsequence from two given sequences. The result of the calculation can be any of these:
Notice that the LCS is not unique, there could be many different sub-sequences of the same (maximum) size.
The module containing the optimized algorithms will be soon released on maven central.
The project is divided in 4 (maven) modules:
lcs-helper
defines some common interfaceslcs-test-util
contains a test suite to be sure that every
algorithm is correct.lcs-algorithm
contains the implementation of many different
LCS algorithms. There is also a performance test to compare them
under different conditions.lcs
contains performance-wise implementations that can be imported
and used separately from the other modules of the project.The algorithms included are (along with some variants and optimizations):
There is also a very fast implementation of the Levenshtein distance algorithm.
Different algorithms perform differently depending on the input. Myers' algorithms perform very well when the sequences are similar (shares the majority of the elements) but their memory usage and execution time grows dramatically when the sequences differ. The scoretable algorithms always execute the same number of operations independently from the similarity of the input sequences and they are particularly performant when the sequences are small. The Hirschberg algorithm is a linear space improvement over the Wagner-Fischer algorithm which has quadratic space usage. It's slower than the Wagner-Fischer but a good choice if the sequences are different and long.
All the presented tests are executed with [AlgorithmPerformanceTest] (lcs-algorithms/src/test/java/com/fillumina/lcs/algorithm/performance/AlgorithmsPerformanceTest.java).
TOTAL=600
, LCS=400
)SmithWaterman 0 : 1.97 ms 74.90 %
WagnerFischer 1 : 2.14 ms 81.50 %
OptimizedHirschbergLinearSpace 2 : 2.63 ms 100.00 %
OptimizedMyersLcs 3 : 0.85 ms 32.56 %
OptimizedLinearSpaceLcs 4 : 1.00 ms 38.28 %
When the sequences are similar and not very long the Myers algorithm is the clear winner. Note that the linear space variant isn't that slower. The performance of the scoretable algorithms doesn't depend on the similarities.
TOTAL=6000
, LCS=4000
)SmithWaterman 0 : 251.23 ms 61.32 %
WagnerFischer 1 : 409.72 ms 100.00 %
OptimizedHirschbergLinearSpace 2 : 250.73 ms 61.20 %
OptimizedMyersLcs 3 : 89.51 ms 21.85 %
OptimizedLinearSpaceLcs 4 : 92.15 ms 22.49 %
When the length of the sequences grows the memory access become more costly and the performances of the Hirschberg algorithm starts to improve in respect to the scoretable algorithms. Myers algorithms are still the best but the performances of the linear space algorithm is starting to pair with those of the simple algorithm. For longer sequences the simple Myers algorithm cannot be used because of the enormous amount of memory required.
TOTAL=600
, LCS=4
)SmithWaterman 0 : 2.43 ms 36.44 %
WagnerFischer 1 : 2.22 ms 33.35 %
OptimizedHirschbergLinearSpace 2 : 2.76 ms 41.30 %
OptimizedMyersLcs 3 : 6.02 ms 90.11 %
OptimizedLinearSpaceLcs 4 : 6.68 ms 100.00 %
When the sequences are different (share very few elements) the results change dramatically: the scoretable algorithms (Smith-Waterman and Wagner-Fischer) stay stable while the Myers algorithms become much slower.
TOTAL=6000
, LCS=40
)SmithWaterman 0 : 218.84 ms 31.72 %
WagnerFischer 1 : 293.51 ms 42.54 %
OptimizedHirschbergLinearSpace 2 : 238.50 ms 34.57 %
OptimizedMyersLcs 3 : 601.26 ms 87.15 %
OptimizedLinearSpaceLcs 4 : 689.91 ms 100.00 %
When length of the sequences increase the Hirshberg algorithm (which uses linear space) has an edge over the other scoretable algorithms (Smith-Waterman and Wagner-Fischer). For longer different sequences is the clear winner.
TOTAL=30
, LCS=2
)SmithWaterman 0 : 6.17 us 40.80 %
WagnerFischer 1 : 5.39 us 35.68 %
OptimizedHirschbergLinearSpace 2 : 7.95 us 52.56 %
OptimizedMyersLcs 3 : 14.82 us 98.00 %
OptimizedLinearSpaceLcs 4 : 15.13 us 100.00 %
In case of small different sequences the Wagner-Fischer algorithm performs very well because most of its (very simple) calculations are performed on the cache.
TOTAL=30
, LCS=20
)SmithWaterman 0 : 6.28 us 73.83 %
WagnerFischer 1 : 5.72 us 67.23 %
OptimizedHirschbergLinearSpace 2 : 8.51 us 100.00 %
OptimizedMyersLcs 3 : 3.63 us 42.69 %
OptimizedLinearSpaceLcs 4 : 4.52 us 53.18 %
The Myers algorithm is always first when the sequences are similar.
TOTAL=60
, LCS=40
)SmithWaterman 0 : 21.51 us 72.52 %
WagnerFischer 1 : 19.29 us 65.05 %
OptimizedHirschbergLinearSpace 2 : 29.66 us 100.00 %
OptimizedMyersLcs 3 : 11.40 us 38.44 %
OptimizedLinearSpaceLcs 4 : 13.78 us 46.47 %
The linear space Myers algorithm perform a little better now than with smaller sequences.
TOTAL=60
, LCS=4
)SmithWaterman 0 : 22.92 us 33.70 %
WagnerFischer 1 : 19.14 us 28.14 %
OptimizedHirschbergLinearSpace 2 : 28.23 us 41.50 %
OptimizedMyersLcs 3 : 57.46 us 84.45 %
OptimizedLinearSpaceLcs 4 : 68.03 us 100.00 %
few elements (<50) | many elements | |
---|---|---|
similar | Myers | Linear Space Myers |
different | Wagner-Fischer | Hirschberg |