Closed dhowe closed 2 months ago
Hi Daniel, perhaps the simplest solution is to concatenate your input set ["if", "it", "is", "true", "it", "is", "true"]
into a single string S[0,n-1]
, mark positions you are interested (for words, the initial symbol) with a bitvector B[0,n-1]
, such as:
S = if it is true it is true
B = 100100100100001001001000
Then compute the suffix array and scan it to "extract" your output,
j = 0;
for i=0, 1,..., n-1{
if(B[SA[i]] == 1)
SA'[j++] = rank_1(SA[i]);
}
Gotcha?
Doesn't this involve extra work (and extra space) to compute all the suffixes for the entire string, when all we need are a subset and we ignore the rest ? Using the same example:
[
e,
eitistrue,
fitistrueitistrue,
ifitistrueitistrue,
istrue,
istrueitistrue,
itistrue,
itistrueitistrue,
rue,
rueitistrue,
strue,
strueitistrue,
tistrue,
tistrueitistrue,
true,
trueitistrue,
ue,
ueitistrue
]
VS
[
if it is true it is true
is true
is true it is true
it is true
it is true it is true
true
true it is true
]
Yes, it does. The running time is still linear to the input size.
I need to build a suffix array on multi-character tokens (basically 'words') rather than characters. In my first attempt I created a sorted 'alphabet' of unique tokens from the input, and assigned each an integer, then passed the string of these integer indexes, joined with a separator into the algorithm. This works, but require lots of string joins/splits, and there are additional suffixes computed that start with the separator which is not ideal.
I'm wondering if there a better way to do this, perhaps computing the SA directly on the array rather than on a string ??
A toy example given the input array
[ "if", "it", "is", "true", "it", "is", "true"]
.moved from: https://github.com/felipelouza/gsufsort/issues/5