Suffix array from multi-character tokens

felipelouza / gsa-is

Inducing enhanced suffix arrays for string collections [DCC'16, TCS 2017]

https://doi.org/10.1016/j.tcs.2017.03.039

MIT License

24 stars 5 forks source link

Suffix array from multi-character tokens #3

Closed dhowe closed 2 months ago

dhowe commented 3 months ago

I need to build a suffix array on multi-character tokens (basically 'words') rather than characters. In my first attempt I created a sorted 'alphabet' of unique tokens from the input, and assigned each an integer, then passed the string of these integer indexes, joined with a separator into the algorithm. This works, but require lots of string joins/splits, and there are additional suffixes computed that start with the separator which is not ideal.

I'm wondering if there a better way to do this, perhaps computing the SA directly on the array rather than on a string ??

A toy example given the input array [ "if", "it", "is", "true", "it", "is", "true"].

[
    0: if it is true it is true
    1: is true
    2: is true it is true
    3: it is true
    4: it is true it is true
    5: true
    6: true it is true
]

moved from: https://github.com/felipelouza/gsufsort/issues/5

felipelouza commented 3 months ago

Hi Daniel, perhaps the simplest solution is to concatenate your input set ["if", "it", "is", "true", "it", "is", "true"] into a single string S[0,n-1], mark positions you are interested (for words, the initial symbol) with a bitvector B[0,n-1], such as:

S = if it is true it is true B = 100100100100001001001000

Then compute the suffix array and scan it to "extract" your output,

j = 0;
for i=0, 1,..., n-1{
  if(B[SA[i]] == 1)
    SA'[j++] = rank_1(SA[i]);
}

Gotcha?

dhowe commented 3 months ago

Doesn't this involve extra work (and extra space) to compute all the suffixes for the entire string, when all we need are a subset and we ignore the rest ? Using the same example:

[
  e,
  eitistrue,
  fitistrueitistrue,
  ifitistrueitistrue,
  istrue,
  istrueitistrue,
  itistrue,
  itistrueitistrue,
  rue,
  rueitistrue,
  strue,
  strueitistrue,
  tistrue,
  tistrueitistrue,
  true,
  trueitistrue,
  ue,
  ueitistrue
]

[
    if it is true it is true
    is true
    is true it is true
    it is true
    it is true it is true
    true
    true it is true
]

felipelouza commented 3 months ago

Yes, it does. The running time is still linear to the input size.