Rerito / Penjing

C++20 Generalized Suffix Tree implementation
MIT License
12 stars 0 forks source link

Leaf functionality missing #3

Open nonbasketless opened 5 years ago

nonbasketless commented 5 years ago

Hello Rerito! Thank you for writing this elegant library.

I'm working on finding largest common substrings of relatively big data, and I'm suspecting the leaf functionality needs to be in place for this to work. Would it entail a lot of work? I'm a bit clueless, but if you point me in the right direction I'd be happy to do it myself and share.

Rerito commented 5 years ago

Hi! Thanks for your message!

Before I dig into the subject at hand, I suggest you take a look an my second go at suffix trees (requires C++17 for string_view): https://github.com/Rerito/suffix-tree-v2

Since you are looking for an algorithm to find the longest common substring from a generalized suffix tree, I will elaborate a bit on that. The first thing to note is that the longest common substring is guaranteed to be an explicit state of the tree, so it is directly represented by a node in the tree (you can prove that using a reductio a absurdum argument).

There is a top-down approach that goes like this

  1. Create a queue holding pairs of (nodes, depth) and push (root, 0) to it and initialize such a pair bestMatch with (root, 0)
  2. While the queue is not empty:
    1. Pop its head (node, depth)
    2. If depth is greater than the depth in bestMatch, replace bestMatch with (node, depth)
    3. For each child node (child, childDepth), if child is common to all the strings in the tree, push it to the queue

As you can notice, with this approach, you need a way to tell for each node the set of strings of which it represents a substring. This can be done by annotating each node with the IDs of the strings of which it is a substring, this annotation step can be performed at the end of the insertion and takes linear amortized time (assuming hash sets are used). Of course that annotation can be implemented as a unordered_map< node*, unordered_set< string_id > > of some kind on the side instead of being directly embedded inside a node.

The bottom up approach only requires annotating the leaves at the expense of a suboptimal theoretical complexity. I have not yet dived into that though...

nonbasketless commented 5 years ago

Well this is very thorough, thank you kindly for the explanation!

I think I have what I need to do it on my own then, but I'm more of a computational scientist and less of a computer scientist. Would you be interested in creating a fork with a function to find the longest common substring (or better, n longest common substrings) of two strings? Ideally it'd support strings of 8 bit characters all the way through 64 bit. I need this for a client and I can pay for it, I'm asking because I think you can do this faster than me. Email probably better, try my username here @gmail.com.

nonbasketless commented 5 years ago

Oh, and one more thing: I searched the internet high and low for a good generalized suffix tree library, and your v2 never came up (though I know now I could've found it through your user page here). I guess since it's linked from SO it's so much easier to find. So, if I were you, I'd add a note to the original experiment that v2 exists ;)

Rerito commented 5 years ago

That is an excellent suggestion I will indeed reference the v2 (which is better designed from a pure software engineering point of view: there are tests, you can drop it in your project and use the CMakeLists supplied...).

As for the Longest Common Substring algorithm, it is in my todo list and I have already thought of some of the design. I'll try to tackle this during my free time and cannot offer you a guaranteed time frame! However if time is critical to you, I'ld suggest you to use an actual professional grade library like SeqAn. However I did not find their implementation of the longest common substring algorithm.