felipelouza / egsa

Generalized enhanced suffix array construction in external memory [CPM'13, AMB 2017]
https://doi.org/10.1186/s13015-017-0117-9
GNU General Public License v3.0
15 stars 2 forks source link

How do we read max-length LCP text? #6

Closed antmarakis closed 4 years ago

antmarakis commented 4 years ago

Hi!

This is an excellent library, thanks for posting!

I am wondering, is there a way to get the actual max-length LCP text? When I ran the code I can see the maximum length, but not the actual LCP. Is there a way to get that done?

I am afraid I do not have much experience with advanced C/C++, so any help is greatly appreciated! Thanks!

felipelouza commented 4 years ago

Hi @antmarakis, thanks for your message.

Since max-length LCP is given by a debug option in egsa, I don't think it is a good idea adding this feature here.

But, you could use gsufsort tool: https://github.com/felipelouza/gsufsort. I just created the option --lcp_max_text, which gives you the max-length LCP text.

Example:

/gsufsort dataset/input.txt --lcp --lcp_max --lcp_max_text dataset/input.txt.4.lcp 20199320 bytes (n = 5049830) LCP max: 71 Longest LCP: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Note that gsufsort tool is not external memory. It needs 9n bytes to compute SA and LCP array.

Please, let me know if you need any help.

antmarakis commented 4 years ago

Hi @felipelouza, that solved my problem, thank you!

I am wondering, do you know how I could go about computing the LCS from a Suffix Array? Would your gsufsort library be useful in this regard? I didn't have much time to take a proper look yet, I am afraid.

Thanks!

felipelouza commented 4 years ago

Hi again,

Yes, it is easy to compute the LCP between two strings T_1 and T_2, given SA, LCP and DA. Have a look at Section 5.4.2 of Ohlebusch's book: https://www.uni-ulm.de/in/theo/m/ohlebusch/book-bioinformatics-algorithms/

I think you can do that using simply the output (.sa, .lcp and .da) of gsufsort.

Sorry I cannot do this for you now, but let me know if you have problems.

Cheers!

antmarakis commented 4 years ago

I am going to check the book out, thanks for the suggestion! I think I should be able to figure it out, cheers!