Output interpretation

bukosabino commented 6 years ago

Hi @felipelouza ,

Finally, I have run the library on a Linux machine :)

I am not sure I am interpreting the normal output of this library correctly, because I get a larger LCS size with k=50 than with k=5. What does "size" mean in the output?

k=5

ubuntu@ip-172-31-32-99:~/egsa/egsa$ ./egsa  dataset/input-100.txt 5
SIGMA = 255
DIR = dataset/
INPUT = input-100.txt
K = 5
MEMLIMIT = 2048.00 MB
CHECK = 0
COMPUTE_BWT = 0
WORKSPACE = 13.n bytes
### PREPROCESSING ###
K = 5
PARTITIONS = 1
TOTAL = 286 bytes       0.00 MB
CLOCK = 0.000272 TIME = 0.000000
0.000272        0.000000
### PHASE 1 ###
CLOCK = 0.000125 TIME = 0.000000
0.000125        0.000000
### PHASE 2 ###
INDUCING:
alfa    TOTAL   INDUCED %:
ALL)    285     98      34.39
CLOCK = 0.004332 TIME = 0.000000
0.004332        0.000000
### TOTAL ###
CLOCK = 0.004495 TIME = 0.000000
0.004495        0.000000
milisecond per byte = 0.000000000
0.000000000
size = 285
malloc_count ### exiting, total: 1,158,870,124, peak: 1,158,641,041, current: 1,033

k=50

ubuntu@ip-172-31-32-99:~/egsa/egsa$ ./egsa  dataset/input-100.txt 50
SIGMA = 255
DIR = dataset/
INPUT = input-100.txt
K = 50
MEMLIMIT = 2048.00 MB
CHECK = 0
COMPUTE_BWT = 0
WORKSPACE = 13.n bytes
### PREPROCESSING ###
K = 50
PARTITIONS = 1
TOTAL = 2848 bytes      0.00 MB
CLOCK = 0.000360 TIME = 0.000000
0.000360        0.000000
### PHASE 1 ###
CLOCK = 0.000612 TIME = 0.000000
0.000612        0.000000
### PHASE 2 ###
INDUCING:
alfa    TOTAL   INDUCED %:
ALL)    2847    1403    49.28
CLOCK = 0.005419 TIME = 0.000000
0.005419        0.000000
### TOTAL ###
CLOCK = 0.006064 TIME = 0.000000
0.006064        0.000000
milisecond per byte = 0.000000000
0.000000000
size = 2847
malloc_count ### exiting, total: 1,159,007,790, peak: 1,158,692,569, current: 1,033

My problem is to find the k-LCS of n strings (2 <= k <= n). So the LCS value for k=5 should be >= the value for k=50.
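To make the expectation above concrete, here is a minimal brute-force sketch. It assumes that "k-LCS" means the longest substring occurring in at least k of the n input strings; the function name `k_lcs_bruteforce` and the sample strings are hypothetical and have nothing to do with egsa itself.

```python
def k_lcs_bruteforce(strings, k):
    """Longest substring occurring in at least k of the strings (naive sketch)."""
    best = ""
    # Every candidate is a substring of some input string.
    for s in strings:
        for i in range(len(s)):
            # Only try candidates longer than the best one found so far.
            for j in range(i + len(best) + 1, len(s) + 1):
                cand = s[i:j]
                if sum(cand in t for t in strings) >= k:
                    best = cand
                else:
                    # Any extension of cand contains cand, so it cannot
                    # occur in more strings than cand does.
                    break
    return best


# Hypothetical sample data, only for illustration.
sample = ["abcde", "xbcdy", "zzcdz", "cd"]
lengths = [len(k_lcs_bruteforce(sample, k)) for k in (2, 3, 4)]
print(lengths)  # [3, 2, 2]
```

Under that definition the length can only stay the same or shrink as k grows, which is why a larger value for k=50 than for k=5 looks suspicious. This sketch is far too slow for a large collection; it only pins down the expected behaviour.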

felipelouza commented 6 years ago

Hi @bukosabino, if you want to see the average LCP, run the check procedure with -c: ./egsa dataset/input-100.txt 5 -c. Also, it does not compute the Longest Common Substring (LCS); the check reports the average value in the Longest Common Prefix (LCP) array.
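For readers unfamiliar with the LCP array, the following is a minimal sketch of what it contains and why its average is not an LCS. It uses a naive construction on a single string, not egsa's external-memory algorithm on a collection; the function name `suffix_array_and_lcp` is hypothetical.

```python
def suffix_array_and_lcp(text):
    """Naive suffix array and LCP array of one string (illustration only)."""
    # Suffix array: start positions sorted by the suffix they begin.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # lcp[r] = length of the common prefix of the suffixes at ranks r-1 and r.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = text[sa[r - 1]:], text[sa[r]:]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[r] = n
    return sa, lcp


sa, lcp = suffix_array_and_lcp("banana")
print(sa)                   # [5, 3, 1, 0, 4, 2]
print(lcp)                  # [0, 1, 3, 0, 0, 2]
print(sum(lcp) / len(lcp))  # 1.0
```

The average (1.0 here) summarizes how similar lexicographically adjacent suffixes are overall. It is a different quantity from the length of any particular common substring.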

bukosabino commented 6 years ago

So, do you think this library solves a different problem from the one I need?

How does this library relate to this paper: https://link.springer.com/article/10.1007/s00453-009-9369-1?

Best,

felipelouza commented 6 years ago

Yes, I think so. It only computes the data structures used by this paper to compute LCSs. If you want to compare strings (using another distance measure), I have implemented this tool: https://github.com/felipelouza/bwsd Best!
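As a rough illustration of how such data structures can be used for k-LCS, here is a sketch that assumes they are a generalized suffix array plus its LCP array, and that k-LCS means the longest substring shared by at least k strings. It builds both structures naively in memory and then runs a simple sliding-window scan; this is not the linear-time algorithm from the paper and not egsa's API, and all names (`k_lcs_from_lcp`, `common_prefix_len`) are hypothetical.

```python
from collections import Counter


def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def k_lcs_from_lcp(strings, k):
    """k-LCS via a naive generalized suffix array + LCP array (sketch)."""
    if not 2 <= k <= len(strings):
        raise ValueError("expected 2 <= k <= number of strings")
    # Generalized suffix array: (string id, start) pairs sorted by suffix.
    gsa = sorted(
        ((d, i) for d, s in enumerate(strings) for i in range(len(s))),
        key=lambda p: strings[p[0]][p[1]:],
    )
    # lcp[r] = common prefix length of the suffixes at ranks r-1 and r.
    lcp = [0] * len(gsa)
    for r in range(1, len(gsa)):
        (d1, i1), (d2, i2) = gsa[r - 1], gsa[r]
        lcp[r] = common_prefix_len(strings[d1][i1:], strings[d2][i2:])

    best = ""
    seen = Counter()  # string id -> number of its suffixes in the window
    lo = 0
    for hi, (d, i) in enumerate(gsa):
        seen[d] += 1
        # Shrink from the left while the window still covers k strings.
        while len(seen) >= k:
            d_lo = gsa[lo][0]
            if seen[d_lo] == 1 and len(seen) == k:
                break  # dropping gsa[lo] would lose a required string
            seen[d_lo] -= 1
            if seen[d_lo] == 0:
                del seen[d_lo]
            lo += 1
        if len(seen) >= k:
            # All suffixes in ranks lo..hi share this many leading characters.
            length = min(lcp[lo + 1:hi + 1])
            if length > len(best):
                best = strings[d][i:i + length]
    return best


# Hypothetical sample data, only for illustration.
collection = ["abcde", "xbcdy", "zzcdz", "cd"]
print(k_lcs_from_lcp(collection, 2))  # bcd
print(k_lcs_from_lcp(collection, 3))  # cd
```

The scan relies on the fact that suffixes sharing a prefix are contiguous in sorted order, so the minimum LCP over a window of ranks is the length of the prefix common to every suffix in that window. Recomputing the minimum for each window keeps the code short but is not efficient; a real implementation would maintain it incrementally.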

bukosabino commented 6 years ago

Cool library, but I don't need a Burrows-Wheeler measure at the moment.

I need to calculate the k-LCS of a big string collection. I use this library: https://github.com/ptrus/suffix-trees. But I have performance problems because I have a lot of strings and they are large :(

This is why I read this paper https://link.springer.com/article/10.1007/s00453-009-9369-1 and found your repo. What would you recommend?

Good job sharing code Felipe!

felipelouza commented 6 years ago

I see.

Here you can find the implementation for the paper you mentioned: https://www.uni-ulm.de/in/theo/research/seqana.html Also, I know of this related repository: https://github.com/giovannarosone/cLCP-mACS

Best!

felipelouza/egsa, issue #5: Output interpretation