chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.78k stars 418 forks source link

Longest Common Subsequence #11433

Closed buddha314 closed 5 years ago

buddha314 commented 5 years ago

Following my habit of using issues as feature requests.

I came across this commit for longest common subsequence and I would love to see this pushed into the string library or other basic text processing library with some of the functionality from python's difflib specifically SequenceMatcher()

bradcray commented 5 years ago

Someone who reviewed the code in question (@ben-albrecht, @benharsh?) should correct me if I'm wrong, but I think the implementation in that test is exponential in nature (i.e., will start to take forever quickly as string sizes grow) and that we'd probably want to see a dynamic programming implementation before putting it into a blessed library.

buddha314 commented 5 years ago

blessed? So mote it be... Nice theming!

ben-albrecht commented 5 years ago

@bradcray - That sounds right to me. It probably wouldn't be too hard to write a dynamic programming implementation.

For anyone interested in this good first issue, here's a Chapel implementation derived from a python implementation to start:

proc lcs(a: string, b: string): string {
  var lengths: [1..a.size+1, 1..b.size+1] int;
  for (i, x) in zip(1..a.size, a) {
    for (j, y) in zip(1..b.size, b) {
      if x == y {
        lengths[i+1, j+1] = lengths[i, j] + 1;
      } else {
        lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1]);
      }
    }
  }

  var result = '';

  var x = a.size+1,
      y = b.size+1;

  while x != 1 && y != 1 {
    if lengths[x, y] == lengths[x-1, y] then x -= 1;
    else if lengths[x, y] == lengths[x, y-1] then y -= 1;
    else {
      assert(a[x-1] == b[y-1]);
      result = a[x-1] + result;
      x -= 1;
      y -= 1;
    }
  }
  return result;
}

// Tests
writeln(lcs('1234', '1224533324')); // 1234
writeln(lcs('thisisatest', 'testing123testing')); // tsitest

I'm not sure what standard module this would fit into - maybe it ought to live in a mason package to begin with.

daviditen commented 5 years ago

I wrote something like this a long time ago. test/trivial/diten/lcs/lcs.chpl. It could probably use some cleaning up and modernization if we wanted to make it an official thing.

ankugupt commented 5 years ago

I tried writing code for common subsequence whose algorithm looks similar to @ben-albrecht . Kindly take a look at it .

`proc input_string(out str1: string, out str2: string) { var infile=open(filename, iomode.r).reader();

infile.read(str1); infile.read(str2); infile.close(); }

proc lcs ( str1:string,str2:string,len1,len2) { var L:[0..len1,0..len2] int;

for i in 0..len1
{
    for j in 0..len2
    {
        if i==0 or j==0 
        {
            L[i,j]=0;
        }
        elif str1[i-1]==str2[j-1]
        {
            L[i,j]=L[i-1,j-1]+1 ;
        }
        else
        {
        L[i][j] = max(L[i-1][j], L[i][j-1])  
        }

    }
}
var index : int = L[len1][len2];

var lcs[index+1] : string;
lcs[index] = '';

var m:int = len1;
var n:int = len2;

while m > 0 and n > 0 

{

  if str1[m-1] == str2[n-1]) 
  { 
      lcs[index-1] = str1[i-1];  
      m-=1; n-=1; index-=1;     
  } 

  elif  (L[m-1][n] > L[m][n-1]) 
     m-=1;
  else
     n-=1; 

}

return lcs;

}

proc main() {

config var filename="commonsubs";

var str1,str2 : string ;
var len1,len2 : int ;

input_string(str1,str2);

len1= str1.size;
len2= str2.size;

writeln(lcs(str1,str2,len1,len2));

}`

ben-albrecht commented 5 years ago

Hi @ankugupt - thanks for your interest. I expect this issue to be closed by the addition of the StringUtils package to the mason registry.

If you think your implementation is an improvement over that one, I encourage you to contribute to the StringUtils package: https://github.com/AnubhavUjjawal/Chapel-StringUtils