When the two input string sets are the same, the time to tokenize the strings, calculate the similarity and the space required to store and return the results will be (at least) twice larger than necessary. It will be much more efficient to write self-join functions for each join algorithm.
When the two input string sets are the same, the time to tokenize the strings, calculate the similarity and the space required to store and return the results will be (at least) twice larger than necessary. It will be much more efficient to write self-join functions for each join algorithm.