Closed dmurvihill closed 9 years ago
So, as of right now, it's the responsibility of every algorithm to clone a copy of the token list to work on. If they don't, the token list will be shared across multiple comparisons. In cases like Line Comparison, the input token list isn't really modified at all, so this doesn't matter. It would probably be safer to automatically make a copy of the token list every time we run a comparison.
The most logical place to add this would probably be in Similarity Detection worker (https://github.com/Checksims/checksims/blob/master/src/main/java/net/lldp/checksims/util/threading/SimilarityDetectionWorker.java).
Here is a MWE.
@Test
public void tmp() throws InternalAlgorithmError, TokenTypeMismatchException {
GreedyStringTiling instance = GreedyStringTiling.getInstance();
Submission s1 = charSubmissionFromString("s1", "testtest");
Submission s2 = charSubmissionFromString("s2", "test");
Submission s3 = charSubmissionFromString("Empty", "");
final TokenList aTokens = s1.getContentAsTokens();
final TokenList bTokens = s2.getContentAsTokens();
final Iterator<Token> aIt = aTokens.iterator();
final Iterator<Token> bIt = bTokens.iterator();
while(aIt.hasNext() && bIt.hasNext()) {
final Token aTok = aIt.next();
final Token bTok = bIt.next();
if(aTok.equals(bTok)) {
aTok.setValid(false);
bTok.setValid(false);
}
}
SmithWaterman sw = SmithWaterman.getInstance();
AlgorithmResults results = sw.detectSimilarity(s1, s3);
checkResultsNoMatch(results, s1, s3);
}
I agree that the best way to approach this is by operating on copies of the TokenLists
, but it would be better to force this behavior by reducing mutability in the Submission
class itself, rather than worrying about safety in all of the client code that accesses it.
For example, I am working on Running Karp-Rabin Greedy String Tiling, and my test class starts like this:
private final GreedyStringTiling instance = GreedyStringTiling.getInstance();
private static final Submission test = charSubmissionFromString("test", "test");
private static final Submission empty = charSubmissionFromString("empty", "");
private static final Submission typeMismatch =
whitespaceSubmissionFromString("type mismatch", "this is a whitespace thing");
I believe that I should be able to do this and expect that my tests will be orthogonal. At present I have no such guarantee, and simply adding safety code in SimilarityDetectionWorker
would not provide it.
As of right now, we have ImmutableToken, and the ability to create an "immutable copy" of a TokenList (not actually a separate class... should look into refactoring that at some point). Perhaps an ImmutableSubmission wrapper should be added as well, ensuring that the TokenList returned by getTokenList() is immutable.
The SimilarityDetector interface could then be refactored to only work on ImmutableSubmission (and possibly the SubmissionPreprocessor interface as well).
Thoughts?
Why is Submission
mutable at all? What logic depends on mutable state contained in Submission
?
Submission
itself is actually entirely immutable. However, it holds a generic TokenList
which has no guarantee of immutability, and allows this list to be retrieved at will.
We could potentially make the TokenList
in Submission
always immutable. This might break some existing code, but it would seem to be the right thing to do.
I think that's the way to go. I'll guarantee that `Submission.tokenList' is immutable, and modify all accessors to return a mutable copy, and fix any resulting test failures. Sound good?
I almost feel like you shouldn't bother making the accessors return mutable copies unless you need to. Try it without and see what breaks... If it's not too much, we can just fix that. I'm slightly concerned about spending too much time copying token lists if every access is a copy.
Fair enough, I will do that.
Create a test submission
s1
with the texttesttest
and character tokenization, and another submissions2
with the phrasetest
and the same tokenization. Set all tokens of first occurrence of stringtest
ins1
ands2
as invalid, then run S-W ons1
andempty
. Four tokens ins1
will match, even though there should be no match.This issue is caused by the fact that client code is able to alter token validity in a Submission. Can we make the Submission token lists immutable?