Checksims / checksims

Software Similarity Detection to identify academic dishonesty
Other
7 stars 5 forks source link

Consecutive algorithm runs share mutable state on the same Submission #19

Closed dmurvihill closed 9 years ago

dmurvihill commented 9 years ago

Create a test submission s1 with the text testtest and character tokenization, and another submission s2 with the phrase test and the same tokenization. Set all tokens of first occurrence of string test in s1 and s2 as invalid, then run S-W on s1 and empty. Four tokens in s1 will match, even though there should be no match.

This issue is caused by the fact that client code is able to alter token validity in a Submission. Can we make the Submission token lists immutable?

mheon commented 9 years ago

So, as of right now, it's the responsibility of every algorithm to clone a copy of the token list to work on. If they don't, the token list will be shared across multiple comparisons. In cases like Line Comparison, the input token list isn't really modified at all, so this doesn't matter. It would probably be safer to automatically make a copy of the token list every time we run a comparison.

The most logical place to add this would probably be in Similarity Detection worker (https://github.com/Checksims/checksims/blob/master/src/main/java/net/lldp/checksims/util/threading/SimilarityDetectionWorker.java).

dmurvihill commented 9 years ago

Here is a MWE.

    @Test
    public void tmp() throws InternalAlgorithmError, TokenTypeMismatchException {
        GreedyStringTiling instance = GreedyStringTiling.getInstance();
        Submission s1 = charSubmissionFromString("s1", "testtest");
        Submission s2 = charSubmissionFromString("s2", "test");
        Submission s3 = charSubmissionFromString("Empty", "");
        final TokenList aTokens = s1.getContentAsTokens();
        final TokenList bTokens = s2.getContentAsTokens();

        final Iterator<Token> aIt = aTokens.iterator();
        final Iterator<Token> bIt = bTokens.iterator();

        while(aIt.hasNext() && bIt.hasNext()) {
            final Token aTok = aIt.next();
            final Token bTok = bIt.next();
            if(aTok.equals(bTok)) {
                aTok.setValid(false);
                bTok.setValid(false);
            }
        }

        SmithWaterman sw = SmithWaterman.getInstance();
        AlgorithmResults results = sw.detectSimilarity(s1, s3);
        checkResultsNoMatch(results, s1, s3);
    }
dmurvihill commented 9 years ago

I agree that the best way to approach this is by operating on copies of the TokenLists, but it would be better to force this behavior by reducing mutability in the Submission class itself, rather than worrying about safety in all of the client code that accesses it.

For example, I am working on Running Karp-Rabin Greedy String Tiling, and my test class starts like this:

    private final GreedyStringTiling instance = GreedyStringTiling.getInstance();
    private static final Submission test = charSubmissionFromString("test", "test");
    private static final Submission empty = charSubmissionFromString("empty", "");
    private static final Submission typeMismatch =
            whitespaceSubmissionFromString("type mismatch", "this is a whitespace thing");

I believe that I should be able to do this and expect that my tests will be orthogonal. At present I have no such guarantee, and simply adding safety code in SimilarityDetectionWorker would not provide it.

mheon commented 9 years ago

As of right now, we have ImmutableToken, and the ability to create an "immutable copy" of a TokenList (not actually a separate class... should look into refactoring that at some point). Perhaps an ImmutableSubmission wrapper should be added as well, ensuring that the TokenList returned by getTokenList() is immutable.

The SimilarityDetector interface could then be refactored to only work on ImmutableSubmission (and possibly the SubmissionPreprocessor interface as well).

Thoughts?

dmurvihill commented 9 years ago

Why is Submission mutable at all? What logic depends on mutable state contained in Submission?

mheon commented 9 years ago

Submission itself is actually entirely immutable. However, it holds a generic TokenList which has no guarantee of immutability, and allows this list to be retrieved at will.

We could potentially make the TokenList in Submission always immutable. This might break some existing code, but it would seem to be the right thing to do.

dmurvihill commented 9 years ago

I think that's the way to go. I'll guarantee that `Submission.tokenList' is immutable, and modify all accessors to return a mutable copy, and fix any resulting test failures. Sound good?

mheon commented 9 years ago

I almost feel like you shouldn't bother making the accessors return mutable copies unless you need to. Try it without and see what breaks... If it's not too much, we can just fix that. I'm slightly concerned about spending too much time copying token lists if every access is a copy.

dmurvihill commented 9 years ago

Fair enough, I will do that.