Copyleaks / plagiarism-report

Allow Copyleaks API users to view the plagiarism report using their downloaded data. Using this report allows users to view the report anytime without being restricted by the Copyleaks expiration policy (that deletes the reports data).
https://api.copyleaks.com
Other
17 stars 8 forks source link

Calculation bugs in statistics.ts #98

Open jcberquist opened 1 year ago

jcberquist commented 1 year ago

There are a couple of issues that we have noticed in the statistics calculations int the statistics.ts utility file, specifically in the mergeWordIntervals function. First in the code path where the reducer checks for what it calls "same start inclusion":

    // same start inclusion
    // [---p---]  or
    // [--n--]
    else if (next.start === prev.start && next.end < prev.end) {
      if (next.type < prev.type) {
        joined[joined.length - 1] = next;
        joined.push({ ...prev, start: next.start });
      }
    }

Specifically on the joined.push the start of the interval is set to next.start but I think that should be set to next.end. As it stands now, the two intervals will overlap with each other.

Secondly, there is one code path where two new intervals are pushed onto the stack at the same time, in the "complete inclusion" path:

    // complete inclusion
    // [---p---]
    //  [--n--]
    else if (prev.start < next.start && next.end < prev.end) {
      if (next.type < prev.type) {
        joined[joined.length - 1] = { ...prev, end: next.start };
        joined.push(next);
        joined.push({ ...prev, start: next.end });
      }
    }

Effectively what this path does is it takes the last interval in the joined array and the next interval passed in by the reducer, and converts them to three sequential non-overlapping intervals. However, since the reducer only ever compares the last item in the joined array to the next item coming in, there are scenarios where the interval on the next pass will start before the last item on the joined array (and overlap with the second to last item in the joined array), with the result that when that next interval is compared to its previous, it will just be pushed onto the array. For example, consider the following sequence of intervals:

[----p----]
   [--n--]
   [---f---]

After the first loop (assuming p is in the joined array, and n is next) you have:

[-]
   [--n--]
          [-]
   [---f---]

But since the next item (f) in this case now starts before the last item on the joined array (and overlaps with the second to last item in the array), there is no code path for that situation, and it hits the else block, and is pushed onto the joined array as is. Again, this means that matches can overlap even after a pass through the mergeWordIntervals function.

These overlapping intervals result in calculations of match percentage that are erroneously high.