blingenf / copydetect

Code plagiarism detection tool
MIT License
238 stars, 36 forks

feat: report match totals for each test-file based on all reference files #44

Open ankostis opened 1 year ago

ankostis commented 1 year ago

Currently (v0.4.5), the tool reports match ratios for every pair of test <--> ref files. I will focus here on match ratios for test files, but the same applies to ref files, reversed.

Let's assume these are the reported match-ratios for test files:

  graph LR;
      T1--70%-->R1;
      T1--65%-->R2;
      T1--55%-->R3;
      T2--3%-->R1;
      T2--2%-->R2;
      T2--8%-->R3;

What I'm missing is a new summary section with the grand-total match for each test file against the whole ref codebase, i.e. how many lines are copies, regardless of which specific ref file matched them. Something like this:

  graph LR;
      T1--98%-->R[R1+R2+R3] ;
      T2--11%-->R[R1+R2+R3] ;

When expanding the ratios in this section, I would expect to see only the "left" diff pane with the copied test code, like a code-coverage report showing the number of matches for each LoC, like this: [image]
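To make the request concrete, here is a minimal sketch of the aggregation I have in mind. It is not copydetect's API: `aggregate_coverage`, the per-ref line sets, and the 100-line test file are all hypothetical, and it assumes matches are available as test-file line numbers (the tool itself may track tokens rather than lines).

```python
from collections import Counter


def aggregate_coverage(total_lines, matches_by_ref):
    """Combine per-ref matches for one test file.

    matches_by_ref maps a ref-file name to the set of 1-based test-file
    line numbers matched by that ref file (a hypothetical input format).
    Returns (overall_ratio, hits), where hits[line] is the number of ref
    files that matched that line.
    """
    hits = Counter()
    for lines in matches_by_ref.values():
        hits.update(lines)
    # A line counts once toward the total, however many refs matched it.
    return len(hits) / total_lines, hits


# Hypothetical 100-line test file T1 with 70%, 65% and 55% pairwise matches.
t1_matches = {
    "R1": set(range(1, 71)),   # lines 1-70
    "R2": set(range(20, 85)),  # lines 20-84
    "R3": set(range(44, 99)),  # lines 44-98
}
ratio, hits = aggregate_coverage(100, t1_matches)
print(f"T1 vs R1+R2+R3: {ratio:.0%}")  # 98%
print(hits[50], hits[10], hits[99])    # 3 1 0
```

The total is the size of the union of matched lines rather than the sum of the pairwise ratios, so overlapping matches are not double-counted, and the per-line counts are what the coverage-style view would display.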

Does that make sense?

Workaround

Currently I have to concatenate all ref files into a single one with a command like:

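mkdir -p /tmp/all
# quote the glob so the shell does not expand it; redirect cat's output into the combined file
find ref_project/ -name '*.cs' -print0 | xargs -0 cat > /tmp/all/all.cs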

... and then run against the new ref-folder:

copydetect -t test_project/ -r /tmp/all/ -e cs

blingenf commented 1 year ago

I agree that similarity to the test set as a whole would be useful to display. The improved bookkeeping I have planned for #34 will make this much easier to implement -- I'll address that first then see if I can get something working for aggregate similarity.

ankostis commented 1 year ago

Is this related to this comment in #33?

But Is there any way for us to get the overall match contents of the file ? ...finding the overall matched tokens from all the reference files would be merely impossible?