aboutcode-org / deltacode

DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.
http://www.aboutcode.org/
20 stars 27 forks source link

Failing case if extra directory is added #17

Open chinyeungli opened 6 years ago

chinyeungli commented 6 years ago

I did some simple tests and here is my finding:

I use "balloontip-1.1.1.jar" as a sample file.

  1. Created 2 directories d1/ and d2/ and put the test file in it and then compare these 2 directories. The output is unchanged which is correct.

  2. Same setup as (1) but create a new subdirectory named test/ under d1/ and put balloontip-1.1.1.jar in it. Both the d1/balloontip-1.1.1.jar d1/test/balloontip-1.1.1.jar are returned as added.

and the d2/balloontip-1.1.1.jar is returned as removed

which is not correct as the d1/balloontip-1.1.1.jar and d2/balloontip-1.1.1.jar should return unchanged while the d1/test/balloontip-1.1.1.jar is consider as added.

  1. Same setup as (1) but create a new root/ directory and put the d1/ in it and run the deltacode from root/ to d2/. The output is unchanged which is correct.
steven-esser commented 6 years ago

The functionality you are looking for is relating to another ticket, https://github.com/nexB/deltacode/issues/4

Since you’ve nested the jar file one level lower the second time, deltacode infers this as a removal followed by an addition, when in reality that file has just been moved to another location.

I’ll keep this ticket open for reference and close it when #4 gets taken care of.

chinyeungli commented 6 years ago

@majurg not exactly the same issue.

For instance, In (1), I have /d1/balloontip-1.1.1.jar /d2/balloontip-1.1.1.jar Then, the deltacode return d1 and d2 is the same which is correct.

in case (2) I have /d1/balloontip-1.1.1.jar /d1/test/balloontip-1.1.1.jar /d2/balloontip-1.1.1.jar

which I expect the deltacode will tell me the /d1/balloontip-1.1.1.jar and /d2/balloontip-1.1.1.jar are the same and /d1/test/balloontip-1.1.1.jar is new. However, the tool tells me BOTH /d1/balloontip-1.1.1.jar /d1/test/balloontip-1.1.1.jar are added and /d2/balloontip-1.1.1.jar is removed.

So, my question is why adding a new directory make the "suppose to be the same" originally become "added/removed"

johnmhoran commented 6 years ago

@majurg Here's what we have after scanning and running DeltaCode on these 3 pairs of test codebases. The results don't seem to be entirely consistent. Putting aside the fact that we currently treat files as moved only when there's a single identical added and removed file, I think the inconsistent treatment arises at least in part from the way we remove path segments during the fix_trees()/align_trees() process.

  1. Rename the directory from d1 to d2. image

DeltaCode treats this as unmodified.

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 0,
    "removed": 0,
    "unmodified": 1
  },
  "deltas": [
    {
      "category": "unmodified",
      "path": "balloontip-1.1.1.jar",
      "name": "balloontip-1.1.1.jar",
      "type": "file",
      "size": 53842
    }
  ]
}
  1. Add a test subdirectory to d1, also containing balloontip-1.1.1.jar. image

DeltaCode treats this as 2 removed, 1 added.

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 1,
    "modified": 0,
    "moved": 0,
    "removed": 2,
    "unmodified": 0
  },
  "deltas": [
    {
      "category": "added",
      "path": "balloontip_new_test_subdirectory_new/d2/balloontip-1.1.1.jar",
      "name": "balloontip-1.1.1.jar",
      "type": "file",
      "size": 53842
    },
    {
      "category": "removed",
      "path": "balloontip_new_test_subdirectory_old/d1/balloontip-1.1.1.jar",
      "name": "balloontip-1.1.1.jar",
      "type": "file",
      "size": 53842
    },
    {
      "category": "removed",
      "path": "balloontip_new_test_subdirectory_old/d1/test/balloontip-1.1.1.jar",
      "name": "balloontip-1.1.1.jar",
      "type": "file",
      "size": 53842
    }
  ]
}
  1. Add a root directory above d1. image

DeltaCode treats this as unmodified.

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 0,
    "removed": 0,
    "unmodified": 1
  },
  "deltas": [
    {
      "category": "unmodified",
      "path": "balloontip-1.1.1.jar",
      "name": "balloontip-1.1.1.jar",
      "type": "file",
      "size": 53842
    }
  ]
}
Pratikrocks commented 3 years ago

@chinyeungli @johnmhoran @majurg how about we ignore the align scans for now, after the integration of VirtualCodebse? Maybe we could follow up in a separate branch apart from the main branch.

Let's consider a directory structure as : New Directory New directory

Old Directory Old Directory

Now if the a1.py is having the same sha1 we are treating them as unchanged files, we completely ignore that their main directories are different. Mainly owing to (alignscans / fix trees).

But instead of that what if we do not allow changing the main root directory? Now what I propose is we should also treat them as per their main root directory (not just their sub dir). If we do so we do not need the extra burden of aligning the scans .

The scans will be aligned as they are loaded from the Virtual Codebase as a resource objects. We can safely ignore all aligning.

Now for a file to have the status as unmodified it must have the same path(full path) along with the same sha1. for a file of status moved it must exist in some other subdirs in new_scan along with that it must have the same sha1. And so on for other status ....

And also the codebase would be a lot cleaner than now

@chinyeungli @johnmhoran @majurg need your views upon this

steven-esser commented 3 years ago

@Pratikrocks I agree with removing/ignoring alignment for the first implementation of adding virtualcodebase.

Pratikrocks commented 3 years ago

Okay