aboutcode-org / deltacode

DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.
http://www.aboutcode.org/

AssertionError when running DeltaCode on eCos scans #15

Closed JonoYang closed 6 years ago

JonoYang commented 6 years ago

I ran ScanCode with the following options (-clipeu) on version 2.0 of eCos and on the latest HEAD of the eCos CVS repo. Afterwards, I ran DeltaCode on the report files and got the following error:

$ deltacode -n ecos-head.json -o ~/Desktop/ecos-2.0-linux.json -c delta.csv
Traceback (most recent call last):
  File "/home/jono/nexb/tools/develop/deltacode/bin/deltacode", line 11, in <module>
    load_entry_point('deltacode', 'console_scripts', 'deltacode')()
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/cli.py", line 80, in cli
    delta = DeltaCode(new, old)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 23, in __init__
    self.deltas = self.determine_delta()
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 103, in determine_delta
    assert len(deltas) == ((self.new.files_count - new_nonfiles) + (self.old.files_count - old_nonfiles) - modified - unchanged)
AssertionError

Attached are the input files I used to get this error: ecos-scans.zip
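
For readers following along, the counting invariant behind the failing assertion can be sketched as below. This is only an illustration under an assumption: that every file ends up in exactly one of four buckets (added, removed, modified, unchanged). The traceback implies this but does not confirm it. The helper name `expected_delta_count` and the sample numbers are hypothetical and are not part of DeltaCode.

```python
def expected_delta_count(new_files, old_files, modified, unchanged):
    """Right-hand side of the assertion, with non-file entries already excluded.

    Modified and unchanged files appear in both scans but should yield a
    single delta each, so they are subtracted once.
    """
    return new_files + old_files - modified - unchanged


# Hypothetical example: 2 added, 1 removed, 3 modified, 4 unchanged files.
added, removed, modified, unchanged = 2, 1, 3, 4
new_files = added + modified + unchanged      # files present in the new scan
old_files = removed + modified + unchanged    # files present in the old scan

deltas = added + removed + modified + unchanged  # one delta per distinct file
assert expected_delta_count(new_files, old_files, modified, unchanged) == deltas
```

If the assertion fails, either the number of deltas or one of the counts on the right-hand side disagrees with the files actually processed.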

johnmhoran commented 6 years ago

@JonoYang This is highly speculative, but at first blush it looks like the file count is off, perhaps reminiscent of the missing files issue we experienced a few weeks back. As I recall, that had something to do with how the codebases for the pair of scans were defined.

Is that a fair description of that cause? And is there a chance that your eCos scans bear some similarity to that earlier set of scans?

steven-esser commented 6 years ago

Yes, something weird is going on here; I will take a look.

steven-esser commented 6 years ago

Looks like there is some discrepancy between files_count and len(index) for some scans.

Still hammering out the details, but we at least have some initial tests written that reproduce this behavior.

At first glance, it looks like something goes wrong during the indexing step, which fails to index paths that have been aligned to ''.
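
A minimal sketch of how files_count and len(index) could drift apart, assuming the index is a plain dict keyed on a single attribute. The `build_index` helper and the sample file records are hypothetical and do not reflect DeltaCode's actual data model; the two entries with an empty path stand in for paths aligned to ''.

```python
def build_index(files, key):
    """Naive index: one file per key, so later files overwrite earlier ones."""
    index = {}
    for f in files:
        index[f[key]] = f
    return index


files = [
    {"path": "", "sha1": "a1"},            # hypothetical: path aligned to ''
    {"path": "", "sha1": "b2"},            # a second path aligned to ''
    {"path": "src/main.c", "sha1": "c3"},
]
files_count = len(files)
index = build_index(files, "path")

# The two '' entries share a key, so only one of them survives in the index.
assert files_count == 3
assert len(index) == 2
```

Any count derived from the index would then be lower than files_count, which is the kind of mismatch that would trip the assertion.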

steven-esser commented 6 years ago

This happens after scan alignment.

steven-esser commented 6 years ago

OK, I have figured out the main cause: we are experiencing hash collisions during our file indexing.

This was expected for things like sha1 indexing, but I underestimated how easily it could happen with paths as well (especially since we run align_scan etc. during the delta).

So, this fix will require a bit more work than anticipated, but handling it will allow us to tackle other problems more easily (moved files etc.). We would have had to make this change at some point, so we are not in a bad place.
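
One common way to tolerate key collisions is to map each key to a list of files rather than to a single file. The sketch below illustrates that general technique only; it is not necessarily the approach taken in the actual fix, and `build_multi_index` and the sample records are hypothetical names and data.

```python
from collections import defaultdict


def build_multi_index(files, key):
    """Index that keeps every file: each key maps to a list of files."""
    index = defaultdict(list)
    for f in files:
        index[f[key]].append(f)
    return index


sample_files = [
    {"path": "", "sha1": "a1"},
    {"path": "", "sha1": "b2"},            # same key as above, kept this time
    {"path": "src/main.c", "sha1": "c3"},
]
multi_index = build_multi_index(sample_files, "path")

# No file is lost: the indexed total matches the number of input files.
indexed_total = sum(len(bucket) for bucket in multi_index.values())
assert indexed_total == len(sample_files)
```

The same structure keyed on sha1 would group identical content found at different paths, which is one way moved files could be detected later.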

steven-esser commented 6 years ago

Fixed with #18