aboutcode-org / deltacode

DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.
http://www.aboutcode.org/
20 stars 27 forks source link

use license scan information to further determine modified. #12

Closed steven-esser closed 6 years ago

steven-esser commented 6 years ago

@mjherzog brought this up in a recent call.

For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.

For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.

steven-esser commented 6 years ago

First and foremost, we will need to update our Scan object (#2) to hold scancode option information.

Additionally, we will need to modify our File object to hold some or all of the scancode license information, like license key, expression, owner etc.

Finally the meat of the work is a post-determine_delta() function that runs across only the modified category of Delta objects. We may want to modify our 'deltas' data structure from a list to a dictionary as well.

mjherzog commented 6 years ago

The requested enhancement is to evaluate whether a modified file includes a license change. This would not change its status as a "modified" file. We need some other way to show it is modified, but that there was no license change - based on the same license expression in both files. We could possibly compare the set of detected licenses, but using just license expression would probably be more efficient.

steven-esser commented 6 years ago

Yes, the naming is something I will discuss with @johnmhoran. There will likely be some changes as we drill down into specific categories of deltas.

johnmhoran commented 6 years ago

@majurg I've added the entirety of the ScanCode licenses value to the File object, and it's now displayed as well as part of the OrderedDict returned by the DeltaCode object's to_dict() method. The order of the key/value pairs inside the licenses field, however, differs from its order in a ScanCode scan, e.g., start_line and end_line are no longer adjacent to one another.

This should not affect processing, but makes the data less readable by users IMHO. I think this could readily be corrected by using an OrderedDict for the licenses, but perhaps that would be overkill. What do you think? And should I commit and push my initial work on this issue before you take the time to consider the question?

johnmhoran commented 6 years ago

@majurg One other thought: while I'm working on the modified category, shall I change unchanged to unmodified throughout the DeltaCode codebase? Or does this merit a separate issue?

steven-esser commented 6 years ago

Either way is fine; Another possible solution would be to create a new License class.

You can make that change at any time, thats fine.

johnmhoran commented 6 years ago

Thanks, @majurg -- I'll try the License class approach (and will change unchanged to unmodified).

johnmhoran commented 6 years ago

@majurg While I'm able to add the ScanCode licenses value as a new File object attribute using setattr(self, 'licenses', dictionary.get('licenses')), I've tried a wide range of alternative structures using a new License class that accepts a dictionary, but have not yet succeeded in adding the licenses value with this separate class. This is an example of the most common error I've encountered:

TypeError: <deltacode.models.License instance at 0x0386CA80> is not JSON serializable

Despite extensive research and testing, I've been unable to figure out how to resolve. Perhaps we can have a brief discussion this morning when we're both in the office.

steven-esser commented 6 years ago

I am assuming you are using print statements/printing out the json to test your results.

Two options: 1) ignore this for now and do all your testing and debugging via the test suite.

2) modify the json output function to account for this new Licenses field. IMO this is not the best as we do not really want to output this license information at all, we just want to use it for comparing two File objects

johnmhoran commented 6 years ago

45 tests pass, 2 fail, both throwing variations on TypeError: <deltacode.models.License instance at 0x046D1508> is not JSON serializable at this line: json_output = json.dumps(dict_output). Testing by running some deltacode command-line tests produces the same error.

steven-esser commented 6 years ago

what tests specifically

johnmhoran commented 6 years ago
    def test_DeltaCode_json_file_added(self):
        new_scan = self.get_test_loc('deltacode/new_added1.json')
        old_scan = self.get_test_loc('deltacode/old_added1.json')

        result = DeltaCode(new_scan, old_scan)
        dict_output = result.to_dict()
        json_output = json.dumps(dict_output)
        loaded_json = json.loads(json_output)

        assert loaded_json['deltas_count'] == 9
        assert loaded_json['deltacode_stats']['added'] == 1
        assert loaded_json['deltacode_stats']['modified'] == 0
        assert loaded_json['deltacode_stats']['removed'] == 0
        assert loaded_json['deltacode_stats']['unchanged'] == 8

    def test_DeltaCode_json_file_modified(self):
        new_scan = self.get_test_loc('deltacode/new_modified1.json')
        old_scan = self.get_test_loc('deltacode/old_modified1.json')

        result = DeltaCode(new_scan, old_scan)
        dict_output = result.to_dict()
        json_output = json.dumps(dict_output)
        loaded_json = json.loads(json_output)

        assert loaded_json['deltas_count'] == 8
        assert loaded_json['deltacode_stats']['added'] == 0
        assert loaded_json['deltacode_stats']['modified'] == 1
        assert loaded_json['deltacode_stats']['removed'] == 0
        assert loaded_json['deltacode_stats']['unchanged'] == 7
steven-esser commented 6 years ago

@johnmhoran why do we dump as json, then load that same json back to a dict?

johnmhoran commented 6 years ago

@majurg Looks like that's a result of adapting the tests from the time when the to_dict() method was to_json() and produced JSON output. I made the following change to both failing tests and they now pass:

        result = DeltaCode(new_scan, old_scan)
        # dict_output = result.to_dict()
        # json_output = json.dumps(dict_output)
        # loaded_json = json.loads(json_output)

        loaded_json = result.to_dict()

However, when printing json.dumps(data) or selecting the JSON output option, I still receive a variation of this error: TypeError: <deltacode.models.License instance at 0x038E3210> is not JSON serializable.

Clearly I need to fix this, and I'll need your guidance to understand it first. I also need a way to test the new License class. Visualizing it in some way would also be helpful -- right now, when I print data defined like this: data = delta.to_dict(), the output for the licenses field looks like this: 'licenses': <deltacode.models.License instance at 0x03E31A80>. Not what we want.

steven-esser commented 6 years ago

@johnmhoran yes it will always fail to serialize to json because we do not have a to_dict function for License object. So, thats simple enough to add that, and then call License.to_dict() in the appropriate location in DeltaCode object's to_dict().

Adding a __repr__ function to our License object fixes the object hash from being printed.

As far as tests go, they should be similar to the File object tests we have.

johnmhoran commented 6 years ago

@majurg Thanks for your help earlier. I've revised the License constructor, added to_dict() methods for the File and License classes, and referred to the File to_dict() method inside the DeltaCode to_dict() method. Not only do all 47 tests now pass, but both the printed JSON and the generated JSON files throw no errors, look good, and include the ScanCode licenses key/value pairs in each of the old and new Delta objects.

Pushing shortly, and then I'll start working on some new tests for the latest set of changes.

steven-esser commented 6 years ago

@johnmhoran The next step should be updating our DeltaCode.deltas field to an dictionary where the keys are 'added', 'removed', 'modified', 'unmodified', and the values are list of corresponding delta objects.

johnmhoran commented 6 years ago

Excellent, @majurg . I'll get started on it.

johnmhoran commented 6 years ago

@majurg I was about to push my commit when I reread your instruction and realized that I misinterpreted the goal. Rather than modifying the DeltaCode.deltas field, I've modified the deltas key in the dictionary generated by DeltaCode.to_dict() to group Delta objects by a category-based key (and modified generate_csv() and the test helper function to accommodate the changes).

I'll need to revert all those changes and tackle DeltaCode.deltas instead but, before I do, I want to confirm that that's what I should do.

steven-esser commented 6 years ago

Yes, the deltas field should be changed; this will subsequently require an update of to_dict and perhaps generate_csv as well

steven-esser commented 6 years ago

we can have a session if you want as well.

johnmhoran commented 6 years ago

@majurg Sorry about misinterpreting what really was a clear instruction. I've been so focused on the JSON deltas key and its Delta objects that when I saw deltas, that's the association I automatically made.

I think I have a good sense of how to approach DeltaCode.deltas/determine_delta() but a brief session would be helpful -- I'm available whenever it's convenient for you.

johnmhoran commented 6 years ago

@majurg Taking a break, will then add several more tests for the license_changes key/value pair, e.g., no licenses key/value present in scan and no license changes in modified Delta objects.

johnmhoran commented 6 years ago

@majurg Given the goals we discussed yesterday, is there any reason for us to retain the license_changes key/value pair in our determine_delta() OrderedDict?

https://github.com/nexB/deltacode/blob/196fabd643333f53d50f2a163e677e43c8d7d221/src/deltacode/__init__.py#L49-L55

steven-esser commented 6 years ago

No, you can remove that.

johnmhoran commented 6 years ago

Thanks, will do.

johnmhoran commented 6 years ago

@majurg I think this is ready for a new PR -- let me know if you agree and if so I'll open the PR.

steven-esser commented 6 years ago

@johnmhoran comments left on latest commit

steven-esser commented 6 years ago

@johnmhoran you can open a PR when you are ready so this can get merged in.

johnmhoran commented 6 years ago

@majurg Just saw your message. Excellent. PR enroute.

What would you like me to focus my efforts on for the rest of the week? I'm available to discuss via uberconf at your convenience.

steven-esser commented 6 years ago

merged #22, closing