Closed steven-esser closed 6 years ago
First and foremost, we will need to update our Scan object (#2) to hold scancode option information.
Additionally, we will need to modify our File object to hold some or all of the scancode license information, like license key, expression, owner etc.
Finally the meat of the work is a post-determine_delta() function that runs across only the modified
category of Delta objects. We may want to modify our 'deltas' data structure from a list to a dictionary as well.
The requested enhancement is to evaluate whether a modified file includes a license change. This would not change its status as a "modified" file. We need some other way to show it is modified, but that there was no license change - based on the same license expression in both files. We could possibly compare the set of detected licenses, but using just license expression would probably be more efficient.
Yes, the naming is something I will discuss with @johnmhoran. There will likely be some changes as we drill down into specific categories of deltas.
@majurg I've added the entirety of the ScanCode licenses
value to the File object, and it's now displayed as well as part of the OrderedDict returned by the DeltaCode object's to_dict()
method. The order of the key/value pairs inside the licenses
field, however, differs from its order in a ScanCode scan, e.g., start_line
and end_line
are no longer adjacent to one another.
This should not affect processing, but makes the data less readable by users IMHO. I think this could readily be corrected by using an OrderedDict for the licenses
, but perhaps that would be overkill. What do you think? And should I commit and push my initial work on this issue before you take the time to consider the question?
@majurg One other thought: while I'm working on the modified
category, shall I change unchanged
to unmodified
throughout the DeltaCode codebase? Or does this merit a separate issue?
Either way is fine; Another possible solution would be to create a new License class.
You can make that change at any time, thats fine.
Thanks, @majurg -- I'll try the License class approach (and will change unchanged
to unmodified
).
@majurg While I'm able to add the ScanCode licenses
value as a new File
object attribute using setattr(self, 'licenses', dictionary.get('licenses'))
, I've tried a wide range of alternative structures using a new License
class that accepts a dictionary, but have not yet succeeded in adding the licenses
value with this separate class. This is an example of the most common error I've encountered:
TypeError: <deltacode.models.License instance at 0x0386CA80> is not JSON serializable
Despite extensive research and testing, I've been unable to figure out how to resolve. Perhaps we can have a brief discussion this morning when we're both in the office.
I am assuming you are using print statements/printing out the json to test your results.
Two options: 1) ignore this for now and do all your testing and debugging via the test suite.
2) modify the json output function to account for this new Licenses field. IMO this is not the best as we do not really want to output this license information at all, we just want to use it for comparing two File objects
45 tests pass, 2 fail, both throwing variations on TypeError: <deltacode.models.License instance at 0x046D1508> is not JSON serializable
at this line: json_output = json.dumps(dict_output)
. Testing by running some deltacode command-line tests produces the same error.
what tests specifically
def test_DeltaCode_json_file_added(self):
new_scan = self.get_test_loc('deltacode/new_added1.json')
old_scan = self.get_test_loc('deltacode/old_added1.json')
result = DeltaCode(new_scan, old_scan)
dict_output = result.to_dict()
json_output = json.dumps(dict_output)
loaded_json = json.loads(json_output)
assert loaded_json['deltas_count'] == 9
assert loaded_json['deltacode_stats']['added'] == 1
assert loaded_json['deltacode_stats']['modified'] == 0
assert loaded_json['deltacode_stats']['removed'] == 0
assert loaded_json['deltacode_stats']['unchanged'] == 8
def test_DeltaCode_json_file_modified(self):
new_scan = self.get_test_loc('deltacode/new_modified1.json')
old_scan = self.get_test_loc('deltacode/old_modified1.json')
result = DeltaCode(new_scan, old_scan)
dict_output = result.to_dict()
json_output = json.dumps(dict_output)
loaded_json = json.loads(json_output)
assert loaded_json['deltas_count'] == 8
assert loaded_json['deltacode_stats']['added'] == 0
assert loaded_json['deltacode_stats']['modified'] == 1
assert loaded_json['deltacode_stats']['removed'] == 0
assert loaded_json['deltacode_stats']['unchanged'] == 7
@johnmhoran why do we dump as json, then load that same json back to a dict?
@majurg Looks like that's a result of adapting the tests from the time when the to_dict()
method was to_json()
and produced JSON output. I made the following change to both failing tests and they now pass:
result = DeltaCode(new_scan, old_scan)
# dict_output = result.to_dict()
# json_output = json.dumps(dict_output)
# loaded_json = json.loads(json_output)
loaded_json = result.to_dict()
However, when printing json.dumps(data)
or selecting the JSON output option, I still receive a variation of this error: TypeError: <deltacode.models.License instance at 0x038E3210> is not JSON serializable
.
Clearly I need to fix this, and I'll need your guidance to understand it first. I also need a way to test the new License class. Visualizing it in some way would also be helpful -- right now, when I print data
defined like this: data = delta.to_dict()
, the output for the licenses
field looks like this: 'licenses': <deltacode.models.License instance at 0x03E31A80>
. Not what we want.
@johnmhoran yes it will always fail to serialize to json because we do not have a to_dict
function for License object. So, thats simple enough to add that, and then call License.to_dict()
in the appropriate location in DeltaCode object's to_dict().
Adding a __repr__
function to our License object fixes the object hash from being printed.
As far as tests go, they should be similar to the File object tests we have.
@majurg Thanks for your help earlier. I've revised the License
constructor, added to_dict()
methods for the File
and License
classes, and referred to the File
to_dict()
method inside the DeltaCode
to_dict()
method. Not only do all 47 tests now pass, but both the printed JSON and the generated JSON files throw no errors, look good, and include the ScanCode licenses
key/value pairs in each of the old
and new
Delta
objects.
Pushing shortly, and then I'll start working on some new tests for the latest set of changes.
@johnmhoran The next step should be updating our DeltaCode.deltas field to an dictionary where the keys are 'added', 'removed', 'modified', 'unmodified'
, and the values are list of corresponding delta objects.
Excellent, @majurg . I'll get started on it.
@majurg I was about to push my commit when I reread your instruction and realized that I misinterpreted the goal. Rather than modifying the DeltaCode.deltas
field, I've modified the deltas
key in the dictionary generated by DeltaCode.to_dict()
to group Delta
objects by a category-based key (and modified generate_csv()
and the test helper function to accommodate the changes).
I'll need to revert all those changes and tackle DeltaCode.deltas
instead but, before I do, I want to confirm that that's what I should do.
Yes, the deltas
field should be changed; this will subsequently require an update of to_dict
and perhaps generate_csv as well
we can have a session if you want as well.
@majurg Sorry about misinterpreting what really was a clear instruction. I've been so focused on the JSON deltas
key and its Delta
objects that when I saw deltas
, that's the association I automatically made.
I think I have a good sense of how to approach DeltaCode.deltas/determine_delta()
but a brief session would be helpful -- I'm available whenever it's convenient for you.
@majurg Taking a break, will then add several more tests for the license_changes
key/value pair, e.g., no licenses
key/value present in scan and no license changes in modified
Delta objects.
@majurg Given the goals we discussed yesterday, is there any reason for us to retain the license_changes
key/value pair in our determine_delta()
OrderedDict?
No, you can remove that.
Thanks, will do.
@majurg I think this is ready for a new PR -- let me know if you agree and if so I'll open the PR.
@johnmhoran comments left on latest commit
@johnmhoran you can open a PR when you are ready so this can get merged in.
@majurg Just saw your message. Excellent. PR enroute.
What would you like me to focus my efforts on for the rest of the week? I'm available to discuss via uberconf at your convenience.
merged #22, closing
@mjherzog brought this up in a recent call.
For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.
For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.