aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Override license detection by checksum #1281

Open jonasob opened 6 years ago

jonasob commented 6 years ago

While scanning some repositories, notably SignalR, we routinely come across files identified with an unknown license. Typically, this is because the file both mentions a license, and makes a reference to an external license file, where the latter is then matched as unknown. See for instance unknown_19.RULE for a popular one in SignalR.

In conjunction with defining a policy with --license-policy, this is not ideal: you hardly want to say either yea or nay about unknown licenses without looking at the files in question individually.

What I've ended up doing is that I adapted our wrapper around ScanCode with a post-process step:

  1. For any file with an unknown license match, check in whitelist.txt, which for each line contains a sha1:license tuple.
  2. If the file's sha1 matches a sha1 in the whitelist, override ScanCode's unknown with the identified license.
  3. For any files still matched as unknown, output them as part of the report in the same format, for manual identification and addition to whitelist.txt.
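The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual wrapper: it assumes each scanned file is a dict with "sha1" and a "licenses" list of dicts carrying a "key", and the function names and the "source" marker are made up for the example.

```python
UNKNOWN_KEY = "unknown"  # license key treated as "unknown" (assumption)


def load_whitelist(lines):
    """Parse 'sha1:license' lines into a dict, skipping blank lines."""
    whitelist = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sha1, _, license_key = line.partition(":")
        whitelist[sha1.strip().lower()] = license_key.strip()
    return whitelist


def apply_whitelist(files, whitelist):
    """Override unknown matches in place; return 'sha1:unknown' lines for the rest."""
    still_unknown = []
    for entry in files:
        licenses = entry.get("licenses", [])
        # Step 1: only files with an unknown match are looked up.
        if not any(lic.get("key") == UNKNOWN_KEY for lic in licenses):
            continue
        sha1 = (entry.get("sha1") or "").lower()
        override = whitelist.get(sha1)
        if override is None:
            # Step 3: report in whitelist format for manual identification.
            still_unknown.append(f"{sha1}:{UNKNOWN_KEY}")
            continue
        # Step 2: drop the unknown match and record the whitelisted license.
        kept = [lic for lic in licenses if lic.get("key") != UNKNOWN_KEY]
        if not any(lic.get("key") == override for lic in kept):
            kept.append({"key": override, "source": "whitelist"})
        entry["licenses"] = kept
    return still_unknown
```

The returned lines can be appended to whitelist.txt once someone has replaced "unknown" with the license they identified.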

This could probably be improved quite a bit if done in ScanCode, but it would move ScanCode further in the direction of being a compliance toolkit, rather than a scanner, so it might be that it wouldn't fit the roadmap.

pombredanne commented 6 years ago

@jonasob Thank you for the detailed report!

See my answers inline. You wrote:

While scanning some repositories, notably SignalR, we routinely come across files identified with an unknown license. Typically, this is because the file both mentions a license, and makes a reference to an external license file, where the latter is then matched as unknown. See for instance unknown_19.RULE for a popular one in SignalR.

In these cases, the best way out is to add new license detection rules. This is what I did in this commit https://github.com/nexB/scancode-toolkit/commit/adf79c252b6689f29bd7a3417604274a80dbfbaa#diff-8cc1a6e276d5f2058a1ca559b757b055R1 where I added a new rule that covers both Apache and the unknown_19.RULE rule that was fired alone otherwise.

With this, no unknown is reported anymore when scanning SignalR using the code in this branch https://github.com/nexB/scancode-toolkit/tree/new-licenses-and-rules

I also added a few more rule refinements to handle some of the peculiarities in this package.

So adding new license rules / improving existing ones may be a better way IMHO as this works also across code changes and would typically apply to a class of packages from the same team/company that uses the same conventions.

Now in the special case of things such as "See License.txt in the project root for license information." ... a couple of things:

  1. unknown_19.RULE, which was reporting unknown, has been renamed to unknown-license-reference_29.RULE and updated to report the proper unknown-license-reference license key instead. That part was a bug.

  2. RULEs have an optional referenced_filename attribute. Combined with the unknown-license-reference license key, the goal is to infer what the license.txt file license is and carry it to this detected reference which should be doable quite often. This is tracked in this umbrella ticket for now https://github.com/nexB/scancode-toolkit/issues/377

Think of it as a way to de-reference such a license mention to the actual file it is referencing and to the license(s) present in these files. We are not yet there, but not far since all the bits are there data-wise and @gerv (RIP, Gervase!) has shown us that this can be done in his slic tool https://github.com/gerv/slic/blob/master/inferno#L9
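To make the de-referencing idea concrete, here is a rough sketch of what it could look like over scan results. It is not how ScanCode does or will do it: it assumes each file is a dict with "path" and "licenses", and that a match carries the rule's referenced_filename next to its "key", which is a simplification made for this example.

```python
import posixpath

REFERENCE_KEY = "unknown-license-reference"


def resolve_references(files):
    """Map each referencing file path to the license keys of the file it points to."""
    # Index the "real" license keys of every file by lowercased path.
    by_path = {}
    for entry in files:
        keys = [lic["key"] for lic in entry.get("licenses", [])
                if lic["key"] != REFERENCE_KEY]
        by_path[entry["path"].lower()] = keys

    resolved = {}
    for entry in files:
        for lic in entry.get("licenses", []):
            name = lic.get("referenced_filename")
            if lic["key"] != REFERENCE_KEY or not name:
                continue
            # Look next to the file first, then walk up to the project root.
            directory = posixpath.dirname(entry["path"])
            while True:
                candidate = posixpath.join(directory, name).lower()
                keys = by_path.get(candidate)
                if keys:
                    resolved[entry["path"]] = keys
                    break
                parent = posixpath.dirname(directory)
                if parent == directory:
                    break  # reached the root without finding the file
                directory = parent
    return resolved
```

Walking up the directory tree is a design guess: notices like this usually point at the project root, but a nested License.txt closer to the file would win here.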

What I've ended up doing is that I adapted our wrapper around ScanCode with a post-process step: [..]

This is rather involved and can surely work ... until you upgrade your package to a new version. In this case, all the whitelisting work is lost as the sha1 will have quite likely changed in many cases. There are a few ways around this:

  1. use checksums that can cope with small changes such as these: https://github.com/nexB/scancode-toolkit-contrib/tree/develop/src/samecode

  2. use a slightly different approach where you create and update new license rules until the package license scan is fully satisfying in terms of reported licenses, and then store this as a baseline. Then for future scans, you check against the baseline for license/copyright changes (as opposed to file changes) and trigger a review of this package if and only if there are license/copyright changes.
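The baseline approach in the second item could be sketched like this. It is only an illustration under assumptions: files are dicts with "path", a "licenses" list of dicts with a "key", and a "copyrights" list of plain strings; real scan output is richer than that, and the function names are made up.

```python
def summarize(files):
    """Map path -> (sorted license keys, sorted copyright statements)."""
    summary = {}
    for entry in files:
        licenses = tuple(sorted({lic["key"] for lic in entry.get("licenses", [])}))
        copyrights = tuple(sorted(set(entry.get("copyrights", []))))
        summary[entry["path"]] = (licenses, copyrights)
    return summary


def changed_paths(baseline_files, new_files):
    """Return paths whose license or copyright data differs from the baseline."""
    empty = ((), ())
    old = summarize(baseline_files)
    new = summarize(new_files)
    # A file missing on one side counts as having no license or copyright data,
    # so added or removed files without any detections do not trigger a review.
    return [path for path in sorted(set(old) | set(new))
            if old.get(path, empty) != new.get(path, empty)]
```

An empty result means the new version needs no review even if every checksum changed.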

Note that @majurg and @johnmhoran have been working on DeltaCode, and tracking and detecting these kinds of changes "in the large" is eventually part of the goal there. (Ping: correct?)

Also, one thing could be to also track the detected rule, which will always be more specific than the detected license key.

In any case, something is definitely needed in this area, as some files may just have a reference to some file at the root such as here: https://github.com/SignalR/java-client/blob/master/signalr-client-sdk/src/main/java/microsoft/aspnet/signalr/client/ConnectionBase.java#L4

This could probably be improved quite a bit if done in ScanCode, but it would move ScanCode further in the direction of being a compliance toolkit, rather than a scanner, so it might be that it wouldn't fit the roadmap.

Definitely fits the roadmap, and we are on the same page. The diffing/change tracking could IMHO be best in DeltaCode https://github.com/nexB/deltacode or alternatively as a ScanCode plugin... Generally we prefer to have focused tools... but the overall goal is compliance automation in any case.

BUT since in the end this is about storing a conclusion of sorts for a given package, I would consider using AboutCode toolkit for this (see for instance the .ABOUT files in https://github.com/nexB/scancode-toolkit/tree/develop/thirdparty ) and store the actual license that you determined to be the right one there (and/or in a database-backed system ;) )
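As a rough idea of what storing such a conclusion might look like, here is a small helper that renders the text of a minimal .ABOUT file. The field names (about_resource, name, license_expression, notes) are the ones I understand the ABOUT specification to define; check the spec before relying on them. The helper itself and its arguments are hypothetical.

```python
def render_about(about_resource, name, license_expression, notes=None):
    """Render a minimal .ABOUT file as 'field: value' lines."""
    fields = [
        ("about_resource", about_resource),
        ("name", name),
        ("license_expression", license_expression),  # the concluded license
    ]
    if notes:
        fields.append(("notes", notes))
    return "".join(f"{key}: {value}\n" for key, value in fields)
```

Values are written as-is, so anything that needs quoting or multi-line handling would need more care than this sketch gives it.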

pombredanne commented 6 years ago

@jonasob any feedback on my reply?

pombredanne commented 6 years ago

@jonasob Several new license rules have been merged into develop. As for the checksum/whitelisting, I am waiting for your feedback.

jonasob commented 6 years ago

Thanks @pombredanne! I agree that improving the matching rules is preferred in most cases, though that's slightly more involved than whitelisting, and I do fear it's a bit beyond my capabilities as I can easily see how changing some matching rules might have unforeseen consequences elsewhere!

Regarding where to place the whitelisting, I'm also undecided about that. I agree that ScanCode doesn't quite fit the bill, and I would expect ScanCode to report what it finds, no more and no less, and then other tools can pick that information up and make changes in post-processing, if an Upstream First strategy turned out not to work :)

So I guess I land in:

  1. ScanCode should scan and report what it finds, period.
  2. When ScanCode finds something which it cannot detect or detects incorrectly, there should be some attempt at fixing this problem upstream, if there's a possibility to do so, and the matching rules should ideally be updated (and new regression tests should be added)
  3. For concluded licenses based on unknown or incorrectly identified licenses, adding that meta information via ABOUT files could be a solution, and in either case, belongs in post-processing.

pombredanne commented 6 years ago

I think there is still a need for whitelisting somehow for some use cases. This could work as a post scan or output filter plugin. The difficult part is figuring out a proper workflow and user experience.

jonasob commented 6 years ago

I don't know the architecture, but while I think an output filter makes sense, I also cannot shake the feeling that the earlier something like that is introduced in the pipeline, the less problematic it becomes.

Do we have any structure to hook this into a cache of sorts? If there was the possibility to hook in configurable cache options, potentially with multiple caches, then we could end up with a situation where there's one r/w cache for caching data between runs, and one could add separate r/o databases which are essentially whitelists or overrides.
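To illustrate what I mean, a layered lookup could look something like the sketch below. This is purely hypothetical and says nothing about ScanCode's actual cache: the class name is made up, the layers are plain in-memory mappings from sha1 to license key, and persistence between runs is left out.

```python
from types import MappingProxyType


class LayeredLicenseCache:
    """Read-only override layers consulted before a single read/write cache."""

    def __init__(self, overrides=(), cache=None):
        # Wrap each override layer so it cannot be modified through this object.
        self.overrides = [MappingProxyType(dict(layer)) for layer in overrides]
        self.cache = {} if cache is None else cache

    def get(self, sha1):
        # Overrides (whitelists) win over anything cached from earlier runs.
        for layer in self.overrides:
            if sha1 in layer:
                return layer[sha1]
        return self.cache.get(sha1)

    def put(self, sha1, license_key):
        # Writes only ever go to the read/write cache.
        self.cache[sha1] = license_key
```

With this ordering a whitelist entry always shadows a cached scan result for the same checksum, which is the override behaviour I was after.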