clearlydefined / service

The service side of clearlydefined.io
MIT License

[Scancode] dual license handling in the summarizer #280

Open dabutvin opened 5 years ago

dabutvin commented 5 years ago

When we gather the file.licenses.license or file.licenses.spdx_license_key data, we AND the values together by default, and this is not always correct.

Given this rust crate

We list these discovered files as Apache-2.0 and MIT, but this is only because scancode found both and we AND them together.

The scancode output has other information we should probably consume, for example:

{
  "key": "apache-2.0",
  "score": 20,
  "short_name": "Apache 2.0",
  "category": "Permissive",
  "owner": "Apache Software Foundation",
  "homepage_url": "http://www.apache.org/licenses/",
  "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
  "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
  "spdx_license_key": "Apache-2.0",
  "spdx_url": "https://spdx.org/licenses/Apache-2.0",
  "start_line": 5,
  "end_line": 5,
  "matched_rule": {
    "identifier": "mit_or_apache-2.0_1.RULE",
    "license_expression": "mit OR apache-2.0",
    "licenses": [
      "mit",
      "apache-2.0"
    ],
    "matcher": "2-aho",
    "rule_length": 4,
    "matched_length": 4,
    "match_coverage": 100,
    "rule_relevance": 20
  }
}
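
A minimal sketch of how the matched_rule data could be consumed, assuming every entry in file.licenses carries a matched_rule.license_expression as in the fragment above. The helper name expressionsFromMatches and the sample file object are hypothetical and not part of the service.

```javascript
// Hypothetical helper: collect the distinct rule-level expressions for one file,
// instead of treating every entry in file.licenses as a separate license.
function expressionsFromMatches(file) {
  const expressions = new Set()
  for (const license of file.licenses || []) {
    const expression = license.matched_rule && license.matched_rule.license_expression
    // Fall back to the bare license key when no rule expression is present.
    expressions.add(expression || license.key)
  }
  return [...expressions]
}

// Assumed sample data: both matches come from the same "mit OR apache-2.0" rule,
// so they collapse into a single expression.
const file = {
  licenses: [
    { key: 'mit', matched_rule: { license_expression: 'mit OR apache-2.0' } },
    { key: 'apache-2.0', matched_rule: { license_expression: 'mit OR apache-2.0' } }
  ]
}
console.log(expressionsFromMatches(file)) // [ 'mit OR apache-2.0' ]
```

Deduplicating on the rule expression is what keeps the OR intact here; joining the two key-level entries would lose it.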

cc @pombredanne

pombredanne commented 5 years ago

@dabutvin sorry for the late reply. ScanCode returns detected licenses as expressions. You get these either:

  1. With the license_expressions attribute of a file object. This is a list of license expression strings found in a given file. If you want the composite expression, you could wrap each one in parentheses and AND them together.

  2. Alternatively, the licenses attribute lists each license together with the license_expression attribute of its matched_rule, which is a single string.

In both cases the expressions are made of ScanCode license keys (which are the keys in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses)
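
The first option can be sketched as follows. The helper name compositeExpression is hypothetical, and skipping the wrap when only one expression remains is a choice made for this sketch, not something ScanCode prescribes.

```javascript
// Hypothetical helper: build one composite expression from a file's
// license_expressions list by parenthesizing each entry and joining with AND.
function compositeExpression(licenseExpressions) {
  // Drop empty entries and duplicates before joining.
  const unique = [...new Set(licenseExpressions.filter(Boolean))]
  if (unique.length === 0) return null
  if (unique.length === 1) return unique[0]
  return unique.map(expression => `(${expression})`).join(' AND ')
}

console.log(compositeExpression(['mit OR apache-2.0', 'unicode']))
// (mit OR apache-2.0) AND (unicode)
```

Wrapping every entry is redundant for single keys but keeps an inner OR from being regrouped by the outer AND.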

If you want SPDX ids instead, each license object that has one carries an spdx_license_key attribute, so you could parse the expression and then replace the keys with SPDX IDs (and possibly use SPDX LicenseRef-... for license keys that are not known to SPDX).
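
A rough sketch of that key replacement, using a simple tokenizer rather than a full expression parser. The helper name toSpdxExpression, the sample mapping, and the LicenseRef-scancode- prefix used for unmapped keys are all assumptions for illustration; in practice the mapping would be built from each license object's spdx_license_key.

```javascript
// Assumed sample mapping from ScanCode keys to SPDX ids.
const spdxByKey = new Map([
  ['mit', 'MIT'],
  ['apache-2.0', 'Apache-2.0']
])

// Hypothetical helper: swap ScanCode keys for SPDX ids, leaving operators,
// whitespace and parentheses untouched. Keys missing from the mapping become
// LicenseRef-scancode-<key> (the prefix is an assumption).
function toSpdxExpression(expression, mapping) {
  const operators = new Set(['AND', 'OR', 'WITH'])
  return expression
    .split(/(\s+|\(|\))/) // the capture group keeps whitespace and parens as tokens
    .map(token => {
      if (!token || /^\s+$/.test(token) || token === '(' || token === ')') return token
      if (operators.has(token.toUpperCase())) return token.toUpperCase()
      return mapping.get(token) || `LicenseRef-scancode-${token}`
    })
    .join('')
}

console.log(toSpdxExpression('(mit OR apache-2.0) AND mit-synopsys', spdxByKey))
// (MIT OR Apache-2.0) AND LicenseRef-scancode-mit-synopsys
```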

Alternatively, we could update ScanCode to also return license_expressions made of SPDX keys.

FWIW, expressions are reasonably common in the jQuery world. Many Rust crates have such expressions. And there are several other cases: about 20% of the 7K rules in ScanCode return expressions with more than one license.

pombredanne commented 5 years ago

So in any case, AND'ing license objects is unlikely to be correct, as you are missing the OR and WITH cases.

dabutvin commented 5 years ago

Thanks @pombredanne I'm looking into fixing this now, should I expect a significant difference in scancode 3.0?

fossygirl commented 5 years ago

@pombredanne @dabutvin Just checking on this for our upgrade to Scancode 3.0?

dabutvin commented 5 years ago

so this has come a long way, but we still have one gap. The following explains the origin of the problem, not sure what the solution should be.

Here is the repo we are using for reference: https://github.com/rust-lang/regex. It is meant to be licensed as "MIT OR Apache-2.0".


With the latest scancode we get the right license on a file by file basis

[
  {
    path: 'src/backtrack.rs',
    license: 'MIT OR Apache-2.0',
    attributions: ['Copyright 2014-2015 The Rust Project'],
    hashes: { sha1: 'd40e188e9bd713fd1b858b914bae098b320d2799' }
  },
  {
    path: 'src/compile.rs',
    license: 'MIT OR Apache-2.0',
    attributions: ['Copyright 2014-2016 The Rust Project'],
    hashes: { sha1: '222404078d2e3eca8f79aa5ae2be4fb4cdf11c6f' }
  },

where we get it wrong is declaring the license

we are getting

{ licensed: { declared: 'Apache-2.0 AND MIT' },

To try and declare a license, first, we check the scancode_output.summary.packages[0].declared_license, but it is empty

"summary": {
  "license_expressions": [
    { "value": "mit OR apache-2.0", "count": 55 },
    { "value": null, "count": 53 },
    { "value": "apache-2.0 OR mit", "count": 2 },
    { "value": "apache-2.0", "count": 1 },
    { "value": "mit", "count": 1 },
    { "value": "mit-synopsys", "count": 1 }
  ],
  "copyrights": [{ "value": null, "count": 58 }, { "value": "Copyright (c) The Rust Project", "count": 28 }],
  "holders": [{ "value": null, "count": 58 }, { "value": "The Rust Project", "count": 28 }],
  "authors": [{ "value": null, "count": 84 }, { "value": "The Rust Project", "count": 2 }],
  "programming_language": [
    { "value": "Rust", "count": 69 },
    { "value": null, "count": 14 },
    { "value": "Objective-C", "count": 2 },
    { "value": "Python", "count": 1 }
  ],
  "packages": []
},

So we try to declare the license by reading the "is_license_text": true file property. In this case, there are 2 files marked as "is_license_text": true, LICENSE-APACHE and LICENSE-MIT, each with an individual license in it.

We take these licenses and AND them together to declare a license.

_getLicenseByIsLicenseText(files) {
  const fullLicenses = files
    .filter(file => file.is_license_text && file.licenses)
    .reduce((licenses, file) => {
      file.licenses.forEach(license => {
        licenses.add(this._createExpressionFromLicense(license))
      })
      return licenses
    }, new Set())
  return this._joinExpressions(fullLicenses)
}

this gives us Apache-2.0 AND MIT
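
The effect can be reproduced in isolation with the sketch below. The two helpers _createExpressionFromLicense and _joinExpressions are simplified stand-ins written for this example (the real implementations in the summarizer are not shown in this thread), and the file list is assumed sample data.

```javascript
const summarizer = {
  // Stand-in (assumption): use the SPDX key of the detected license.
  _createExpressionFromLicense(license) {
    return license.spdx_license_key
  },
  // Stand-in (assumption): sort and always join with AND.
  _joinExpressions(expressions) {
    return [...expressions].sort().join(' AND ')
  },
  // Same logic as the method quoted above.
  _getLicenseByIsLicenseText(files) {
    const fullLicenses = files
      .filter(file => file.is_license_text && file.licenses)
      .reduce((licenses, file) => {
        file.licenses.forEach(license => {
          licenses.add(this._createExpressionFromLicense(license))
        })
        return licenses
      }, new Set())
    return this._joinExpressions(fullLicenses)
  }
}

// Assumed sample data: two full license texts, plus a README that is ignored
// because it is not marked as license text.
const files = [
  { path: 'LICENSE-APACHE', is_license_text: true, licenses: [{ spdx_license_key: 'Apache-2.0' }] },
  { path: 'LICENSE-MIT', is_license_text: true, licenses: [{ spdx_license_key: 'MIT' }] },
  { path: 'README.md', is_license_text: false, licenses: [{ spdx_license_key: 'MIT' }] }
]
console.log(summarizer._getLicenseByIsLicenseText(files)) // Apache-2.0 AND MIT
```

Nothing in the two license text files says how they relate to each other, so the choice of AND comes entirely from the join step.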

dabutvin commented 5 years ago

this is still an issue with the summarizer

see https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee

the readme is correctly detecting MIT OR Apache-2.0 but our summarizer is joining them with ANDs for the declared

ignacionr commented 4 years ago

Finally I got some good input for you guys, but it will require us to decide how to handle things.

The thing is, our summarizer for Scancode will look into the files array provided by the tool, and not in the (very smart) content.summary.license_expressions. Going file-by-file, the system finds that there are both Apache and MIT identifiable licenses, and ANDs them together by default. As seen here.

On to the decision-making part.

For reference, ScanCode reports the following summarized counts and values for the package mentioned (the winner, with 117 files, is the expression we would be inclined to take into account):

[{"value":null,"count":159},{"value":"mit OR apache-2.0","count":117},{"value":"public-domain","count":16},{"value":"apache-2.0 OR mit","count":5},{"value":"apache-2.0","count":1},{"value":"mit","count":1},{"value":"mit-synopsys","count":1},{"value":"unknown","count":1}]

Still, how risky is it to decide by file count? What if several license expressions had the same file count?
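
To make the question concrete, here is a minimal sketch of a file-count heuristic that refuses to pick a winner on a tie. The helper names normalize and dominantExpression are hypothetical, and treating "mit OR apache-2.0" and "apache-2.0 OR mit" as the same expression is an assumption of this sketch. It does not settle whether file count is a safe basis for a declared license.

```javascript
// Hypothetical helper: sort the operands of a flat OR or AND expression so that
// "a OR b" and "b OR a" compare equal. Mixed or parenthesized expressions are
// returned unchanged.
function normalize(expression) {
  for (const operator of [' OR ', ' AND ']) {
    const parts = expression.split(operator)
    const flat = parts.every(part => !/[\s()]/.test(part))
    if (parts.length > 1 && flat) return parts.sort().join(operator)
  }
  return expression
}

// Hypothetical helper: return the expression with the highest file count, or
// null when the top count is shared and there is no clear winner.
function dominantExpression(licenseExpressions) {
  const counts = new Map()
  for (const { value, count } of licenseExpressions) {
    if (!value) continue // files with no detected license
    const key = normalize(value)
    counts.set(key, (counts.get(key) || 0) + count)
  }
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1])
  if (ranked.length === 0) return null
  if (ranked.length > 1 && ranked[0][1] === ranked[1][1]) return null
  return ranked[0][0]
}

// The counts quoted above.
const summary = [
  { value: null, count: 159 },
  { value: 'mit OR apache-2.0', count: 117 },
  { value: 'public-domain', count: 16 },
  { value: 'apache-2.0 OR mit', count: 5 },
  { value: 'apache-2.0', count: 1 },
  { value: 'mit', count: 1 },
  { value: 'mit-synopsys', count: 1 },
  { value: 'unknown', count: 1 }
]
console.log(dominantExpression(summary)) // apache-2.0 OR mit
```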

All comments appreciated.

geneh commented 4 years ago

The component's license is MIT OR Apache-2.0, and not MIT AND Apache-2.0 as initially reported, for https://clearlydefined.io/definitions/crate/cratesio/-/regex/1.0.6. It is, however, MIT AND Apache-2.0 for https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee. scancode 3.2.2 was run for both of the components. Any idea why the declared license is different? As to whether it is dangerous or not, we would need to analyze tens or hundreds of sample packages to be sure. We probably do not want to invest time in that right now.

ignacionr commented 4 years ago

@geneh I am looking at the example you give with cratesio, but initially the harvested info is a lot more complex and for the older version also includes earlier version runs of ScanCode, which may be an explanation (even though I will debug this to make sure this is the case).

ignacionr commented 4 years ago

@geneh what would you say if we remove the original data, and recrawl? We would lose existing information from ScanCode, but I think that is exactly what is wrong. Just an idea.

geneh commented 4 years ago

@ignacionr Do you mean recrawl all the components? Why would the outcome be any different if the scancode version is the same? Do you mean we should recompute definitions for the affected scancode results?

tmarble commented 4 years ago

I just ran scancode on the most recent version of regex and saved the result as a gist. Please note that scancode has different results for...

Compare this to the code snippet above from scancode.js which calls _joinExpressions (that function always combines with AND) for all the Full license files. And thus we would expect that result to be "mit AND apache-2.0 AND unicode", but it is only "mit AND apache-2.0" because the Unicode is only referenced in the README (the full Unicode license is not in the top level directory, but rather in regex-syntax/src/unicode_tables/LICENSE-UNICODE).

Because the resulting package "regex" contains a compilation of Unicode data and Rust source code the likely correct SPDX expression for the ensemble is "(mit OR apache-2.0) AND unicode". NOTE: the parentheses are required because otherwise the AND takes precedence, per 4) Order of Precedence and Parentheses in the SPDX spec.
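
The precedence point can be illustrated with a small recursive-descent sketch that makes the implicit grouping explicit. It handles only AND, OR and parentheses (WITH and the + operator are left out) and does no error handling; the function name explicitGrouping is hypothetical.

```javascript
// Returns the expression with every binary operation wrapped in parentheses,
// applying AND before OR as the SPDX precedence rules require.
function explicitGrouping(expression) {
  const tokens = expression.match(/\(|\)|[^\s()]+/g) || []
  let position = 0

  // Lowest precedence: a sequence of AND-groups separated by OR.
  function parseOr() {
    let left = parseAnd()
    while (tokens[position] === 'OR') {
      position++
      left = `(${left} OR ${parseAnd()})`
    }
    return left
  }

  // Higher precedence: a sequence of primaries separated by AND.
  function parseAnd() {
    let left = parsePrimary()
    while (tokens[position] === 'AND') {
      position++
      left = `(${left} AND ${parsePrimary()})`
    }
    return left
  }

  // A license key or a parenthesized sub-expression.
  function parsePrimary() {
    const token = tokens[position++]
    if (token === '(') {
      const inner = parseOr()
      position++ // skip the closing parenthesis
      return inner
    }
    return token
  }

  return parseOr()
}

console.log(explicitGrouping('mit OR apache-2.0 AND unicode'))
// (mit OR (apache-2.0 AND unicode))
console.log(explicitGrouping('(mit OR apache-2.0) AND unicode'))
// ((mit OR apache-2.0) AND unicode)
```

Without the parentheses, the unicode term attaches only to apache-2.0, which is a different licensing statement from the one intended for the ensemble.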

So here are the questions:

  1. Should we rely on the Scancode licenses for Cargo.toml or README.md?
  2. Should we rely on the discovered full license text(s) in the top level directory?
  3. Which takes priority between 1) and 2)?
  4. Should the answer apply to just Cargo (Rust) or other package types?

fossygirl commented 4 years ago

@jeffmcaffer @pombredanne @iamwillbar Would love to hear your opinions on this one.