aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Add tests for all licenses in https://calculate-linux.org/packages/licenses/ #1499

Open pombredanne opened 5 years ago

pombredanne commented 5 years ago

Description

See https://calculate-linux.org/packages/licenses/ as this could be yet another source of licenses. Based on reports by @reversi-fun in https://github.com/spdx/license-list-XML

reversi-fun commented 5 years ago

Fine.

In my tool, I built a correspondence table between the scancode-toolkit licenses and the calculate-linux and SPDX license IDs. The following CSV file should be useful for your verification work: https://github.com/reversi-fun/license_doc_similality1/blob/master/data/filePattern2License.csv

license-jPython-DOR

Perhaps you'll struggle with resolving the per-version licenses of Python. Can you identify calculate-Linux/JPython as a license different from {calculate-Linux/Python, spdx/CNRI-Python-GPL-Compatible, FSF/DOR}?

If you look up, in the directed graph at the following URL, the similarity between the licenses shown in the attached figure, you will see the difficulty (or the lack of data in scancode-toolkit\src\licensedcode\data\licenses).

https://github.com/reversi-fun/license_doc_similality1/blob/master/data/lic_graph.fdp.svg

reversi-fun commented 5 years ago

By the way, are there any good suggestions for naming the licenses of the Java/JSP-JSR family?

Even though the *.java files are almost identical, there are three variants of the license, depending on the download location and download time for each implementor.

For example, there are three types of licenses for JSR107.

  1. JSR-000107 JCache - Java™ Temporary Caching API, Final Release for Evaluation (no patent terms): https://download.oracle.com/otndocs/jcp/jcache-1_0-fr-eval-spec/license.html

  2. File scancode-toolkit/src/licensedcode/data/licenses/jsr-107-jcache-spec.LICENSE (includes patent terms)

  3. JSR-000107 JCache as implemented by org.sonatype.oss (includes patent terms; Apache-2.0): https://github.com/jsr107/jsr107spec/blob/master/LICENSE.txt

license-jsr107

pombredanne commented 5 years ago

@reversi-fun thank you ++ this is super interesting! The number of links starts to stretch the ability to visualize all these ;) ... these are large graphics but very nice!

I have a few questions:

  1. out of the various similarity metrics, which one did you use for the ScanCode license graph?

  2. I am not sure what you meant by this:

Perhaps you'll struggle with resolving the per-version licenses of Python. Can you identify calculate-Linux/JPython as a license different from {calculate-Linux/Python, spdx/CNRI-Python-GPL-Compatible, FSF/DOR}?

ScanCode eventually uses a multi-diff approach so it should be OK, though we did not track the CNRI DOR licenses (I have them as rules for now and not as named licenses). Thank you ++ for that!

  3. I am curious about the context of your work, if you care to elaborate.

pombredanne commented 5 years ago

@reversi-fun re:

By the way, are there any good suggestions for naming the licenses of the Java/JSP-JSR family? Even though the *.java files are almost identical, there are three variants of the license, depending on the download location and download time for each implementor.

Not really, I name them as they come. What is sure is that Sun, then Oracle, created a jolly mess with these, if you want my take :smiley_cat: and as you quite rightly found out: we often have almost the same code and many licenses, depending on the dates and download locations.

pombredanne commented 5 years ago

@reversi-fun btw I fetched the various CNRI/Python/Jython licenses you mentioned above:

The scan results are there and seem pretty good now that I have added the rare DOR license as rules with https://github.com/nexB/scancode-toolkit/commit/042d9fa814cdf53995e4d58b338e5a0a9661b78a. See the JSON results in cnri.txt.

pombredanne commented 5 years ago

@reversi-fun

Perhaps you'll struggle with resolving the per-version licenses of Python. Can you identify calculate-Linux/JPython as a license different from {calculate-Linux/Python, spdx/CNRI-Python-GPL-Compatible, FSF/DOR}?

The answer to your question is that there is no "struggle" for this and that each license is properly detected and identified separately.

The scan results in cnri.txt are effectively distinguishing these since ScanCode does a multi diff:

pombredanne commented 5 years ago

BTW being a license nerd, this gave me a good laugh https://github.com/reversi-fun/license_doc_similality1/blob/master/LICENSE.txt#L3 :dancer: :+1:

reversi-fun commented 5 years ago

@pombredanne Fine! I was cheered by your praise. "license nerd", Me too??(^^)??.

You or fossology.org may be recognized as a "wise person" by users of license_doc_similarity. The "bad guys" in the "I was tiered license" are the expensive-tool makers such as a black-duck or a white-horse, greedy users, or me, who might betray you.

So, if you don't become a "wise person" before the "bad guys" do, you should worry that your product will be called "a derivative of license_doc_similarity". So far, you have stated that you can create your own artifacts without using license_doc_similarity, so you are not bound by my license.

Depending on the expectations for the "wise person" of the community, it will have strong propagation (Only one generation of infection).

I have also considered adding a procedure to the license.

But I dropped the idea, as it would not be necessary for those who understand the license expression "WTFPL + MPL-2.0 - Bison-exception-2.2"

because "I was tiered".

reversi-fun commented 5 years ago

I will let you know what a "wise person" needs to know.

ScanCode uses eventually a multi diff approach so it should be OK

I agree. I saw cnri.txt[https://github.com/nexB/scancode-toolkit/files/3050837/cnri.txt]. I admit, unexpectedly, that scancode-toolkit is great.

BUT... you suddenly created a new rule file (proprietary_118.RULE). The weakness is that this approach cannot reveal the legal risk that the license would be considered "NON-FREE" by the FSF (FSF/DOR). With license_doc_similality, you can discover such risks without adding new machine learning data. Even if a duty of care as a bona fide administrator is pursued, it will be enough, as it collects sample data beyond the {SPDX & OSI & one set of Linux} license lists.

The goals I could not reach were:

  1. Discovering that a third party's deliverables are bundled.
  2. Discovering differences from officially recognized licenses, that is, discovering hidden legal risks.

The ability to prompt the discovery of #2 appears to be implemented in scancode-toolkit's "match_coverage", much like the "similarity metrics" in my license_doc_similarity.

The degree of my fineness will be seen in the previous issue [https://github.com/spdx/license-list-XML/issues/840]

  1. ....., which one did you use for the ScanCode license graph?

1.2-1 Open the following file in a browser and search for "research/scancode-toolki" with Ctrl-F. You can then see, for each file location, the similarity with the known license names. https://github.com/reversi-fun/license_doc_similality1/blob/master/data/lic_graph.fdp.svg

1.2-2 Open the following file in MS Excel and filter on "scancode-toolkit" in the "fileIdentifier" column. Look at the "similarity (s)" and "licenseName (s)" columns. https://github.com/reversi-fun/license_doc_similality1/blob/master/data/filePattern2License.csv

"similarity (s)"   "licenseName (s)"
0.87378            spdx/DOC
0.86078            calculate-Linux/ACE
0.87574            "research/scancode-toolkit/tests/licensedcode/data/more_licenses/tests/ACE/ACE-copying.html"

license-ACE-as-doc

 .  scancode-toolkit path pattern: tests\<correct license name>\<sample's name>
 .  license_doc_similality1 CSV data: "fileIdentifier" = <sample's name>, "licenseName (s)" = <correct license name>, each with its similarity
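
The Excel filtering step described above can also be scripted. The following is a minimal sketch, assuming the CSV has a header row with the three column names mentioned in the comment and one value per cell; the sample rows, the `rows_for` helper, and the license names in it are hypothetical and only illustrate the shape of the lookup, not the real contents of filePattern2License.csv.

```python
import csv
import io

# Hypothetical excerpt. Only the three column names come from the comment
# above; the rows and the real file layout are assumptions for illustration.
SAMPLE_CSV = """fileIdentifier,similarity (s),licenseName (s)
research/scancode-toolkit/tests/example/sample-a.txt,0.91,spdx/Example-1.0
research/other-project/LICENSE,0.80,spdx/Example-2.0
research/scancode-toolkit/tests/example/sample-b.txt,0.75,calculate-Linux/Example
"""


def rows_for(csv_text, needle):
    """Return (fileIdentifier, similarity, licenseName) for matching rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["fileIdentifier"], float(row["similarity (s)"]), row["licenseName (s)"])
        for row in reader
        if needle in row["fileIdentifier"]
    ]


# Keep only the rows whose fileIdentifier mentions scancode-toolkit,
# then list them from most to least similar.
matches = rows_for(SAMPLE_CSV, "scancode-toolkit")
for file_id, similarity, license_name in sorted(matches, key=lambda m: -m[1]):
    print(f"{similarity:.2f}  {license_name}  {file_id}")
```

To try this on the real data, the text of the downloaded CSV could be passed in place of SAMPLE_CSV, provided its header actually matches these column names.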
  1. out of the various similarity metrics,

The "similarity metrics" in license_doc_similality are the doc2vec similarity (gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity) and, in the "license graph", the word count difference. https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity

I understand it to be the cosine similarity of vectors of the frequency of occurrence of adjacent word pairs. In my experience the gensim doc2vec similarity behaved like an n-gram analysis, and the frequency of adjacent word pairs could recognize negative sentences. The gensim LDA topic similarity, based on mere word frequency, could not distinguish between negative and positive sentences.
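
To make the point about negation concrete, here is a small sketch of cosine similarity over plain count vectors, comparing single-word counts with adjacent-word-pair counts. It only illustrates the interpretation described above; it is not how gensim's doc2vec works internally (doc2vec learns dense embeddings), and the helper names and the two sample sentences are assumptions made up for this example.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams (as tuples of lowercase words) in a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two made-up sentences that differ only by a negation.
permissive = "you may copy this software"
restrictive = "you may not copy this software"

# Word frequencies alone barely notice the inserted "not".
unigram_sim = cosine(ngram_counts(permissive, 1), ngram_counts(restrictive, 1))
# Adjacent word pairs change around the negation, so the score drops more.
bigram_sim = cosine(ngram_counts(permissive, 2), ngram_counts(restrictive, 2))

print(f"word frequency similarity:      {unigram_sim:.3f}")
print(f"adjacent word pair similarity:  {bigram_sim:.3f}")
```

With these two sentences the pair-based score is lower than the word-based one, because "may not" and "not copy" are new pairs while only one new word appears.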