RFC: a plan for false positive license detection

pombredanne commented 2 years ago

Context

We are reporting too many false positive licenses. We need to fix this!

Problem

There are several false cases, yet they boil down to these types:

1. False detection of very short and weak license detection rules that are matched exactly, such as:
   - a URL or a project name, such as a URL to a well-known AGPL-licensed project, which is not always a sign of AGPL, as in https://github.com/nexB/scancode-toolkit/issues/2877
   - the detection of the word GPL in a binary, as in https://github.com/nexB/scancode-toolkit/issues/2874
   - the detection of the longer "may not be modified" in https://github.com/nexB/scancode-toolkit/issues/2865
2. Detection of a license text or notice fragment which is too weak to represent a bona fide license detection alone.
3. Detection of longer unknown license references such as:
   - a "license introduction" (as in "This is licensed under....") that may be noisy when followed by a bona fide license notice or text.
   - a license reference to the license in a file (as in "See file COPYING for license") where we can follow the reference.
4. Lack of proper detection of a structured license tag found in a package manifest, which is returned as an unknown license.
5. When fragments of the same license are detected with only copyrights added in between, as in https://github.com/nexB/scancode-toolkit/issues/2859
6. When sequences of SPDX license ids are found in license detection tools.

Please add yours!

Solution elements

We could treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection
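As a rough illustration of reporting clues separately, here is a minimal Python sketch. It assumes matches shaped like the `licenses` entries of a ScanCode scan (a `matched_rule` mapping with `rule_length` and `has_unknown`). The threshold `MAX_CLUE_RULE_LENGTH` and the helper names are hypothetical, not ScanCode behavior or API.

```python
from typing import Dict, List, Tuple

# Hypothetical threshold: unknown rules with this many words or fewer
# are treated as mere clues. The value is an assumption for illustration.
MAX_CLUE_RULE_LENGTH = 4


def is_clue(match: Dict) -> bool:
    """Return True if a match looks too weak to be a detection on its own."""
    rule = match["matched_rule"]
    return bool(rule["has_unknown"]) and rule["rule_length"] <= MAX_CLUE_RULE_LENGTH


def split_clues(matches: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split matches into (detections, clues) without dropping anything."""
    detections: List[Dict] = []
    clues: List[Dict] = []
    for match in matches:
        if is_clue(match):
            clues.append(match)
        else:
            detections.append(match)
    return detections, clues
```

Keeping the clues in a separate list, rather than discarding them, preserves the insight they may carry while keeping them out of the main license results.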

The upcoming two-step process where license matches are grouped in a license detection is another way to consider. We could detect patterns of license matches that could be resolved in a detection. For instance a license intro followed by a license notice.
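The "license intro followed by a license notice" pattern could be sketched as below. This is only an assumption about how such grouping might work, not the planned implementation: it takes matches shaped like ScanCode scan results (`start_line`, `end_line`, and a `matched_rule` with `is_license_intro` and `license_expression`), and the names `group_matches`, `detection_expression` and `max_line_gap` are hypothetical.

```python
from typing import Dict, List, Optional


def group_matches(matches: List[Dict], max_line_gap: int = 1) -> List[List[Dict]]:
    """Group each license intro with the non-intro match that directly follows it."""
    # sorted() is stable, so matches on the same lines keep their input order.
    ordered = sorted(matches, key=lambda m: (m["start_line"], m["end_line"]))
    groups: List[List[Dict]] = []
    pending_intro: Optional[Dict] = None
    for match in ordered:
        is_intro = bool(match["matched_rule"]["is_license_intro"])
        if pending_intro is not None:
            gap = match["start_line"] - pending_intro["end_line"]
            if gap <= max_line_gap and not is_intro:
                # The intro is absorbed by the match that follows it.
                groups.append([pending_intro, match])
                pending_intro = None
                continue
            # Nothing suitable followed: the intro stays on its own.
            groups.append([pending_intro])
            pending_intro = None
        if is_intro:
            pending_intro = match
        else:
            groups.append([match])
    if pending_intro is not None:
        groups.append([pending_intro])
    return groups


def detection_expression(group: List[Dict]) -> str:
    """Prefer the expression of a non-intro match when the group has one."""
    non_intro = [m for m in group if not m["matched_rule"]["is_license_intro"]]
    chosen = non_intro or group
    return chosen[-1]["matched_rule"]["license_expression"]
```

With this grouping, an intro reported as unknown-license-reference right before a known license reference would resolve to the known license only.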

The scancode-analyzer heuristics and ML-based detection of false positives is another way.

porsche-rishisaxena commented 2 years ago

Hi Philippe,

In reference to our collective ORT community meeting, we touched base on the false positive license detection 2 weeks ago on version v30.1.0, where Porsche AG OSO also consolidated a report of false-positive cases. Please find the report attached for your reference and review.

report_false_positives.xlsx

CC: @sschuberth

PatteSI commented 2 years ago

Thank you for taking action here. I will now have a deeper look into our false positive findings as well.

EDIT: I forgot to mention that everything I mention below was found using ScanCode 30.1.0.

At first glance it seems that many LicenseRef-scancode-free-unknown and LicenseRef-scancode-unknown-license-reference findings in our Java projects are actually found in META-INF/LICENSE files created by Maven inside the JARs.

Example: https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.1/jackson-dataformat-yaml-2.13.1.jar

In this case line 3: "Jackson is a high-performance, Free/Open Source JSON processing library." and line 11-13: "Jackson core and extension components may be licensed under different licenses. To find the details that apply to this artifact see the accompanying LICENSE file. For more information, including possible other licensing options, contact"

The latter one would probably fall into point 3 mentioned above, a reference to another license file. No idea about the first one though. Not sure why they are only found in the JAR binary and not also in the actual source repo with the same text: https://github.com/FasterXML/jackson-dataformats-text/blob/2.14/properties/src/main/resources/META-INF/NOTICE

Another interesting example is okhttp3, because the false positive that is found was actually introduced by yourself @pombredanne ;-) : https://github.com/square/okhttp/issues/4569. The current file in my example is this one: https://github.com/square/okhttp/blob/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE. The license LicenseRef-scancode-unknown-license-reference is found in line 4: "It is subject to the terms of the Mozilla Public License, v. 2.0:" I don't understand why in this case MPL v.2.0 is not recognized correctly.

PatteSI commented 2 years ago

I created a Python parser that can parse the evaluated-model.json file created by the ORT Reporter. It currently scans for a list of problematic LicenseRef-scancode license keys which are mostly (always?) causing false positives: https://gist.github.com/PatteSI/5904f4bdfb149dc1ce8c73da53e2f6ae I parsed a couple of our components and this is the result. Of course it still contains a lot of duplicates (it's a JSON file but GitHub won't let me upload it as .json): falsePosFindingFinal.txt
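For readers who do not want to open the gist, the general idea of such a filter can be sketched as follows. This does not reproduce the gist or the real evaluated-model.json schema: the flat finding dictionaries (`license`, `path`, `start_line`, `end_line`) and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List

# License ids named in this thread as frequent false positives.
SUSPECT_LICENSES = {
    "LicenseRef-scancode-free-unknown",
    "LicenseRef-scancode-unknown-license-reference",
}


def suspect_findings(findings: Iterable[Dict]) -> List[Dict]:
    """Keep only the findings whose license id is in the suspect set."""
    return [f for f in findings if f["license"] in SUSPECT_LICENSES]


def drop_duplicates(findings: Iterable[Dict]) -> List[Dict]:
    """Remove findings repeating the same license, path and line range."""
    seen = set()
    unique: List[Dict] = []
    for finding in findings:
        identity = (
            finding["license"],
            finding["path"],
            finding["start_line"],
            finding["end_line"],
        )
        if identity not in seen:
            seen.add(identity)
            unique.append(finding)
    return unique
```

Running `drop_duplicates(suspect_findings(...))` over the parsed findings would yield one entry per distinct location, which is easier to review than the raw list.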

pombredanne commented 2 years ago

@porsche-rishisaxena Thank you ++ for the list of false positives in https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1054006455 ... this is great and actionable!

pombredanne commented 2 years ago

@PatteSI re: https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1054555346

In this case line 3: "Jackson is a high-performance, Free/Open Source JSON processing library." and line 11-13: "Jackson core and extension components may be licensed under different licenses.

These two look like basic license-related clues, but they are not real license statements, all right.

Here is the detection I get:

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0
        options:
            input:
                - jackson-dataformat-yaml-2.13.1.jar-extract/META-INF/NOTICE
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-03T065555.808073'
        end_timestamp: '2022-03-03T065557.886174'
        output_format_version: 2.0.0
        duration: '2.078113555908203'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: NOTICE
        type: file
        licenses:
            -   key: free-unknown
                score: '100.0'
                name: Free unknown license detected but not recognized
                short_name: Free unknown
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/free-unknown
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.yml
                spdx_license_key: LicenseRef-scancode-free-unknown
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/free-unknown.LICENSE
                start_line: 3
                end_line: 3
                matched_rule:
                    identifier: free-unknown_85.RULE
                    license_expression: free-unknown
                    licenses:
                        - free-unknown
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: yes
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: Free/Open Source
            -   key: unknown-license-reference
                score: '92.86'
                name: Unknown License file reference
                short_name: Unknown License reference
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/unknown-license-reference
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
                spdx_license_key: LicenseRef-scancode-unknown-license-reference
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                start_line: 11
                end_line: 13
                matched_rule:
                    identifier: unknown-license-reference_224.RULE
                    license_expression: unknown-license-reference
                    licenses:
                        - unknown-license-reference
                    referenced_filenames:
                        - LICENSE
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: yes
                    matcher: 3-seq
                    rule_length: 28
                    matched_length: 26
                    match_coverage: '92.86'
                    rule_relevance: 100
                matched_text: |
                    licensed under different licenses.
                    To find the details that apply to this artifact see the accompanying LICENSE file.
                    For more information, including possible other licensing options,
        license_expressions:
            - free-unknown
            - unknown-license-reference
        percentage_of_license_text: '24.37'
        scan_errors: []

Not sure why they are only found in the JAR binary and not also in the actual source repo with the same text:

This is weird: I got them the same way in both cases. Could it be ORT handling things differently in these cases?

Another interesting example is okhttp3 because the false positive that is found was actually introduced by yourself

Oh well.... as the saying goes, "no good deed goes unpunished!"

https://raw.githubusercontent.com/square/okhttp/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE scans this way:

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0
        options:
            input:
                - NOTICE.1
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-03T065940.353825'
        end_timestamp: '2022-03-03T065942.146784'
        output_format_version: 2.0.0
        duration: '1.792968988418579'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: NOTICE.1
        type: file
        licenses:
            -   key: unknown-license-reference
                score: '60.0'
                name: Unknown License file reference
                short_name: Unknown License reference
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/unknown-license-reference
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
                spdx_license_key: LicenseRef-scancode-unknown-license-reference
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                start_line: 4
                end_line: 4
                matched_rule:
                    identifier: license-intro_3.RULE
                    license_expression: unknown-license-reference
                    licenses:
                        - unknown-license-reference
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: yes
                    has_unknown: yes
                    matcher: 2-aho
                    rule_length: 4
                    matched_length: 4
                    match_coverage: '100.0'
                    rule_relevance: 60
                matched_text: subject to the terms
            -   key: mpl-2.0
                score: '100.0'
                name: Mozilla Public License 2.0
                short_name: MPL 2.0
                category: Copyleft Limited
                is_exception: no
                is_unknown: no
                owner: Mozilla
                homepage_url: http://mpl.mozilla.org/2012/01/03/announcing-mpl-2-0/
                text_url: http://www.mozilla.com/MPL/2.0/
                reference_url: https://scancode-licensedb.aboutcode.org/mpl-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.yml
                spdx_license_key: MPL-2.0
                spdx_url: https://spdx.org/licenses/MPL-2.0
                start_line: 4
                end_line: 4
                matched_rule:
                    identifier: mpl-2.0_90.RULE
                    license_expression: mpl-2.0
                    licenses:
                        - mpl-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 6
                    matched_length: 6
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: 'Mozilla Public License, v. 2.0:'
            -   key: mpl-2.0
                score: '50.0'
                name: Mozilla Public License 2.0
                short_name: MPL 2.0
                category: Copyleft Limited
                is_exception: no
                is_unknown: no
                owner: Mozilla
                homepage_url: http://mpl.mozilla.org/2012/01/03/announcing-mpl-2-0/
                text_url: http://www.mozilla.com/MPL/2.0/
                reference_url: https://scancode-licensedb.aboutcode.org/mpl-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-2.0.yml
                spdx_license_key: MPL-2.0
                spdx_url: https://spdx.org/licenses/MPL-2.0
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: spdx_license_id_mpl-2.0_for_mpl-2.0.RULE
                    license_expression: mpl-2.0
                    licenses:
                        - mpl-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 50
                matched_text: MPL/2.0/
        license_expressions:
            - unknown-license-reference
            - mpl-2.0
            - mpl-2.0
        percentage_of_license_text: '33.33'
        scan_errors: []

You will note that I am using these command line options: --license --license-text --license-text-diagnostics --yaml - which means: detect licenses, include the actual license text but limited to the exact portion of text that was matched, and report the results as YAML directly on screen (with the dash) rather than to a file.

This overall looks like a case where we could merge the license intro "subject to the terms" with a following notice. Separately, there are also some missing rules that would help catch more of the MPL URL and more of the MPL details in general.

pombredanne commented 2 years ago

@PatteSI re: https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1055541382

I created a Python parser that can parse the evaluated-model.json file created by the ORT Reporter.

This is great! Ideally what I would need is a script that would fetch the code. With that I could run extractcode to extract any archive and run a scan to get the actual details. I think this can be derived from your JSON.
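Such a fetch, extract and scan script could look roughly like the sketch below. It assumes the `extractcode` and `scancode` commands are installed and on the PATH, that extraction lands in a sibling `<archive>-extract` directory (as in the scan input path shown earlier in this thread), and that the URL list comes from elsewhere. The helper names are hypothetical.

```python
import subprocess
import urllib.request
from pathlib import Path
from typing import List


def fetch(url: str, dest_dir: Path) -> Path:
    """Download url into dest_dir and return the local file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(target))
    return target


def extract_command(archive: Path) -> List[str]:
    """Command line to extract an archive with extractcode."""
    return ["extractcode", str(archive)]


def scan_command(path: Path) -> List[str]:
    """Command line for a license scan printed as YAML on stdout."""
    return [
        "scancode",
        "--license",
        "--license-text",
        "--license-text-diagnostics",
        "--yaml", "-",
        str(path),
    ]


def rescan(url: str, workdir: Path) -> str:
    """Fetch an archive, extract it, scan the extracted tree, return the YAML."""
    archive = fetch(url, workdir)
    subprocess.run(extract_command(archive), check=True)
    # Assumption: extractcode writes next to the archive with an "-extract" suffix.
    extracted = archive.with_name(archive.name + "-extract")
    result = subprocess.run(
        scan_command(extracted), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling `rescan()` once per reported artifact URL would reproduce the detections with the full match details.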

pombredanne commented 2 years ago

@porsche-rishisaxena re: https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1054006455

The CSV is super useful and I can derive a script to automate rescanning from this too.

In your case and @PatteSI's case, creating these data required a lot of (useful) work. I am wondering what tools would make it easier for you to report these false positives.

sschuberth commented 2 years ago

Could it be ORT handling things differently in these cases?

ORT is not handling findings in binaries or sources differently per se, and is taking ScanCode findings mostly as-is (except some post-processing to remedy #2873). But it might be that some project-specific path excludes were applied in that particular case.

sschuberth commented 2 years ago

I am wondering what tools would make it easier for you to report these false positives.

For ORT, if false-positives were addressed via package configurations, we could quite easily extract the detected_license vs. the concluded_license.

@fviernau, is that something that could be done from HERE's (probably massive) amount of package configurations?

pombredanne commented 2 years ago

@fviernau

is that something that could be done from HERE's (probably massive) amount of package configurations?

If there is something that can be shared, it could be used to fix a large number of these false positives! :)

pombredanne commented 2 years ago

Here are some related issues:

https://github.com/nexB/scancode-toolkit/issues/270 reported by @yahalom5776
https://github.com/nexB/scancode-toolkit/issues/2877
https://github.com/nexB/scancode-toolkit/issues/2865 by @sschuberth and @PatteSI
https://github.com/nexB/scancode-toolkit/issues/2815 by @sschuberth
https://github.com/nexB/scancode-toolkit/issues/2769 by @mjherzog (which contains unknown words interspersed between the words of a license name)
https://github.com/nexB/scancode-toolkit/issues/2735
https://github.com/nexB/scancode-toolkit/issues/2726
https://github.com/nexB/scancode-toolkit/issues/2651
https://github.com/nexB/scancode-toolkit/issues/2502 by @tardyp
https://github.com/nexB/scancode-toolkit/issues/2371 @xu1119
https://github.com/nexB/scancode-toolkit/issues/2403
https://github.com/nexB/scancode-toolkit/issues/2374
https://github.com/nexB/scancode-toolkit/issues/2304 by @Thalley
https://github.com/nexB/scancode-toolkit/issues/2170 by @qduanmu
https://github.com/nexB/scancode-toolkit/issues/1895
https://github.com/nexB/scancode-toolkit/issues/1731

These ones can likely be fixed with the new key phrases feature:

https://github.com/nexB/scancode-toolkit/issues/2577 by @leChasseur
https://github.com/nexB/scancode-toolkit/issues/2551 by @MarcelBochtler
https://github.com/nexB/scancode-toolkit/issues/2550 by @hanna-modica

These could help with diagnosing false positives:

https://github.com/nexB/scancode-toolkit/issues/1122 reported by @muzsielod
https://github.com/nexB/scancode-toolkit/issues/2874 by @kiranravindran90

This is an example of a weak detection for a new license:

https://github.com/nexB/scancode-toolkit/issues/2503 by @tardyp

This may help with some false positives:

https://github.com/nexB/scancode-toolkit/issues/1995 by @MankaranSingh and @armijnhemel
https://github.com/nexB/scancode-toolkit/issues/1838 by @furuholm

@rspier ping too

I think we should have a live call to discuss the options to fix these. What do you think?

bennati commented 2 years ago

@pombredanne I attach the false positives from a bunch of HERE curations, as produced by @PatteSI 's script. Hope this helps, falsepositives.txt

sschuberth commented 2 years ago

I think we should have a live call to discuss the options to fix these. What do you think?

To be frank, I believe having a live call with all reporters of the false positives mentioned here would be overkill. Also, I guess most people don't care too much how their issue is fixed as long as it is fixed.

From my side, however, I'd strongly vote against hard-coding just the reported cases as false-positives. Instead, we should

* ensure that rules always contain enough words / context to confidently identify licenses in general.

* never allow a score of 100% for unknown licenses.

* think about tweaking the score to be based on user feedback instead of being calculated: If a rule reportedly causes many false-positives, its score could be manually lowered.
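The last two points could be sketched as follows. The formula (match coverage times rule relevance, divided by 100) is only inferred from the sample scan output earlier in this thread and may not be ScanCode's exact computation. The cap `UNKNOWN_SCORE_CAP`, the override table and its value are hypothetical.

```python
from typing import Dict

# Hypothetical cap: unknown licenses never reach a score of 100.
UNKNOWN_SCORE_CAP = 90.0

# Hypothetical, manually curated relevance overrides driven by user reports.
RELEVANCE_OVERRIDES: Dict[str, int] = {
    "free-unknown_85.RULE": 30,
}


def adjusted_score(match: Dict) -> float:
    """Recompute a match score with feedback overrides and an unknown cap."""
    rule = match["matched_rule"]
    relevance = RELEVANCE_OVERRIDES.get(rule["identifier"], rule["rule_relevance"])
    # float() accepts both numbers and the quoted values seen in the YAML output.
    score = float(rule["match_coverage"]) * relevance / 100.0
    if rule["has_unknown"]:
        score = min(score, UNKNOWN_SCORE_CAP)
    return round(score, 2)
```

A table like `RELEVANCE_OVERRIDES` keeps the feedback in data rather than in code, so lowering a noisy rule does not require touching the matching logic.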

armijnhemel commented 2 years ago

From my side, however, I'd strongly vote against hard-coding just the reported cases as false-positives. Instead, we should

* ensure that rules always contain enough words / context to confidently identify licenses in general.

* never allow a score of 100% for unknown licenses.

* think about tweaking the score to be based on user feedback instead of being calculated: If a rule reportedly causes many false-positives, its score could be manually lowered.

Be careful to not fall into the "perfect is the enemy of good" trap. If trying to avoid the false positives from happening in the first place significantly complicates the code (making it harder to maintain/change/etc.) then I don't see a problem with hardcoding the false positives.

But this depends on how many of the results are false positives. @pombredanne do you have an idea of the scale of false positives? How many results are false positives? 1%? 10%? 0.0000001%?

sschuberth commented 2 years ago

But this depends on how many of the results are false positives.

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.
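Once findings have been reviewed by hand, a per-key false-positive rate is straightforward to compute. The sketch below assumes a list of `(license_key, is_false_positive)` pairs produced by such a review; the function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rates(reviewed: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the share of false positives for each license key."""
    totals: Dict[str, int] = defaultdict(int)
    false_positives: Dict[str, int] = defaultdict(int)
    for key, is_false_positive in reviewed:
        totals[key] += 1
        if is_false_positive:
            false_positives[key] += 1
    return {key: false_positives[key] / totals[key] for key in totals}
```

Feeding the reviewed reports attached to this thread through such a function would turn the "close to 100%" impression into a number per license key.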

pombredanne commented 2 years ago

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

PatteSI commented 2 years ago

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

armijnhemel commented 2 years ago

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify?

Depending on what is meant there are different solutions (if there are any). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files.

So what it comes down to, in my opinion: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?
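The trade-off can be made concrete with a single threshold. This is a minimal sketch of the idea, not how Ninka, FOSSology or ScanCode work: it assumes matches with `key`, `score` and `is_unknown` fields as in the scan output shown earlier, and the function name and `min_score` parameter are hypothetical.

```python
from typing import Dict, List


def conclude(matches: List[Dict], min_score: float) -> List[str]:
    """Report known licenses at or above min_score, else a single 'unknown'."""
    confident: List[str] = []
    for match in matches:
        known = not match["is_unknown"]
        if known and float(match["score"]) >= min_score and match["key"] not in confident:
            confident.append(match["key"])
    # Nothing confident enough: say "I don't know" instead of guessing.
    return confident or ["unknown"]
```

A high `min_score` gives fewer but more reliable license conclusions and more "unknown" results; a low one does the opposite.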

porsche-rishisaxena commented 2 years ago

Hi @pombredanne, we have found further false-positive license detection cases where ScanCode reported a blank SPDX expression, but when checked manually the actual license was present in the library on GitHub. Please find the report attached to this thread for your review. Note: this time ScanCode did not even report "Unknown"; the result was just blank.

NoLicenseDetection-report.xlsx

CC: @sschuberth

pombredanne commented 2 years ago

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

@PatteSI Thank you ++ that's super valuable input.

pombredanne commented 2 years ago

@porsche-rishisaxena re:

We have found further false-positive license detection cases where ScanCode reported a blank SPDX expression, but when checked manually the actual license was present in the library on GitHub. Please find the report attached to this thread for your review. Note: this time ScanCode did not even report "Unknown"; the result was just blank.

Thanks. Super useful too.

PatteSI commented 2 years ago

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify?

Depending on what is meant there are different solutions (if there are any). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files.

So what it comes down to, in my opinion: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?

It's not about "open source" licenses. I am pretty sure we can detect almost all known open source licenses. It's about "hints" to "unknown" (usually proprietary) licenses. As far as I know there is no standard on what wording has to be used in a source code file in order to place it under some arbitrary license. I am not even sure if one has to use the word "license" in order to do so. So basically if some troll wanted to place certain parts of code under a proprietary license while the rest of the project is under a different known license, he could do that with some wording or weird character encoding obfuscating the automatic detection of this section. So I would argue that we always have to make a trade-off here if we want to talk about "unknown" licenses, as it will never be possible to reach 100%. Like you said, we need to rely on heuristics that hopefully will trigger on the wording that someone uses when announcing a proprietary license (not yet available in any database) while having a high fidelity in such findings. In the end, the end user should be able to decide how many of those "unknown" hints he wants to have. Some projects require very high fidelity on their license usage while others don't and also do not have the capacity to check that many findings.

armijnhemel commented 2 years ago

@sschuberth re:

That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%.

My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. That's why the hard data input is key here.

I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results.

I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify? Depending on what is meant, there are different solutions (if there are any). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up and say "I don't know" and report the license as "unknown". FOSSology on the other hand would report a license, but could be completely wrong for those files. So what it comes down to, in my opinion: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"?

It's not about "open source" licenses. I am pretty sure we can detect almost all known open source licenses. It's about "hints" to "unknown" (usually proprietary) licenses. As far as I know there is no standard on what wording has to be used in a source code file in order to place it under some arbitrary license. I am not even sure if one has to use the word "license" in order to do so. So basically if some troll wanted to place certain parts of code under a proprietary license while the rest of the project is under a different known license, he could do that with some wording or weird character encoding that obfuscates this section from automatic detection. So I would argue that we always have to make a trade-off here if we want to talk about "unknown" licenses, as it will never be possible to reach 100%. Like you said, we need to rely on heuristics that hopefully will trigger on the wording someone uses when announcing a proprietary license (not yet available in any database) while having a high fidelity in such findings. In the end the end-user should be able to decide how many of those "unknown" hints he wants to have. Some projects require very high fidelity on their license usage while others don't and also do not have the capacity to check that many findings.

So the core question really is: what do you think "unknown license" means? Is it "there is a license but scancode doesn't know which one because it is not in its knowledgebase" (whether or not it is open or closed) or "scancode couldn't detect which license it is and threw its hands up"? This is conceptually a big difference.

PatteSI commented 2 years ago

So the core question really is: what do you think "unknown license" means? Is it "there is a license but scancode doesn't know which one because it is not in its knowledgebase" (whether or not it is open or closed) or "scancode couldn't detect which license it is and threw its hands up"? This is conceptually a big difference.

I think this discussion is a bit of a deviation from the original problem here. It doesn't matter what anyone thinks "unknown license" means. We are discussing how the heuristics could be improved and ways to give end users more options to evaluate findings. I am talking here as an end-user of ORT, which uses ScanCode as a scanner. We have started to migrate away from NexusIQ to ORT, and we sometimes see hundreds of those "unknown-license" findings in big projects. There are many examples of trivial findings where the heuristics/rules used in ScanCode are just too broad and get triggered by simple comments using the word "license". We are not only discussing the general problem here of how to improve the rule-based findings. There are many examples given in the first post. It's not only about "unknown" licenses.

pombredanne commented 2 years ago

I have attached a presentation that summarizes the issue:

ScanCode-licenses-false-positive-2022-03.pptx.pdf

pombredanne commented 2 years ago

@richardfontana I would be interested to get some feedback too.

@opensourcepilot I was re-reading your (thought-provoking) article at https://opensource.com/article/21/7/open-source-scanning-error and we are trying to fix false-positive license detections in ScanCode with this issue. Your insights would be much appreciated!

richardfontana commented 2 years ago

@sutula may find this of interest

alext34ms commented 2 years ago

On the topic of making the current rule set a bit more stringent: The feature of making certain tokens required by putting them in {{ }} is an awesome tool to sharpen SCTK even more. The challenge is that we have 31.000+ rules. But that should not stop us. 😉

Proposal: doing an "automated" retro-fit of all rules to include SPDX identifiers in {{ }}. I made a very KISS "one liner" that does just that. The result seems quite OK when looking at some sample rules.

Some thoughts for this update:

How does it affect AND/OR rules?
How does it affect (false-positive*) rules and what are those even?
Does it add any accuracy?
Does it pass integration tests?
Any others that I have missed?

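cd src/scancode-toolkit/src/licensedcode/data/rules
# Read the SPDX identifier list in reverse order (tac).
for identifier in $(tac ~/tmp/spdx_identifier.list); do
  echo "$identifier"
  # Only touch rules where the identifier is not glued to an uppercase letter.
  for rule in $(grep -E -l '([^A-Z]|^)('"$identifier"')([^A-Z]|$)' *.RULE); do
    # Wrap the identifier in {{ }} unless it already sits next to a brace.
    sed -i -E 's/([^A-Z{{\]|^)('"$identifier"')([^A-Z}}]|$)/\1{{'"$identifier"'}}\3/g' "$rule"
  done
done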

spdx_identifier.list

The one-liner above changes about 6400 of the 31000 rules.

pombredanne commented 2 years ago

@alext34ms re:

doing an "automated" retro-fit of all rules to include SPDX identifiers in {{ }}. I made a very KISS "one liner" that does just that. The result seems quite OK when looking at some sample rules.

Sleek! very smart. I like it

How does it affect AND/OR rules?

I do not think it would have any impact.

How does it affect (false-positive*) rules and what are those even?

These should be left alone. These are rules that can be matched only exactly and that are about licenses but are NOT license notices or texts. They should be used sparingly as a last resort. For instance, this text:

copyright info have been adapted to avoid the violation of the GPL license

is NOT a GPL-related notice, but some commentary about the GPL license and would be a typical case for a "false positive" rule.

Does it add any accuracy?

It should, but it may also degrade detection and miss some matches in a few corner cases. These could be caught separately by the --unknown-licenses option though.

Does it pass integration tests?

It will likely make some fail.

Any others that I have missed?

I think the approach could be refined with a Python script, as we already have code that handles the RULEs and knows all the SPDX licenses. We could also expand this to a few more things:

the SPDX id, as you suggest
the license short and long names
potentially the SPDX names and URLs

Some scripts examples to use as a base are in https://github.com/nexB/scancode-toolkit/blob/develop/etc/scripts/licenses/
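As a starting point, the shell one-liner could be sketched in Python roughly as follows. This is a standalone illustration, not code from the scripts directory above: `wrap_spdx_identifiers` is a hypothetical helper, and it works on plain rule text rather than on ScanCode's rule objects. Compared with the one-liner, it escapes regex metacharacters such as `.` and `+` in identifiers and processes longer identifiers first.

```python
import re


def wrap_spdx_identifiers(rule_text, identifiers):
    """Wrap each standalone SPDX identifier in {{ }} (hypothetical helper).

    Mirrors the one-liner's character classes: an identifier is skipped when
    it directly follows an uppercase letter, "{" or a backslash, or when it
    is directly followed by an uppercase letter or "}".
    """
    # Longest first, so "GPL-2.0-only" is wrapped before "GPL-2.0" is tried.
    for identifier in sorted(identifiers, key=len, reverse=True):
        pattern = re.compile(
            r"(?<![A-Z{\\])" + re.escape(identifier) + r"(?![A-Z}])"
        )
        # A callable replacement avoids any escape processing of the text.
        rule_text = pattern.sub(lambda m: "{{" + m.group(0) + "}}", rule_text)
    return rule_text


sample = "Licensed under GPL-2.0-only or GPL-2.0+ (not LGPL-2.0)"
wrapped = wrap_spdx_identifiers(sample, ["GPL-2.0", "GPL-2.0-only", "GPL-2.0+"])
```

Because an identifier preceded by `{` is skipped, running the helper a second time on already wrapped text leaves it unchanged.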

pombredanne commented 2 years ago

@vargenau Another case to track here https://github.com/nexB/scancode-toolkit/issues/2905

borisbaldassari commented 2 years ago

Hi Philippe, all,

First things first: thanks for the good work people! You're great! I'd like to contribute some feedback regarding false positives (/wrong license) we get at the Eclipse Foundation. One of our main issues is with the canonical headers used with the EPL-2.

/*********************************************************************
 * Copyright (c) 2019 Red Hat, Inc.
 *
 * This program and the accompanying materials are made
 * available under the terms of the Eclipse Public License 2.0
 * which is available at https://www.eclipse.org/legal/epl-2.0/
 *
 * SPDX-License-Identifier: EPL-2.0
 **********************************************************************/

Some lines (typically lines 4 and 5) are recognised as LicenseRef-scancode-unknown-license-reference even with the SPDX tag sitting right behind. It seems that another license text (license-intro_29.RULE) is matched before the EPL-2.0 text, so even adding several variations of the headers (line ends, etc.) doesn't help. Setting the license-score to 100 helps a bit (i.e. fewer wrong violations), but still not enough to make this case go away.
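One possible way to handle this case is to drop a license intro when a proper license match follows it closely. The sketch below only illustrates that idea and is not ScanCode's actual implementation: the `Match` class, its `is_intro` flag, the `max_gap` parameter and the `hypothetical-epl-2.0-notice.RULE` name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Simplified stand-in for a license match (assumed fields)."""
    license_key: str
    rule: str
    start_line: int
    end_line: int
    is_intro: bool = False  # hypothetical flag marking "license intro" rules


def drop_resolved_intros(matches, max_gap=1):
    """Drop an intro match when a non-intro match starts within max_gap
    lines after the intro ends; keep intros that stand alone."""
    ordered = sorted(matches, key=lambda m: m.start_line)
    kept = []
    for index, match in enumerate(ordered):
        if match.is_intro and any(
            not later.is_intro and later.start_line - match.end_line <= max_gap
            for later in ordered[index + 1:]
        ):
            continue  # the intro is explained by the license that follows
        kept.append(match)
    return kept


header_matches = [
    Match("unknown-license-reference", "license-intro_29.RULE", 4, 5, is_intro=True),
    Match("epl-2.0", "hypothetical-epl-2.0-notice.RULE", 5, 6),
    Match("epl-2.0", "spdx-license-identifier", 8, 8),
]
resolved = drop_resolved_intros(header_matches)
```

An intro with no license match nearby is kept, so it can still be reported as a clue.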

Would it be helpful to provide a list of false positives / wrong identifications? I'll be happy to provide one if so.

AyanSinhaMahapatra commented 2 years ago

@borisbaldassari Thanks for your feedback and report.

We are working on this, and this specific issue of a license_intro being present before detections is going to be fixed. This is WIP and hasn't landed yet.

Would it be helpful to provide a list of false positives / wrong identifications? I'll be happy to provide one if so.

This would be extremely helpful, we will use this for testing this new feature extensively, as we are using the other lists contributed here. Thanks a lot!

borisbaldassari commented 2 years ago

Hi @AyanSinhaMahapatra Thanks for the heads-up! I'll wait for the landing and give it a try. :-)

Please find below a list of unknown-license false positives found in a few Eclipse projects (Che, JGit, CDT, Tycho). If needed I can analyse more projects -- but since we're using ORT we don't have direct access to the scancode output, so I need to run it separately (and manually).

scancode_fp_eclipse.tar.gz

borisbaldassari commented 2 years ago

Please also find attached the Python script used to generate the csv's, if it's useful.

extract_false_positives.py.tar.gz

pombredanne commented 2 years ago

@borisbaldassari Thank you ++

pombredanne commented 2 years ago

Another short SSPL false positive https://github.com/nexB/scancode-toolkit/issues/2975

PatteSI commented 1 year ago

As @pombredanne also asked in my initial issue for a list of false positives, I just wanted to mention that the ORT community has also started sharing curations for false positives. I guess ScanCode is still one of the most widely used scanning components in ORT, so they might all be relevant for you. Check out their curations and package configurations: https://github.com/oss-review-toolkit/ort-config
