aboutcode-org / scancode-toolkit


OR identified as AND #3523

Open hesa opened 1 year ago

hesa commented 1 year ago

Description

Looking at Cairo 1.16 and the following file https://cgit.freedesktop.org/cairo/tree/src/cairo-analysis-surface-private.h?h=1.16.0

The file has the following license text at the top:

..... snip
 * This library is free software; you can redistribute it and/or
 * modify it either under the terms of the GNU Lesser General Public
 * License version 2.1 as published by the Free Software Foundation
 * (the "LGPL") or, at your option, under the terms of the Mozilla
 * Public License Version 1.1 (the "MPL").
.... snip

The interesting part is "GNU Lesser General Public License version 2.1 as published by the Free Software Foundation (the "LGPL") or, at your option, under the terms of the Mozilla Public License Version 1.1 (the "MPL")."

Reported license Scancode version 31.2.6: "license_expressions": [ "lgpl-2.1 OR mpl-1.1" ]

Reported license Scancode version 32.0.6:

      "detected_license_expression": "lgpl-2.1 AND mpl-1.1",
      "detected_license_expression_spdx": "LGPL-2.1-only AND MPL-1.1",

So, Scancode 32.0.6 does not correctly identify "or, at your option" as "OR".
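To compare the two reports programmatically, something like the following could work. It is only a sketch: the helper name `file_expressions` is hypothetical, and it assumes the per-file keys are exactly the ones quoted above (`license_expressions` in 31.2.6, `detected_license_expression` in 32.0.6).

```python
def file_expressions(file_entry):
    """Return the license expressions of one file entry as a list.

    Hypothetical helper: handles only the two key layouts quoted in this report.
    """
    # Key seen in the 32.0.6 output quoted above.
    detected = file_entry.get("detected_license_expression")
    if detected:
        return [detected]
    # Key seen in the 31.2.6 output quoted above.
    return list(file_entry.get("license_expressions", []))


# Values copied from the two reports above.
old_entry = {"license_expressions": ["lgpl-2.1 OR mpl-1.1"]}
new_entry = {
    "detected_license_expression": "lgpl-2.1 AND mpl-1.1",
    "detected_license_expression_spdx": "LGPL-2.1-only AND MPL-1.1",
}

old = file_expressions(old_entry)
new = file_expressions(new_entry)
changed = old != new
print(old, new, "changed" if changed else "same")
```

Run over every file of both scans, a check like this would list the paths where a license choice (OR) turned into a conjunction (AND).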

How To Reproduce

Download and unpack https://www.cairographics.org/releases/cairo-1.16.0.tar.xz

My settings/command line args for 31.2.6:

  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "31.2.6",
      "options": {
        "input": [
          "cairo-1.16.0"
        ],
        "--classify": true,
        "--copyright": true,
        "--email": true,
        "--info": true,
        "--json-pp": "cairo-1.16.0-scan.json",
        "--license": true,
        "--license-clarity-score": true,
        "--license-text": true,
        "--license-text-diagnostics": true,
        "--package": true,
        "--processes": "16",
        "--summary": true
      },
....snip

My settings/command line args for 32.0.6:

  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "32.0.6",
      "options": {
        "input": [
          "cairo-1.16.0"
        ],
        "--classify": true,
        "--copyright": true,
        "--email": true,
        "--info": true,
        "--json-pp": "cairo-1.16.0-scancode-32.0.06-scan.json",
        "--license": true,
        "--license-clarity-score": true,
        "--license-text": true,
        "--license-text-diagnostics": true,
        "--package": true,
        "--processes": "16",
        "--summary": true
      },
 .... snip

System configuration

Scancode:

$ pip list | grep scancode
scancode-toolkit            32.0.6

OS:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

DennisClark commented 1 year ago

@AyanSinhaMahapatra I believe the issue as reported is correct: Scancode 32.0.6 does not correctly identify "or, at your option" as "OR"

DennisClark commented 1 year ago

It might be tricky, since it is very similar to the wording used for "or later" licenses.

pombredanne commented 1 year ago

@hesa there is a reference to SCTK 31.2.6 and then 32.0.6 above .... I get this with 32.0.6:

- matches:
    - score: '94.64'
      matcher: 3-seq
      end_line: 29
      rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_or_mpl-1.1_2.RULE
      start_line: 4
      matched_text: |
         * This library is free software; you can redistribute it and/or
         * modify it either under the terms of the GNU Lesser General Public
         * License version 2.1 as published by the Free Software Foundation
         * (the "LGPL") or, at your option, under the terms of the Mozilla
         * Public License Version 1.1 (the "MPL"). If you do not alter this
         * notice, a recipient may use your version of this file under either
         * the MPL or the LGPL.
         *
         * You should have received a copy of the LGPL along with this library
         * in the file COPYING-LGPL-2.1; if not, write to the Free Software
         * Foundation, Inc., 51 Franklin Street, Suite 500, Boston, MA 02110-1335, USA
         * You should have received a copy of the MPL along with this library
         * in the file COPYING-MPL-1.1
         *
         * The contents of this file are subject to the Mozilla Public License
         * Version 1.1 (the "License"); you may not use this file except in
         * compliance with the License. You may obtain a copy of the License at
         * http://www.mozilla.org/MPL/
         *
         * This software is distributed on an "AS IS" basis, WITHOUT WARRANTY
         * OF ANY KIND, either express or implied. See the LGPL or the MPL for
         * the specific language governing rights and limitations.
         *
         * The Original Code is the cairo graphics library.
         *
         * The Initial Developer of the Original Code is Keith Packard
      match_coverage: '94.64'
      matched_length: 212
      rule_relevance: 100
      rule_identifier: lgpl-2.1_or_mpl-1.1_2.RULE
      license_expression: lgpl-2.1 OR mpl-1.1
  identifier: lgpl_2_1_or_mpl_1_1-8f4efaf4-1022-63ec-f250-bae5e2a90bfd
  license_expression: lgpl-2.1 OR mpl-1.1

This is correct? (But not perfect in my book: we will need an improved rule to ensure we match this exactly and not approximately.)
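A quick way to spot such approximate matches in a scan is to filter on `match_coverage`. This is a minimal sketch, not part of ScanCode: the function name `approximate_matches` and the 100.0 threshold are assumptions, and the sample record just copies the fields from the match shown above.

```python
def approximate_matches(matches, threshold=100.0):
    """Return the rule identifiers of matches whose coverage is below threshold.

    Hypothetical helper; float() accepts both the quoted value shown in the
    YAML above and a plain number.
    """
    return [
        match["rule_identifier"]
        for match in matches
        if float(match["match_coverage"]) < threshold
    ]


# Fields copied from the match above.
matches = [
    {
        "rule_identifier": "lgpl-2.1_or_mpl-1.1_2.RULE",
        "matcher": "3-seq",
        "match_coverage": "94.64",
        "license_expression": "lgpl-2.1 OR mpl-1.1",
    }
]

print(approximate_matches(matches))
```

Any rule listed here was matched only approximately and is a candidate for a new, exact rule.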

hesa commented 1 year ago

This is weird ..... I get different results when scanning the single file compared to the entire project.

Below is the output from a small script I wrote (see further below) to show the weirdness (I am not ruling out that I have made a mistake).

About Scancode
==============

Scancode: 32.0.6

Scan results for file
=====================

Entire project: Cairo 1.16.0
----------------------------
lgpl-2.1 AND mpl-1.1  from rule: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_or_mpl-1.1_2.RULE
lgpl-2.1 AND mpl-1.1  from rule: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_alternative.RULE
lgpl-2.1 AND mpl-1.1  from rule: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mpl-1.1.LICENSE

Single file: cairo-analysis-surface-private.h
---------------------------------------------
lgpl-2.1 OR mpl-1.1  from rule: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_or_mpl-1.1_2.RULE

Diff between header files
=========================

sdiff
-----

md5
---
fdba194b1e7e431389d45a625fd13169  cairo-1.16.0/src/cairo-analysis-surface-private.h
fdba194b1e7e431389d45a625fd13169  single_file/cairo-analysis-surface-private.h

Script:

#!/bin/bash

FILE_URL="https://cgit.freedesktop.org/cairo/plain/src/cairo-analysis-surface-private.h?h=1.16.0"
CAIRO_URL="https://www.cairographics.org/releases/cairo-1.16.0.tar.xz"

TMP_DIR=Scancode-3523
mkdir -p "${TMP_DIR}"
# Stop here if the working directory cannot be entered.
cd "${TMP_DIR}" || exit 1

do_scan()
{
    scancode --classify --copyright --email --info --json-pp "${1}-scan.json" \
             --license --license-clarity-score --license-text \
             --license-text-diagnostics --package --processes 16 --summary "$1"

}

scan_project()
{
    curl -LJO ${CAIRO_URL}
    xz -d cairo-1.16.0.tar.xz
    tar xvf cairo-1.16.0.tar

    do_scan cairo-1.16.0
}

scan_file()
{
    mkdir single_file
    curl -LJO ${FILE_URL}
    mv cairo-analysis-surface-private.h single_file

    do_scan single_file
}

scan_project
scan_file

echo "About Scancode"
echo "=============="
echo
echo -n "Scancode: "
jq -r '.headers[0].tool_version' cairo-1.16.0-scan.json
echo

echo "Scan results for file"
echo "====================="
echo
echo "Entire project: Cairo 1.16.0"
echo "----------------------------"
cat cairo-1.16.0-scan.json |  jq -r '.files[] | select(.path=="cairo-1.16.0/src/cairo-analysis-surface-private.h") | .license_detections[] | "\(.license_expression)  from rule: \(.matches[].rule_url)"'
echo

echo "Single file: cairo-analysis-surface-private.h"
echo "---------------------------------------------"
cat single_file-scan.json |  jq -r '.files[] | select(.path=="single_file/cairo-analysis-surface-private.h") | .license_detections[] | "\(.license_expression)  from rule: \(.matches[].rule_url)"'
echo

echo "Diff between header files"
echo "========================="
echo
echo "sdiff"
echo "-----"
sdiff -s "cairo-1.16.0/src/cairo-analysis-surface-private.h"  "single_file/cairo-analysis-surface-private.h"
echo
echo "md5"
echo "---"
md5sum "cairo-1.16.0/src/cairo-analysis-surface-private.h"  "single_file/cairo-analysis-surface-private.h"
echo
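
For anyone without jq, the two jq queries in the script could be approximated in Python roughly as follows. This is a sketch under the assumption that the scan JSON has the layout the jq filter relies on (`files[].path`, `files[].license_detections[].license_expression`, `matches[].rule_url`); the function name `expressions_with_rules` and the inline sample data are made up for illustration.

```python
import json


def expressions_with_rules(scan, path):
    """Return (license_expression, rule_url) pairs for one file of a scan.

    Mirrors the jq filter in the script: one row per match of each detection.
    """
    rows = []
    for file_entry in scan.get("files", []):
        if file_entry.get("path") != path:
            continue
        for detection in file_entry.get("license_detections", []):
            for match in detection.get("matches", []):
                rows.append(
                    (detection.get("license_expression"), match.get("rule_url"))
                )
    return rows


# Tiny stand-in for a real scan; with real data one would instead do
# json.load(open("single_file-scan.json")).
sample_scan = json.loads("""
{
  "files": [
    {
      "path": "single_file/cairo-analysis-surface-private.h",
      "license_detections": [
        {
          "license_expression": "lgpl-2.1 OR mpl-1.1",
          "matches": [
            {"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_or_mpl-1.1_2.RULE"}
          ]
        }
      ]
    }
  ]
}
""")

rows = expressions_with_rules(
    sample_scan, "single_file/cairo-analysis-surface-private.h"
)
for expression, rule_url in rows:
    print(f"{expression}  from rule: {rule_url}")
```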

pombredanne commented 1 year ago

"This is weird ..... I get different results when scanning the single file compared to the entire project."

@AyanSinhaMahapatra could this be because we are following (incorrectly) the license file references?

pombredanne commented 1 year ago

@AyanSinhaMahapatra the issue is clearly because we follow references. I have detailed the issues in https://github.com/nexB/scancode-toolkit/issues/3547. Here we have two referenced filenames, and if those files are present they are followed and the license choice is incorrectly replaced by the referenced licenses.
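
To check whether a given scan is likely affected, one could look for the filenames a notice refers to and see whether they exist in the scanned tree. The sketch below is not ScanCode's reference-following code: the regex, the function names and the demo directory are all assumptions, and it only covers the "in the file COPYING-..." wording of the notice quoted earlier in this thread.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: only catches "in the file COPYING-..." references.
REFERENCE_PATTERN = re.compile(r"in the file\s+(COPYING[\w.\-]*)")


def referenced_files(notice_text):
    """Return the filenames a license notice refers to."""
    return REFERENCE_PATTERN.findall(notice_text)


def present_references(notice_text, root):
    """Return the referenced filenames that exist somewhere under root."""
    root = Path(root)
    return [
        name for name in referenced_files(notice_text) if any(root.rglob(name))
    ]


# Two lines copied from the matched text earlier in this thread.
notice = """
 * in the file COPYING-LGPL-2.1; if not, write to the Free Software
 * in the file COPYING-MPL-1.1
"""

names = referenced_files(notice)
print(names)

# Demo tree containing only one of the two referenced files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "COPYING-LGPL-2.1").write_text("placeholder")
    found = present_references(notice, tmp)
    print(found)
```

A directory holding only the header file has no referenced files present, which would fit the single-file scan keeping the OR expression.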