aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/
2.07k stars 536 forks source link

Scancode confuses files with similar names #3648

Closed bennati closed 5 months ago

bennati commented 7 months ago

Scanning a folder containing only two files COPYING and COPYING.LGPLv2.1 (see files.zip )

Scancode detects license LGPL in file COPYING at line 502, but COPYING has only 66 lines while COPYING.LGPLv2.1 has 502.

Scancode output is in the attachment.

This issue can be reproduced with Scancode 32.0.8 but not with 31.2.6

pombredanne commented 7 months ago

This is a really surprising bug! Thanks for the report

AyanSinhaMahapatra commented 7 months ago

@bennati @pombredanne We have the following lines in COPYING:

    The following license texts are included in the following files:
      - COPYING.LGPLv2.1: GNU Lesser General Public License version 2.1
      - COPYING.GPLv2: GNU General Public License version 2
      - COPYING.GPLv3: GNU General Public License version 3

which is matched to https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/lgpl-2.1_and_gpl-2.0_and_gpl-3.0_3.RULE and references the other COPYING files.

And it brings in the license detected at COPYING.LGPLv2.1 present here.

See scan results below (scanned with latest SCTK develop) copying-files.json

See the match in question:

            {
              "license_expression": "lgpl-2.1",
              "spdx_license_expression": "LGPL-2.1-only",
              "from_file": "copying-files/COPYING.LGPLv2.1",
              "start_line": 1,
              "end_line": 502,
              "matcher": "1-hash",
              "score": 100.0,
              "matched_length": 4288,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "lgpl-2.1.LICENSE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/lgpl-2.1.LICENSE",
            }

it mentions "from_file": "copying-files/COPYING.LGPLv2.1", which tells us this match has been taken from the respective file as that was referenced.

The from-file attribute was added recently at https://github.com/nexB/scancode-toolkit/pull/3620 to help make referenced matches more verbose. See also related issue: https://github.com/nexB/scancode-toolkit/issues/3547

pombredanne commented 7 months ago

@AyanSinhaMahapatra so if I may recap, you are saying that the detection is correct, and that the latest version helps better understand the detection details?

AyanSinhaMahapatra commented 7 months ago

Yes @pombredanne :) That's correct

bennati commented 7 months ago

Thank you guys, I opened an issue in the ORT project to support this new feature https://github.com/oss-review-toolkit/ort/issues/8190

sschuberth commented 7 months ago

it mentions "from_file": "copying-files/COPYING.LGPLv2.1", which tells us this match has been taken from the respective file as that was referenced.

I have to say that I find the implementation to be very confusing from a user perspective. There are plenty of tools out there that parse ScanCode result JSON files, and so far all of them could have relied on lines from matches to belong to the file / path they are nested under. But now, suddenly, additional info is added that easily goes unnoticed and basically says "eh, wait, this match actually comes from a different file despite being listed here"!

Also, could you explain the general idea behind such references a bit more? Where's the benefit to refer from COPYING to COPYING.LGPLv2.1 if the latter also gets scanned on its own? Are we then actually getting duplicate findings for COPYING.LGPLv2.1, once as a reference as part of scanning COPYING, and then again from scanning COPYING.LGPLv2.1 directly?

sschuberth commented 7 months ago

Oh, and on top of that, it seems there currently (with ScanCode 32.0.8) is no way to distinguish "real findings" from "reference findings" as there is no from_file field yet, correct?

AyanSinhaMahapatra commented 7 months ago

@sschuberth could you kindly explain more what is confusing here?

License references to other files are pretty common and encountered widely across codebases. And there are generally a few cases and see #3547 for more details on this.

  1. In a lot of cases there is no information about the license at the location where another file is referenced. This is an unknown license reference, and if we can successfully find the referenced file and detect some license there (let's say anything apache-2.0) we carry over the license matches and also make the license expression to be apache-2.0 and not unknown-license-reference.

  2. Also in some cases we have information about the license both in the location where a file is referenced and the referenced file. Now there could be a mismatch between these two information, the file could say, licensed under gpl, see LICENSE but at LICENSE we get a different/more licenses.

In both these cases, it is important to go to the referenced file and check for license information there, and adjust the license expression accordingly where it was referenced. Now when we are changing a license expression based on results in some other file (which is extremely important as this can be unknown/wrong, and a major source of false positives we get reports for), we need to have some clue that something was modified based on results in another file. Otherwise it would be more confusing where we detected the license expression at.

So this is an improvement over previous versions where we were not following references, and we are also providing clues on how and from where we got the license matches from.

If you're not interested in these matches, or only want to show matches which are from the same file, you can filter based on from_file values and the path for a file.

This feature of following references to resolve cases like these has been there for more than two years now, even before the recent license detection upgrade and output format change in v32. See https://github.com/nexB/scancode-toolkit/pull/2616 where the initial implementation was added. Now we just provide more helpful info on which file this originated from, so that downstream tools can use this to show origin properly. It would have been nice to include this right at the start when we implemented following references, but we didn't get any feedback on this then.

AyanSinhaMahapatra commented 7 months ago

There is also some discussion with lot more details on https://github.com/nexB/scancode-toolkit/issues/3547 where we are considering when to carry over referenced matches and when not to. Please chime in there too if you want to give feedback.

AyanSinhaMahapatra commented 7 months ago

And this from_file attribute was added based on feedback at https://github.com/nexB/scancode.io/issues/902 to enable downstream tools like SCIO to show license match origin correctly.

sschuberth commented 7 months ago

In a lot of cases there is no information about the license at the location where another file is referenced.

Agreed. Hence I'm doubting the usefulness of tracking the place where a reference is being made if the target of the reference is included in the scan.

Also in some cases we have information about the license both in the location where a file is referenced and the referenced file.

I was wondering why you say only "in some cases", but I guess you have cases in mind where the reference points to a file that is not included in the scan like "see license at http://my.server/LICENSE.txt". In such a case I see how it could be useful track this is a reference findings to an unknown license. While ScanCode may not look at the contents of http://my.server/LICENSE.txt, a human inspecting the scan results might do so.

If you're not interested in these matches, or only want to show matches which are from the same file, you can filter based on from_file values and the path for a file.

I was preparing to do that, but that does not work with results generated by ScanCode 32.0.8 as it does not provide the from_file field yet, right?

This feature of following references to resolve cases like these has been there for more than two years now

Interesting. So we probably have quite a few wrongly associated license findings in our databases, and it just didn't occur to us until now, where we ran into a case where the end line was exceeded.

And this from_file attribute was added based on feedback at nexB/scancode.io#902 to enable downstream tools like SCIO to show license match origin correctly.

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

sschuberth commented 6 months ago

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

@AyanSinhaMahapatra can you confirm my statement here, that before the addition of from_file, there was no way for parsers of the ScanCode JSON result file to understand that a finding does not actually come from the file it is associated with?

AyanSinhaMahapatra commented 6 months ago

@sschuberth sorry about the late reply:

Agreed. Hence I'm doubting the usefulness of tracking the place where a reference is being made if the target of the reference is included in the scan.

The usefulness is that we are resolving unknown detections for license references, to their intended license declaration, and tracking how we came to the conclusion, in a variety of cases. So, lots of reduced false positives. We cannot get rid of these kinds of unknowns any other way, short of not reporting anything at all, and since these could be in key files, package manifests etc., which determine the primary license for a package, it would be misleading to not report the declared license.

but I guess you have cases in mind where the reference points to a file that is not included in the scan like "see license at http://my.server/LICENSE.txt"

Not exactly.

  1. These are license references which are not unknown. An example is the issue in question, where we have license information properly detected both at COPYING (plus a license reference is there) and COPYING.LGPLv2.1 (the file referred to) but we still chose to follow and check whether this is consistent. This is essentially what is discussed in https://github.com/nexB/scancode-toolkit/issues/3547 and something which is being discussed more before being implemented.

  2. We are planning to support license references to files which are not present in the codebase scaned, rather somewhere in the internet, like: http://github.com/nexB/license-expression/LICENSE.txt. We already support some of these which are popular by making the URL a license rule directly, but not everything can be added like this. Another case is nuget, where the license field for the packages sometimes has a reference to a file/webpage somewhere (we already detect the SPDX expressions there allright: https://github.com/nexB/scancode-toolkit/issues/3037). But this would be as a step in scancode.io, or as an optional step in SCTK as this would require making network calls.

I was preparing to do that, but that does not work with results generated by ScanCode 32.0.8 as it does not provide the from_file field yet, right?

Yeah, this is merged in develop and we are working towards a release: v32.1 sooner than later.

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

Yeah that's correct, we have not had any feedback to point in this direction in the couple years since it's release. When we had feedback, we implemented it. And this was an improvement anyway, which we are building upon to reduce false positives.

can you confirm my statement here, that before the addition of from_file, there was no way for parsers of the ScanCode JSON result file to understand that a finding does not actually come from the file it is associated with?

I mean above that there is no straightforward/easy way to do this, but it could be done potentially with rescanning the same scan target, if it's possible to do that. One would have to check the associated rule for each match, and if there are any referenced_filenames for that rule. And for a scan if there referenced filenames in any of the detected matches, and that file is present in the codebase, do a rescan for the same and use the new results.

sschuberth commented 6 months ago

it could be done potentially with rescanning the same scan target, if it's possible to do that.

Nah, that would be too involving, I believe.

Thanks for your answers and insights!

sschuberth commented 5 months ago

Let's close this now that the ScanCode 32.1.0 release as the from_file available which allows us to disambiguate the findings that actually stem from other files.

fviernau commented 5 months ago

@AyanSinhaMahapatra, @sschuberth would it make sense to you to add a CLI option to toggle the feature for resolving the references?

sschuberth commented 5 months ago

@AyanSinhaMahapatra, @sschuberth would it make sense to you to add a CLI option to toggle the feature for resolving the references?

I'm not sure, as changing that CLI option would invalidate the scan storage on the ORT side. So I'd probably rather continue filtering the results like we do.

fviernau commented 5 months ago

I'm not sure, as changing that CLI option would invalidate the scan storage on the ORT side. So I'd probably rather continue filtering the results like we do.

I believe that these two approaches lead to different scan results. If the feature flag was added we would keep the "license-ref" findings, while with the filtering neither the license ref findings nor the inlined findings are present. I could imagine that some users might want to have findings for such references - and then a toggle probably could indeed make sense.

sschuberth commented 5 months ago

I believe that these two approaches lead to different scan results.

You lost me now. Anyway, I feel this is a discussion we should have in the ORT community, not at the ScanCode side.

sschuberth commented 5 months ago

@AyanSinhaMahapatra as part of a discussion in ORT it came up that we now seem to have at least two kinds of reference findings for licenses in ScanCode:

  1. Those that are identified by the new from_file field containing something different than path (modulo this issue).
  2. References that contain "reference" as part of the LicenseRef name (AFAIK the only one is LicenseRef-scancode-unknown-license-reference).

How do these relate to each other? In particular, do 2. also contain from_file different from path, or are these exclusively for references external to the code base? Or phrased differently, if we take only findings where from_file == path, is there a risk to miss LicenseRef-scancode-unknown-license-reference findings?

Thanks for any insights.