aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Scan is not always providing valid package filenames #1880

Open DennisClark opened 4 years ago

DennisClark commented 4 years ago

I scanned libzmq-4.3.2 using scancode-toolkit-develop from 2020-01-16. The scan results include this:

"consolidated_packages": [
    {
      "type": "nuget",
      "namespace": null,
      "name": "libzmq-vc120",
      "version": "4.2.3.0",
      "qualifiers": {},
      "subpath": null,
      "primary_language": null,
      "description": "The 0MQ lightweight messaging kernel, with tweetnacl integrated, packaged for specific Visual Studio compiler.\nThe 0MQ lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products. 0MQ sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering (subscriptions), seamless access to multiple transport protocols and more.",
      "release_date": null,
      "parties": [
        {
          "type": null,
          "role": "author",
          "name": "libzmq contributors",
          "email": null,
          "url": null
        },
        {
          "type": null,
          "role": "owner",
          "name": "Eric Voskuil",
          "email": null,
          "url": null
        }
      ],
      "keywords": [],
      "homepage_url": "https://github.com/zeromq/libzmq",
      "download_url": null,
      "size": null,
      "sha1": null,
      "md5": null,
      "sha256": null,
      "sha512": null,
      "bug_tracking_url": null,
      "code_view_url": null,
      "vcs_url": null,
      "copyright": "GNU Lesser GPL v3",
      "license_expression": "unknown",
      "declared_license": "https://raw.github.com/zeromq/libzmq/master/COPYING.LESSER",
      "notice_text": null,
      "root_path": "libzmq-4.3.2/packaging/nuget",
      "dependencies": [],
      "contains_source_code": null,
      "source_packages": [],
      "purl": "pkg:nuget/libzmq-vc120@4.2.3.0",
      "repository_homepage_url": "https://www.nuget.org/packages/libzmq-vc120/4.2.3.0",
      "repository_download_url": "https://www.nuget.org/api/v2/package/libzmq-vc120/4.2.3.0",
      "api_data_url": "https://api.nuget.org/v3/registration3/libzmq-vc120/4.2.3.0.json",
      "identifier": "pkg_nuget_libzmq_vc120_4_2_3_0_1",
      "consolidated_license_expression": "gpl-3.0 AND mit AND unknown",
      "consolidated_holders": [],
      "consolidated_copyright": "Copyright (c) ",
      "core_license_expression": "unknown",
      "core_holders": [],
      "other_license_expression": "gpl-3.0 AND mit",
      "other_holders": [],
      "files_count": 6
    }
  ]

If you use the repository_download_url provided in the scan results, you get this file: libzmq-vc120.4.2.3.nupkg, but that filename value is nowhere to be found in the original scan results.
There is no obvious, reliable way to derive that filename from the scan results, which is unfortunate if you are trying to use the consolidated package info from the scan itself. Is there a way that scancode-toolkit can provide the correct package filename?

Attachments: libzmq-4.3.2.tar.gz, libzmq-4.3.2.json.zip

pombredanne commented 4 years ago

@DennisClark there is no guaranteed relationship between the package data collected from a scan and the actual files being scanned... that's not ScanCode's fault; it is driven by the context.

Here we parse the package manifest (a .nuspec for a NuGet), which is here: https://github.com/zeromq/libzmq/blob/v4.3.2/packaging/nuget/package.nuspec

The thing is that https://github.com/zeromq/libzmq is a source repository and it does not contain the actual built NuGet. From that data, ScanCode infers that the corresponding standard repository_download_url is https://www.nuget.org/api/v2/package/libzmq-vc120/4.2.3.0, and that seems correct to me.

At that URL, libzmq-vc120.4.2.3.nupkg is a built binary NuGet package (a zip archive) that was built using the .nuspec above as a "build script". It would contain a few files (such as the .nuspec) that also exist in the source repo, but the key DLLs and executables compiled from the source code would rarely be in the source repo and would be found only in the .nupkg.
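To illustrate the inference step, here is a minimal sketch of how the three repository URLs in the scan output can be built from only the package name and version. The function name `nuget_repository_urls` is hypothetical and this is not the actual ScanCode implementation; the URL patterns are simply copied from the scan results quoted above.

```python
def nuget_repository_urls(name, version):
    """Build the standard NuGet URLs from a package name and version.

    Hypothetical helper: the patterns mirror the values seen in the
    scan results above, not ScanCode's real code.
    """
    return {
        "repository_homepage_url": f"https://www.nuget.org/packages/{name}/{version}",
        "repository_download_url": f"https://www.nuget.org/api/v2/package/{name}/{version}",
        "api_data_url": f"https://api.nuget.org/v3/registration3/{name}/{version}.json",
    }


urls = nuget_repository_urls("libzmq-vc120", "4.2.3.0")
print(urls["repository_download_url"])
```

Note that none of these URLs carries the archive filename: they are derived from the manifest alone, without looking at any built package.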

I hope my explanation makes some sense and is not too contrived!

pombredanne commented 2 years ago

In this NuGet case, the download URL redirects to https://globalcdn.nuget.org/packages/libzmq-vc120.4.2.3.nupkg. SCTK may not be able to figure that out unless we add a way for SCTK to make online network calls, but ScanCode.io would likely be OK doing such a thing in a pipeline. Alternatively, we could always infer a package archive filename and add it as a new attribute for a package?
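As a rough idea of what the network-call option could look like, here is a sketch using only the Python standard library. Both function names are hypothetical and nothing like this exists in SCTK today. It assumes the server answers with an HTTP redirect whose final URL path ends with the archive filename, as in the case above.

```python
import posixpath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return unquote(posixpath.basename(urlsplit(url).path))


def resolve_download_filename(download_url, timeout=10):
    """Follow redirects from download_url and return the final filename.

    This makes a network call. urlopen follows redirects by default;
    the response body is never read, only the final URL is inspected.
    """
    with urlopen(download_url, timeout=timeout) as response:
        return filename_from_url(response.url)


# Offline check on the redirect target reported above.
print(filename_from_url("https://globalcdn.nuget.org/packages/libzmq-vc120.4.2.3.nupkg"))
```

Calling `resolve_download_filename` on the repository_download_url would need network access, which is why this fits a ScanCode.io pipeline better than an offline scan.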

Here libzmq-vc120.4.2.3.nupkg could be derived from the details we get in https://www.nuget.org/api/v2/package/libzmq-vc120/4.2.3.0
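A minimal offline sketch of that inference follows. The function names are hypothetical. It assumes the archive is named `<name>.<normalized version>.nupkg` and that NuGet version normalization strips leading zeros, drops a zero fourth part, and removes build metadata. That matches the 4.2.3.0 to 4.2.3 difference seen here, but the rules should be checked against the NuGet documentation before relying on them, and casing of the package id is not handled.

```python
def normalize_nuget_version(version):
    """Approximate NuGet version normalization (assumed rules, see above)."""
    # Remove SemVer 2.0 build metadata such as "+build.5".
    version = version.split("+", 1)[0]
    # Keep any prerelease label ("-beta") untouched.
    core, dash, prerelease = version.partition("-")
    # Strip leading zeros from each numeric part.
    parts = [str(int(part)) for part in core.split(".")]
    # Pad to at least three parts.
    while len(parts) < 3:
        parts.append("0")
    # Omit the fourth part when it is zero.
    if len(parts) == 4 and parts[3] == "0":
        parts = parts[:3]
    return ".".join(parts) + dash + prerelease


def infer_nupkg_filename(name, version):
    """Guess the .nupkg archive filename from package name and version."""
    return f"{name}.{normalize_nuget_version(version)}.nupkg"


print(infer_nupkg_filename("libzmq-vc120", "4.2.3.0"))
```

An inferred value like this would be a best guess rather than a verified filename, so if it were added as a new package attribute it would probably need to be labeled as such.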