aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Support SPDX tag-value as an output format #338

Closed sschuberth closed 7 years ago

sschuberth commented 7 years ago

It would be great if the SPDX tag-value format would be supported as an output format, see this example.

pombredanne commented 7 years ago

@sschuberth this makes perfect sense. (FWIW, I am after all one of the co-founders of SPDX: http://web.archive.org/web/20160617012420/https://spdx.org/about-spdx/what-is-spdx )

That said, @ah450 wrote this Python library for SPDX for the Google Summer of Code a few years back (I was the mentor). See:

This is not entirely up to date with the latest spec and would likely require some work first, but it is a great base to start from.

The general idea would probably be to have an external utility that takes scan results (or more generally the upcoming ABC data format, to be pushed soon to https://github.com/nexB/aboutcode ) and converts that to SPDX (and possibly the other way around too).

@sschuberth Would you have some cycles to chip in, maybe? @ah450 Do you think you could update your library to support the latest SPDX spec, at least the tag-value format?

sschuberth commented 7 years ago

@pombredanne I'd love to help, but don't have the time currently. Anyway it's good to see that you acknowledge this to be a meaningful feature.

pombredanne commented 7 years ago

@sschuberth no problem. In any case this is definitely in the roadmap: I just updated it: https://github.com/nexB/scancode-toolkit/wiki/Roadmap#data-exchange

sschuberth commented 7 years ago

@pombredanne I'm currently thinking about implementing this as a post-processing step to JSON output (similar to json2csv.py), not as a strictly separate format. That would have the advantage of being able to convert existing JSON files to SPDX files (if we expose the conversion via command line options). What do you think?
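As a rough illustration of the post-processing idea above, here is a minimal Python sketch that turns already-loaded scan results into SPDX tag-value text. It is not ScanCode's or spdx-tools' implementation. The function name `scan_to_spdx_tag_value` and the input shape (a `files` list whose entries have `path`, `licenses` as SPDX identifiers, and `copyrights` as strings) are assumptions made for this example, not ScanCode's actual JSON schema, and the output is only a fragment, not a complete valid SPDX document.

```python
def scan_to_spdx_tag_value(scan):
    """Render a simplified scan dict as SPDX tag-value text.

    Assumed (hypothetical) input shape:
    {"files": [{"path": str, "licenses": [str], "copyrights": [str]}]}
    """
    # Only two document-level tags are emitted; a real document needs more
    # (for example a name, a namespace and creation information).
    lines = ["SPDXVersion: SPDX-2.1", "DataLicense: CC0-1.0"]

    for entry in scan.get("files", []):
        lines.append("")
        lines.append("FileName: ./{}".format(entry["path"]))

        # De-duplicate and sort license identifiers; NONE means nothing found.
        licenses = sorted(set(entry.get("licenses", []))) or ["NONE"]
        for license_id in licenses:
            lines.append("LicenseInfoInFile: {}".format(license_id))

        # Multi-line values are wrapped in <text>...</text> in tag-value files.
        copyrights = entry.get("copyrights", [])
        if copyrights:
            text = "\n".join(copyrights)
            lines.append("FileCopyrightText: <text>{}</text>".format(text))
        else:
            lines.append("FileCopyrightText: NONE")

    return "\n".join(lines) + "\n"
```

Because the function works on a plain dict, the same code path could serve both a fresh scan and an existing JSON file loaded with `json.load`, which is the advantage described in the comment above.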

pombredanne commented 7 years ago

@sschuberth This makes perfect sense :+1:

sschuberth commented 7 years ago

@pombredanne I've started to add some very basic SPDX tag / value output support based on https://github.com/spdx/tools-python/issues/9 to ScanCode. Locally, I've simply copied the spdx directory into ScanCode's src directory. But in order to submit my changes to ScanCode we need to come up with something better. I guess working on https://github.com/spdx/tools-python/issues/2 to publish the spdx-tools to PyPI is the way to go?

pombredanne commented 7 years ago

@sschuberth this is great! Let me push this to PyPI tomorrow. Unfortunately the spdx name has already been used ( https://pypi.python.org/pypi/spdx/ ), so I suggest using spdx-python as a name.

sschuberth commented 7 years ago

I guess you mean spdx-tools like mentioned in the linked issue? I at least would prefer that name.

pombredanne commented 7 years ago

Actually, you are correct: spdx-tools is already the name, and it is a better one.

pombredanne commented 7 years ago

@sschuberth actually this is done: https://pypi.python.org/pypi/spdx-tools/ :P

pombredanne commented 7 years ago

released on Pypi after a version bump to v0.3

sschuberth commented 7 years ago

Thanks @pombredanne, however I'm having problems taking the package into use. After adding spdx-tools to the # scancode and AboutCode section in setup.py I did

$ ./configure clean
$ ./configure

and I get an error from pip:

Could not find a version that satisfies the requirement spdx-tools

I also tried with an underscore instead of the dash:

Could not find a version that satisfies the requirement spdx_tools

Do I need to update a local pip cache or something?

Edit: Interestingly, directly running

$ pip install spdx-tools

did work.

pombredanne commented 7 years ago

@sschuberth the deal is that ScanCode is designed to run as-is direct from a checkout or archive and it must do so without fetching ANYTHING else over the network.

Therefore the configure script actually wraps calls to pip to ensure that no network connection is established and nothing is fetched remotely from Pypi at all.

This is why a plain pip install will work while configure may not unless you have properly vendored all the dependencies (transitively, all the way) in the thirdparty directory.
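To make the difference concrete, here is a small Python sketch of how a pip call can be restricted to locally vendored files. It is an assumption about the general approach, not the actual code of ScanCode's configure script. The helper name `offline_pip_install_args` is hypothetical; `--no-index` and `--find-links` are standard pip options that disable the package index and point pip at a local directory of archives.

```python
import sys


def offline_pip_install_args(requirement, links_dir="thirdparty/prod"):
    """Build a pip command line that never contacts a package index.

    `links_dir` defaults to the vendoring directory named in this thread.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",               # do not query PyPI or any other index
        "--find-links", links_dir,  # resolve only from local wheels/sdists
        requirement,
    ]
```

Passing the resulting list to `subprocess.check_call` would fail with "Could not find a version that satisfies the requirement" whenever the requirement, or one of its transitive dependencies, is missing from the local directory, which matches the error reported above.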

So what needs to be done is (and you have done some of it):

The .ABOUT files are used to eventually collect a proper inventory of third-party deps and to generate an attribution notice for all the deps using https://github.com/nexB/attributecode .... though I am not doing this yet for ScanCode proper, and I have also played/strayed a bit with experiments on not-yet-supported data formats for the .ABOUT files ....

That said you could use the Jinja dep as an example in https://github.com/nexB/scancode-toolkit/tree/develop/thirdparty/prod:

It may feel weird to commit binaries to a repo.... But this has many advantages in the context of ScanCode as a ready-to-run app... e.g. there is no build step needed to run, everything is self-contained and every dep is thoroughly tested exactly because it is vendored.

I hope this clears things up. If you are not comfy with this, I can add the spdx-tools dep in the develop branch alright.

sschuberth commented 7 years ago

Hmm, I do see the point in having a self-contained ScanCode out of the box. But doing all the checkbox items manually does not seem right to me. Can't we have a script around pip download that resolves all transitive dependencies and puts them to thirdparty/prod, commits them, and be done?

pombredanne commented 7 years ago

@sschuberth that would be a great helper.... and eventually it could invoke scancode to actually collect licenses and generate a .ABOUT file (which is essentially a YAML-formatted scan-like piece of data) .... but there is no such thing like that yet ...

To get the deps, I usually run something like this:

$ pip wheel spdx-tools --wheel-dir=thirdparty/prod

or rather, to also get the download URLs, use:

$ pip --verbose wheel spdx-tools --wheel-dir=thirdparty/prod | grep "Downloading from URL"

which would yield this relevant log:

  Downloading from URL https://pypi.python.org/packages/d9/85/d6ef92c78efd1440f42fe0e3df6ca1e838d8b75d9a249968778bc5c2040f/spdx_tools-0.3-py2.py3-none-any.whl#md5=29cdb6e3167742638d4d27c9cda7077e (from https://pypi.python.org/simple/spdx-tools/)
  Downloading from URL https://pypi.python.org/packages/9d/fa/4198e8d8b444a4ace5c8fd86d128c2faa210a6e281973c8e5e16d978eaf4/rdflib-4.2.1.tar.gz#md5=528adaa10536d14a608507d7831711f5 (from https://pypi.python.org/simple/rdflib/)
  Downloading from URL https://pypi.python.org/packages/a8/4d/487e12d0478ee0cbb15d6fe9b8916e98fe4e2fce4cc65e4de309209c0b24/ply-3.9.tar.gz#md5=c5c5767376eff902617fd9874f0c76b7 (from https://pypi.python.org/simple/ply/)
  Downloading from URL https://pypi.python.org/packages/f4/5b/fe03d46ced80639b7be9285492dc8ce069b841c0cebe5baacdd9b090b164/isodate-0.5.4.tar.gz#md5=9da3ea2af54a6261d854e73d2266030e (from https://pypi.python.org/simple/isodate/)
  Downloading from URL https://pypi.python.org/packages/2b/f7/e5a178fc3ea4118a0edce2a8d51fc14e680c745cf4162e4285b437c43c94/pyparsing-2.1.10-py2.py3-none-any.whl#md5=5e707ac42995e52ae06df4325ac07ebb (from https://pypi.python.org/simple/pyparsing/)
  Downloading from URL https://pypi.python.org/packages/03/3e/e2aa257943384125d1ef0490427da2cde9d5152e3341f1b3a50bdcce0f37/SPARQLWrapper-1.8.0.zip#md5=f1fc6d410f387610a254b83f2520f22e (from https://pypi.python.org/simple/sparqlwrapper/)
  Downloading from URL https://pypi.python.org/packages/f7/71/a96f36d34394bcfff9fb54bfe0aa72cc5b4ff2f803e5728645aef38f7aee/html5lib-0.999999999-py2.py3-none-any.whl#md5=c94780c55ea28529c39ad19ed372d629 (from https://pypi.python.org/simple/html5lib/)
  Downloading from URL https://pypi.python.org/packages/c8/0a/b6723e1bc4c516cb687841499455a8505b44607ab535be01091c0f24f079/six-1.10.0-py2.py3-none-any.whl#md5=3ab558cf5d4f7a72611d59a81a315dc8 (from https://pypi.python.org/simple/six/)
  Downloading from URL https://pypi.python.org/packages/69/19/b1dff551058ce79d88b1e3688f1c735590d7ddf44d10681512133b35019f/setuptools-32.3.1-py2.py3-none-any.whl#md5=9fe4e32f20a9b13c206c1bdc4c9feaf4 (from https://pypi.python.org/simple/setuptools/)
  Downloading from URL https://pypi.python.org/packages/c3/e5/74d05eed73b09752ac3dc4a8a69ae92ffa1ce92fcb03eaa624d1fcd17e33/webencodings-0.5.tar.gz#md5=878714d45241f7970dffd8991d61fff9 (from https://pypi.python.org/simple/webencodings/)

These can then be used to craft the .ABOUT files.
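As a hedged sketch of how such log lines could be turned into structured data for later .ABOUT authoring, the following Python snippet extracts the archive filename, download URL, MD5 checksum and index page from each "Downloading from URL" line. The function name `parse_download_log` and the keys of the returned dicts are made up for this example, and the regular expression assumes the exact line layout shown in the log above, which other pip versions may not produce.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches: Downloading from URL <url>#md5=<hex> (from <index page>)
LOG_LINE = re.compile(
    r"Downloading from URL (?P<url>\S+?)#md5=(?P<md5>[0-9a-f]{32}) "
    r"\(from (?P<index>\S+)\)"
)


def parse_download_log(text):
    """Return one dict per download line found in verbose pip output."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if not match:
            continue  # ignore unrelated log output
        url = match.group("url")
        records.append({
            # Last path segment of the URL, e.g. "ply-3.9.tar.gz".
            "filename": posixpath.basename(urlparse(url).path),
            "download_url": url,
            "md5": match.group("md5"),
            "index_url": match.group("index"),
        })
    return records
```

Each record then carries the details one would otherwise copy by hand from the log when writing the .ABOUT file for a vendored archive.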

The deps correspond roughly to this dependency resolution:

spdx_tools-0.3-py2.py3-none-any.whl requires:
    ply-3.9-py2.py3-none-any.whl
    rdflib-4.2.1-cp27-none-any.whl

which in turn require these:
        html5lib-0.999999999-py2.py3-none-any.whl
        isodate-0.5.4-cp27-none-any.whl
        pyparsing-2.1.10-py2.py3-none-any.whl
        setuptools-32.3.1-py2.py3-none-any.whl
        six-1.10.0-py2.py3-none-any.whl
        SPARQLWrapper-1.8.0-py2-none-any.whl
        webencodings-0.5-cp27-none-any.whl

This means it also performs a forced update on setuptools, which is stored in thirdparty/base rather than prod.
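The flattening of such a tree into the full set of packages to vendor can be sketched in Python as a simple graph traversal. This is only an illustration: the function name `transitive_deps` is hypothetical, the `requires` mapping has to be supplied by hand or by some other tool, and the package names in the usage example are placeholders, not the real dependency edges of spdx-tools.

```python
def transitive_deps(package, requires):
    """Return the sorted set of direct and indirect dependencies.

    `requires` maps a package name to the list of its direct dependencies;
    packages missing from the mapping are treated as having none.
    """
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for dep in requires.get(current, ()):
            if dep not in seen:  # also guards against dependency cycles
                seen.add(dep)
                stack.append(dep)
    return sorted(seen)


# Placeholder graph, for illustration only.
example = {
    "app": ["lib-a", "lib-b"],
    "lib-b": ["lib-c", "lib-d"],
    "lib-d": ["lib-c"],
}
flattened = transitive_deps("app", example)
```

The flattened list is what would need to be present in the vendoring directory for an offline install to succeed.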

And then I would go to create the license and ABOUT files (meaning eventually the code of each wheel needs to be scanned itself and the proper license and copyright determined/analyzed)

I can handle this in a snap, but for now nothing is automated (and here at least it sounds like there is no native code: if there were, then the wheels would have to be built on Linux, Mac and Windows, possibly using my build loops here: https://github.com/pombreda/thirdparty and https://github.com/pombreda/thirdparty-manylinux/ to help)

sschuberth commented 7 years ago

@pombredanne I already started doing this as an exercise. Expect another PR from me today ;-)

pombredanne commented 7 years ago

@sschuberth thanks mucho... I guess the painful part is rdflib and all its deps... I am not sure I enjoy RDF too much. I forgot to mention that I usually list all the transitive deps as requirements in setup.py

sschuberth commented 7 years ago

Thinking about it, are you also fine with me creating the .ABOUT files manually from inspecting the PyPI meta-data, or do I really need to run ScanCode on each package's source code?

pombredanne commented 7 years ago

@sschuberth Pypi metadata are fine for a start.

sschuberth commented 7 years ago

@pombredanne I was running into a few issues creating the ABOUT files for some packages (like SPARQLWrapper) because they offer no .whl download, yet the ABOUT file should describe the locally generated .whl file. I was not sure how to document that, so I've simply pushed the PR as it is now to improve it incrementally.

pombredanne commented 7 years ago

@sschuberth that works and I will update accordingly. When there are no prebuilt wheels (and when the built wheel includes native code), I also vendor the corresponding source distribution and use that as the download URL for the wheel .ABOUT file. In the case of native code, this ensures that the corresponding source code is safely kept too.

pombredanne commented 7 years ago

I think there is still a need for a few tests and maybe extra docs, so I am leaving this open for now.

pombredanne commented 7 years ago

#441 has been merged and fixed #436

pombredanne commented 7 years ago

Test added in #637 .... closing